LLM Self Confidence

23 min read

Before I begin, I must disclose that my expertise is in software engineering, game dev, and application security. I have a growing curiosity about and passion for ML, but I am still very much a beginner.


A Simple Notion?

While thinking a little about LLM hallucinations, I was curious how difficult it would be to measure the confidence an LLM has in its answer.

So where do I start? I'm a single mom of two active children, so my time is… limited. I needed to concoct a simple but accurate heuristic, and it turned out there was one hiding in plain sight.

The idea arose from this series of questions:

  • How do I add a “certainty” or “confidence” score into an LLM without being able to retrain or change the model?
  • If an AI truly “knows” something, shouldn’t it give fairly consistent answers if asked the same question?
  • Conversely, if the temperature on a model is non-zero, won’t the responses be more diverse if it is hallucinating an answer?

Well, you can probably already guess the notion from the above questions; my hypothesis was:

If you give the LLM the same question multiple times and evaluate the semantic similarity of the responses, that measure will correlate with the confidence the model has in its answer.

The Journey

So, I set out on a journey to give my local LLM a very simple sense of confidence and to naively test my hypothesis.

Practically speaking, to implement this algorithm, we would need to measure the semantic similarity between two or more synthesized responses to the same question. But before that, I think we would also need to systematically compensate for how the temperature and normalization constants shaped the final responses.
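
In rough pseudocode, the test I had in mind looks something like this. It is only a sketch: ask_llm and semantic_similarity are hypothetical placeholders, and the latter is exactly the part I hadn't worked out yet.

def confidence_score(question, num_samples=10, temperature=0.97):
    # Ask the model the same question several times at a non-zero temperature
    responses = [ask_llm(question, temperature) for _ in range(num_samples)]
    # Compare every unique pair of responses; consistent answers imply high confidence
    scores = [semantic_similarity(a, b)
              for i, a in enumerate(responses)
              for b in responses[i + 1:]]
    return sum(scores) / len(scores)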


I began to worry then about the scope and complexity of implementing semantic similarity. But as I was daydreaming about this, it occurred to me that we could just use sentence embeddings and/or contextual embeddings for this task. Relieved, I considered the remaining task: compensating for how the temperature and normalization constants shape the final output. For that, let me provide a little more context.

Temperature and normalization constants are applied at the final token selection step during text generation, right after the model computes its raw logits, but before sampling the next token. Logits are the raw, unnormalized scores that a neural network outputs before they get converted into normalized probabilities, commonly using a softmax function for transformer models.

More Details

More precisely, for GPT-2, the transformer produces raw logits logits_i for each possible next token. After that, the temperature scaling is applied:

$$scaled\_logits_i = logits_i / \tau$$

Next, softmax normalization converts these scaled logits into probabilities:

$$P(token_i) = \exp(scaled\_logits_i) / Z_\tau$$

where

$$Z_\tau = \sum_j \exp(logits_j / \tau)$$

is the normalization constant. Finally, a token is sampled from this calculated probability distribution.
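
As a tiny numeric illustration of that formula (not part of the project code, just NumPy), here is how the temperature reshapes the same raw logits:

import numpy as np

def softmax_with_temperature(logits, tau):
    # scaled_logits_i = logits_i / tau, then softmax normalization
    scaled = np.asarray(logits) / tau
    exp = np.exp(scaled - scaled.max())  # subtract the max for numerical stability
    return exp / exp.sum()

logits = [4.0, 2.0, 1.0]
print(softmax_with_temperature(logits, tau=0.5))  # sharp: most of the mass piles onto the top token
print(softmax_with_temperature(logits, tau=2.0))  # flat: the probabilities move closer together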

Crucially here, we don’t need to extract the original tokens that would have been chosen before temperature scaling was applied. If that were the case, I think it would be impossible since sampling is stochastic? Instead, at least for GPT-2 (and some other models), we can extract the token probabilities themselves from the model’s forward pass while it generates a response and then reverse the process.


Self Confidence : Putting The Theory Into Practice

Instead of a binary true/false heuristic for detecting hallucinations, I decided to score on a confidence gradient. So I expect the script to say "I don't know.", "I'm confident that I know this", or even "I'm not sure about this, I recall XYZ but you should check it." when the model is uncertain. With all of the unknowns now known, I was ready to test my theory.
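
Concretely, the gradient ends up as a handful of thresholds on the consensus-similarity score, something like the sketch below (illustrative values; the full mapping lives in _final_answer in the script at the end):

def phrase_for(similarity, best_answer):
    # Map a consensus-similarity score onto a natural-language confidence statement
    if similarity > 0.95:
        return f"I'm very confident the answer is: {best_answer}"
    if similarity > 0.7:
        return f"I'm fairly sure, but please check: {best_answer}"
    if similarity > 0.5:
        return f"I'm uncertain about this, so make sure to check, but I think the answer is: {best_answer}"
    return "I don't know the answer to this."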

So I started implementing, and the first thing I found was that GPT-2 hallucinated 100% of the time. I needed to rethink my setup. After some searching, I decided to use Google's Flan-T5 model. It also supports probability extraction, and it is better suited to answering questions. The large version of this model requires about 1 GB of RAM and runs fine on CPU. Most importantly, it worked on my old PC.

Step 1 : Compensate for temperature

As mentioned earlier, for each generated token we take its probability (as extracted by compute_transition_scores), take the logarithm, and multiply by the temperature. This should give us a value proportional to the model’s original confidence in that token before the temperature was applied:

$$logit_i = \tau \cdot \log(P(token_i)) + constant$$

Then by averaging these values across all tokens in the response, we should get a robust, temperature-independent confidence score for the whole answer. The code for this is below (... means code is removed for brevity):

# Call into our generation function:
...
...
responses, token_probs = self.generate_multiple_responses(
    question, num_samples, max_new_tokens, temperature
)
...
...
# Inside generate_multiple_responses, we extract the probabilities:
with torch.no_grad():
    outputs = self.model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        do_sample=True,
        return_dict_in_generate=True,
        output_scores=True, # We need to set this to true
        pad_token_id=self.tokenizer.eos_token_id if self.tokenizer.eos_token_id is not None else self.tokenizer.pad_token_id
    )
generated_text = self.tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
self.log(f"  Raw decoded output: {generated_text}", level=2)
responses.append(generated_text.strip())
generated_ids = outputs.sequences
transition_scores = self.model.compute_transition_scores( # <- here :)
    outputs.sequences, outputs.scores, normalize_logits=True
)
probs = np.exp(transition_scores.numpy())
all_token_probs.append(probs.flatten().tolist())
...
...
# And back in the parent function, after they are returned by 
# generate_multiple_responses as token_probs we pass them and 
# the relative temperatures into compensate_temperature_effects
temperatures = [temperature] * num_samples
confidences = self.compensate_temperature_effects(token_probs, temperatures)
...
...
# Where compensate_temperature_effects just applies the maths 
# we discussed earlier:
    def compensate_temperature_effects(self, token_probs_list, temperatures):
        compensated_sequences = []
        for token_probs, temp in zip(token_probs_list, temperatures):
            if token_probs is None or len(token_probs) == 0:
                compensated_sequences.append(-10.0)
                continue
            sequence_confidence = []
            for prob in token_probs:
                if prob > 1e-10:
                    relative_logit = temp * np.log(prob)
                    sequence_confidence.append(relative_logit)
                else:
                    sequence_confidence.append(-100)
            if sequence_confidence:
                avg_confidence = np.mean(sequence_confidence)
            else:
                avg_confidence = -10.0
            compensated_sequences.append(avg_confidence)
        return compensated_sequences
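
As a quick sanity check of the math, an answer whose per-token probabilities are all near 1.0 should compensate to a score near zero, while a guessed answer should land well below it. A toy check, assuming detector is an instance of the class containing the method above:

confident_tokens = [0.98, 0.95, 0.99]  # the model was nearly certain at every step
shaky_tokens = [0.30, 0.12, 0.45]      # the model was guessing at every step
scores = detector.compensate_temperature_effects(
    [confident_tokens, shaky_tokens], [0.97, 0.97]
)
print(scores)  # roughly [-0.03, -1.33]; closer to zero means more confident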

Step 2 : Semantic Similarity

For our semantic comparison, I started out with a simple MiniLM mean-cosine check. However, while it was quite fast, I was disappointed by its accuracy. So after some more reading, I chose a cross-encoder pairwise-scoring approach, and it turns out all of the hard work was already done for this, so adding it was only a couple of lines of code, similar to this:

 from sentence_transformers import CrossEncoder
 cross_encoder = CrossEncoder('cross-encoder/stsb-roberta-base')
 score = cross_encoder.predict([("one sentence", "two sentence")])
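
For reference, the MiniLM baseline I started with was roughly the following (a sketch, not the exact code I ran): it embeds every response and averages the cosine similarity over all unique pairs.

from itertools import combinations
from sentence_transformers import SentenceTransformer, util

minilm = SentenceTransformer('all-MiniLM-L6-v2')

def mean_cosine_similarity(responses):
    # Embed each response, then average cosine similarity over every unique pair
    embeddings = minilm.encode(responses, convert_to_tensor=True)
    scores = [util.cos_sim(embeddings[i], embeddings[j]).item()
              for i, j in combinations(range(len(responses)), 2)]
    return sum(scores) / len(scores) if scores else 1.0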

Step 3 : How Confident Are You? … Ask a friend?

Before we connect all of the parts of our implementation, we need to decide how to select one answer from the set once we have made our confidence determination about it.

But which one should we choose? We have a set of answers and they are all a bit different due to the temperature effect on the output. Naively picking the first or most frequent answer might be misleading, especially when the model produces outliers. 

Instead, I used a clustering approach. Specifically, I applied K-means clustering (I wrote about K-means before here) to the semantic embeddings of all generated responses, which grouped semantically similar answers together. Then I identified the largest cluster of answers, hopefully the one with the most agreement. After that, I used the geometry of the embeddings: I calculated the centroid of that cluster and took the answer closest to that point as our chosen answer.

I think that's a fairly elegant approach. It should select a consensus response that best reflects the model's dominant "belief", rather than being swayed by outliers or noise. At least I think so, if my intuition and mental model for embeddings are accurate. The code for this is:

    def select_consensus_response_kmeans(self, responses, k=2):
        """
        Cluster responses using K-means and select the response closest to the centroid
        of the largest cluster (consensus).
        """
        if len(responses) == 0:
            return "", []
        if len(responses) == 1:
            return responses[0], responses

        # Embed responses
        embeddings = self.similarity_model.encode(responses)
        # Run K-means
        k = min(k, len(responses))  # can't have more clusters than responses
        kmeans = KMeans(n_clusters=k, random_state=42, n_init='auto')
        labels = kmeans.fit_predict(embeddings)
        # Find the largest cluster
        unique, counts = np.unique(labels, return_counts=True)
        largest_cluster = unique[np.argmax(counts)]
        cluster_indices = [i for i, label in enumerate(labels) if label == largest_cluster]
        cluster_embeddings = embeddings[cluster_indices]
        cluster_responses = [responses[i] for i in cluster_indices]
        # Find the response closest to the centroid
        centroid = kmeans.cluster_centers_[largest_cluster]
        dists = np.linalg.norm(cluster_embeddings - centroid, axis=1)
        best_idx = np.argmin(dists)
        best_response = cluster_responses[best_idx]
        return best_response, cluster_responses
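
Hypothetical usage, assuming detector is a loaded instance with the MiniLM similarity_model attached as in the full script:

responses = ["Harper Lee", "Harper Lee wrote it", "William Faulkner", "Harper Lee"]
best, cluster = detector.select_consensus_response_kmeans(responses, k=2)
print(best)     # expected to be one of the "Harper Lee" variants, since they form the larger cluster
print(cluster)  # all members of that largest (consensus) cluster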

Some Runtime Logs

I integrated the remaining scaffolding for our hypothesis test code, and fed it some questions. Here are the responses:

============================================================
Testing: What is 2 + 2?
thinking...: 100%|█████████████████████████████████████████████| 10/10

Final Answer: I don't know the answer to this. 
(Bees Note: It really didn't know the answer to this.)
============================================================

============================================================
Testing: Describe the plot of the movie 'The Quantum Paradox' from 1987.
thinking...: 100%|█████████████████████████████████████████████| 10/10

Final Answer: I don't know the answer to this.
(Bees Note: This is correct, there is no such movie.)
============================================================

============================================================
Testing: Explain the discovery of the element 'Fictonium' by Dr. John Madeupname in the year 1337.
thinking...: 100%|█████████████████████████████████████████████| 10/10

Final Answer: I don't know the answer to this.
(Bees Note: This is correct, there is no such element.)
============================================================

============================================================
Testing: What are the main ingredients in the traditional dish 'Flibbernaught Stew' from medieval England?
thinking...: 100%|█████████████████████████████████████████████| 10/10

Final Answer: I don't know the answer to this.
(Bees Note: This is correct, there is no such stew.)
============================================================

============================================================
Testing: What is the capital of France?
thinking...: 100%|█████████████████████████████████████████████| 10/10

Final Answer: I'm very confident the answer is: Paris
============================================================

============================================================
Testing: Tell me about the life of Dr. Zelinda Farthingbottom, the famous 19th-century astronomer.
thinking...: 100%|█████████████████████████████████████████████| 10/10

Final Answer: I don't know the answer to this.
(Bees Note: This is correct, there is no such Dr.)
============================================================

============================================================
Testing: Who wrote the novel 'To Kill a Mockingbird'?
thinking...: 100%|█████████████████████████████████████████████| 10/10

Final Answer: I don't know the answer to this.
(Bees Note: This is correct, it really was unsure, these were 
its guesses:
Responses:
  1. (temp=0.97 conf=-1.076538428502998)
     Faulkner
  2. (temp=0.97 conf=-0.2628941149206578)
     Harper Lee
  3. (temp=0.97 conf=-0.2628941149206578)
     Harper Lee
  4. (temp=0.97 conf=-0.598906322145417)
     A. J. Paltrow
  5. (temp=0.97 conf=-1.07315808609994)
     Montgomery Clift
  6. (temp=0.97 conf=-0.2628941149206578)
     Harper Lee
  7. (temp=0.97 conf=-1.3127088368370317)
     Tom Hanks
  8. (temp=0.97 conf=-0.7763668985525249)
     Robert Louis Stevenson
  9. (temp=0.97 conf=-0.8343997280612352)
     f scott fitzgerald
  10. (temp=0.97 conf=-0.8507150286995726)
     J.M. Barrie
)
============================================================

============================================================
Testing: What is the chemical formula for water?
thinking...: 100%|█████████████████████████████████████████████| 10/10

Final Answer: I don't know the answer to this.
(Bees Note: Again this is correct, I was surprised it didn't know this.
Its guesses were:
Responses:
  1. (temp=0.97 conf=-0.45178970254642453)
     H2O
  2. (temp=0.97 conf=-1.0207007067520621)
     H 2 O
  3. (temp=0.97 conf=-0.9457521845302133)
     wt
  4. (temp=0.97 conf=-1.9991200647992708)
     w h w
  5. (temp=0.97 conf=-2.413021369189594)
     pH
  6. (temp=0.97 conf=-0.45968350047439777)
     w
  7. (temp=0.97 conf=-0.9560885998466766)
     OH
  8. (temp=0.97 conf=-0.8657086350604895)
     h2o
  9. (temp=0.97 conf=-0.45968350047439777)
     w
  10. (temp=0.97 conf=-0.9560885998466766)
     OH
)
============================================================

Well then, is this worth further study?

Well, this was a very casual experiment. I didn't follow a formal method, so it's hard to tell for sure ☺️. My intuition says… maybe? While my approach here is naive, I think AIs will need a measure of self-confidence in order to drive further learning, exploration, and curiosity in the future.

There are edge cases where my heuristics fell short, and the model I chose wasn't able to answer many questions correctly. For example, for the question "What is 2 + 2?", in one run the model was very confident that the answer was "2 + 2". Technically that is a correct answer, but I think a human would infer that the asker wanted the answer 4. So I'm unsure whether it's worth pursuing and developing a more formal thesis from here.

Some of the hallucinations were quite funny to watch though, with delights such as Flibbernaught Stew made from 100% butter (mmmm, delicious…), or hog fat, onions, onions, and salt and pepper (double onions for good measure, cooked in hog fat no less).

Question: What are the main ingredients in the traditional dish 
          'Flibbernaught Stew' from medieval England?
Detection: UNCERTAIN
Risk Level: MEDIUM
Avg Similarity: 0.209
Avg Confidence: -2.182
Temperature: 0.97
Individual Similarities: ['0.534', '0.473', '0.154', '0.113', '0.419', '0.188', '0.344', '0.157', '0.104', '0.527', '0.030', '0.056', '0.298', '0.245', '0.278', '0.339', '0.053', '0.373', '0.103', '0.182', '0.538', '0.353', '0.165', '0.058', '0.061', '0.329', '0.313', '0.033', '0.195', '0.047', '0.014', '0.156', '0.260', '0.032', '0.636', '0.123', '0.056', '0.283', '0.073', '0.356', '0.105', '0.075', '0.047', '0.081', '0.060']
Individual Confidences: ['-1.719', '-1.955', '-2.807', '-2.469', '-1.601', '-2.985', '-1.859', '-1.030', '-2.059', '-3.337']
Responses:
  1. (temp=0.97 conf=-1.7185745795482497)
     kiln dried beans, carrots, onion, salt
  2. (temp=0.97 conf=-1.9549108758864049)
     hog fat, onions, onions, and salt and pepper
  3. (temp=0.97 conf=-2.806552017047853)
     beef fat, onions
  4. (temp=0.97 conf=-2.468553621657526)
     beef meatloaf root
  5. (temp=0.97 conf=-1.600746207479049)
     flour, water, milk, butter
  6. (temp=0.97 conf=-2.985059213336243)
     salt pork
  7. (temp=0.97 conf=-1.8594504321931333)
     apple cider, beef stock, cream, bay leaf, onions
  8. (temp=0.97 conf=-1.030147153618215)
     flour, onions, thyme, and apricots
  9. (temp=0.97 conf=-2.058654045511679)
     sage, black pepper, salt, black worms and beef thigh
  10. (temp=0.97 conf=-3.337035475555869)
     butter

Conclusion

I had fun exploring this topic, and thinking/reading/programming through it. I think it would be interesting to more formally explore the notion of what makes up certainty, and how that might drive a model or agent to explore and learn in an autodidactic way. If you made it all the way through, thanks for reading and sharing this journey with me. Until next time, my love <3.

Code

It's messy sketch code, but here it is if you want to experiment and build on the idea yourself:


#!/usr/bin/env python3
"""
A stupidly simple approach to implementing a self-confidence for 
LLMs using multi-response semantic similarity

Only python 3.12 works (because sentencepiece breaks for 3.13):
pyenv install 3.12.7
pyenv global 3.12.7

python -m venv self_confidence
source self_confidence/bin/activate
# deactivate to exit

pip install torch transformers sentence-transformers sentencepiece numpy

Usage:
python llm_self_confidence.py
"""

import numpy as np
import torch
import random
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSeq2SeqLM, T5Tokenizer, T5ForConditionalGeneration
from sentence_transformers import SentenceTransformer, CrossEncoder
import warnings
import os
import sys
from enum import Enum
from sklearn.cluster import KMeans
import time
from tqdm import tqdm

warnings.filterwarnings("ignore")
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# === GLOBAL CONSTANTS ===
SIMILARITY_THRESHOLD_CORRECT = 0.76
# Note, I had to play with this a bit to get a value that worked well. It needs to be fairly 
# high variance to generate randomness in correct answers too
DEFAULT_TEMPERATURE = 0.97 
DEFAULT_CONFIDENCE_THRESHOLD = -3.0
TRUNCATION_LENGTH = 180
GENERATION_ITERATIONS = 10
HIGH_FICTIONAL_THRESHOLD = 0.95
VERBOSITY = 1  # 0 = silent, 1 = summary, 2 = debug

class RiskLevel(Enum):
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"
    UNKNOWN = "UNKNOWN"

class DetectionResult(Enum):
    CONFIDENT = "CONFIDENT"
    UNCERTAIN = "UNCERTAIN"
    LOW_CONFIDENCE = "LOW_CONFIDENCE"
    ERROR = "ERROR"

class SelfConfidenceDetector:
    def __init__(self, model_name="google/flan-t5-large", cross_encoder_model="cross-encoder/stsb-roberta-base"):
        self.verbosity = VERBOSITY
        print(f"Loading model: {model_name}")
        try:
            self.tokenizer = T5Tokenizer.from_pretrained(model_name)
            self.model = T5ForConditionalGeneration.from_pretrained(model_name)
            print(f"Loading cross-encoder: {cross_encoder_model} ...")
            self.cross_encoder = CrossEncoder(cross_encoder_model)
            print("Loading embedding model for clustering: all-MiniLM-L6-v2 ...")
            self.similarity_model = SentenceTransformer('all-MiniLM-L6-v2')
            print("✓ Model(s) loaded successfully!")
        except Exception as e:
            print(f"✗ Error loading model: {e}")
            raise
    
    @classmethod
    def set_expected_similarities(cls, expected_dict):
        cls.expected_similarity_by_temp = expected_dict

    def get_expected_similarity(self, temperature):
        return self.expected_similarity_by_temp.get(temperature, 1.0)

    def log(self, msg, level=1, end="\n"):
        if self.verbosity >= level:
            print(msg, end=end)
            sys.stdout.flush()

    def get_token_probs(self, prompt, generated_ids):
        """
        Given a prompt and generated_ids (tensor of generated token ids),
        compute per-token probabilities using a forward pass.

        How this works:
        1. For each generated response, we take the prompt and the generated sequence.
        2. We run a forward pass through the model with the prompt as input and the generated sequence (shifted right) as decoder input.
        3. The model outputs logits (unnormalized scores) for each possible token at each position in the sequence.
        4. We apply the softmax function to the logits to obtain probabilities for each token in the vocabulary at each position.
        5. For each position, we extract the probability assigned to the actual generated token at that position (i.e., the probability the model assigned to the token it chose).
        6. The result is a list of probabilities, one for each token in the generated response, representing the model's confidence in its choice at each step.

        Returns a list of probabilities for each generated token.
        """
        input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids
        decoder_input_ids = generated_ids[:, :-1]
        labels = generated_ids[:, 1:]
        with torch.no_grad():
            outputs = self.model(
                input_ids=input_ids,
                decoder_input_ids=decoder_input_ids,
            )
            logits = outputs.logits  # shape: [1, seq_len, vocab_size]
            probs = torch.softmax(logits, dim=-1)
            token_probs = []
            for i in range(labels.shape[1]):
                token_id = labels[0, i].item()
                prob = probs[0, i, token_id].item()
                token_probs.append(prob)
        return list(token_probs)

    def generate_multiple_responses(self, question, num_samples=GENERATION_ITERATIONS, max_new_tokens=100, temperature=DEFAULT_TEMPERATURE):
        responses = []
        all_token_probs = []
        self.log(f"Generating {num_samples} responses with temperature {temperature}...", level=2)
        prompt = self.prompt_wrapper(question)
        for i in range(num_samples):
            try:
                inputs = self.tokenizer(
                    prompt,
                    return_tensors="pt",
                    padding=True
                )
                input_ids = inputs["input_ids"]
                attention_mask = inputs["attention_mask"]
                with torch.no_grad():
                    outputs = self.model.generate(
                        input_ids=input_ids,
                        attention_mask=attention_mask,
                        max_new_tokens=max_new_tokens,
                        temperature=temperature,
                        do_sample=True,
                        return_dict_in_generate=True,
                        output_scores=True,
                        pad_token_id=self.tokenizer.eos_token_id if self.tokenizer.eos_token_id is not None else self.tokenizer.pad_token_id
                    )
                generated_text = self.tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
                self.log(f"  Raw decoded output: {generated_text}", level=2)
                responses.append(generated_text.strip())
                generated_ids = outputs.sequences
                transition_scores = self.model.compute_transition_scores(
                    outputs.sequences, outputs.scores, normalize_logits=True
                )
                probs = np.exp(transition_scores.numpy())
                all_token_probs.append(probs.flatten().tolist())
                self.log(f"  Response {i+1} (temp={temperature:.2f}): {generated_text[:150]}...", level=2)
            except Exception as e:
                self.log(f"  ✗ Error generating response {i+1}: {e}", level=2)
                responses.append("")
                all_token_probs.append([])
        return responses, all_token_probs
    
    def compensate_temperature_effects(self, token_probs_list, temperatures):
        """
        Remove temperature bias to get model's true confidence patterns
        
        Mathematical foundation:
        Forward: P(token_i) = exp(logit_i / τ) / Σ_j exp(logit_j / τ)
        Reverse: logit_i = τ * log(P(token_i)) + constant
        Note: The constant is ignored here because only relative confidence is used, not the absolute logit values.

        Args:
            token_probs_list: List of token probability sequences
            temperatures: List of temperatures used
            
        Returns:
            compensated_confidences: List of temperature-compensated confidence scores
        """
        compensated_sequences = []
        for token_probs, temp in zip(token_probs_list, temperatures):
            if token_probs is None or len(token_probs) == 0:
                compensated_sequences.append(-10.0)
                continue
            sequence_confidence = []
            for prob in token_probs:
                if prob > 1e-10:
                    relative_logit = temp * np.log(prob)
                    sequence_confidence.append(relative_logit)
                else:
                    sequence_confidence.append(-100)
            if sequence_confidence:
                avg_confidence = np.mean(sequence_confidence)
            else:
                avg_confidence = -10.0
            compensated_sequences.append(avg_confidence)
        return compensated_sequences

    def prompt_wrapper(self, question):
        system_prompt = (
            "You are a helpful and accurate assistant. "
            "Answer the following question as concisely and factually as possible. "
            "The question is:\n"
            "{question}\n"
        )
        return system_prompt.format(question=question)
    
    def measure_semantic_similarity(self, responses):
        """
        Calculate semantic similarity between all pairs of responses using a cross-encoder.
        For each unique pair of valid responses, the cross-encoder predicts a similarity score (typically in [0, 1]).
        Returns:
            similarities: List of pairwise similarity scores
            avg_similarity: Average similarity across all pairs
        """
        valid_responses = [r for r in responses if r.strip()]
        if len(valid_responses) < 2:
            return [1.0], 1.0
        try:
            pairs = []
            for i in range(len(valid_responses)):
                for j in range(i + 1, len(valid_responses)):
                    pairs.append((valid_responses[i], valid_responses[j]))
            scores = self.cross_encoder.predict(pairs)
            similarities = list(scores)
            avg_similarity = np.mean(similarities) if len(similarities) > 0 else 1.0
            return similarities, avg_similarity
        except Exception as e:
            self.log(f"Error calculating semantic similarity: {e}", level=1)
            return [0.5], 0.5

    # Store expected similarities for each temperature (to be set empirically)
    expected_similarity_by_temp = {}



    def ponder(self, question, num_samples=GENERATION_ITERATIONS, max_new_tokens=100, temperature=DEFAULT_TEMPERATURE,
                           similarity_threshold=SIMILARITY_THRESHOLD_CORRECT, confidence_threshold=DEFAULT_CONFIDENCE_THRESHOLD):
        """
        This method measures model confidence and response consistency, not factuality.
        It is best interpreted as a measure of how certain and consistent the model is in its answers, not whether those answers are true.
        Core algorithm:
        1. Generate multiple responses with the same temperature
        2. Extract token probabilities and compensate for temperature effects  
           (We extract per-token probabilities for each generated response using generate_multiple_responses(),
            then pass these to compensate_temperature_effects() to reverse the temperature scaling and obtain a temperature-independent confidence score for each response.)
        3. Measure semantic similarity between responses
        4. Classify based on similarity and confidence thresholds
        5. Also report an adjusted difference score (semantic similarity minus temperature)
        6. Also report a temperature-normalized similarity score: (Observed Similarity / Expected Similarity at T)
        Args:
            question: Input question/prompt
            num_samples: Number of responses to generate
            max_new_tokens: Maximum tokens per response
            similarity_threshold: Threshold for semantic similarity
            confidence_threshold: Threshold for confidence score
        Returns:
            result: Dictionary with detection results
        """
        self.log_section(f"Testing: {question}")
        try:
            responses, token_probs = self.generate_multiple_responses(
                question, num_samples, max_new_tokens, temperature
            )
            temperatures = [temperature] * num_samples
            confidences = self.compensate_temperature_effects(token_probs, temperatures)
            similarities, _ = self.measure_semantic_similarity(responses)
            avg_similarity = np.mean(similarities) if len(similarities) > 0 else 1.0
            avg_confidence = np.mean(confidences) if len(confidences) > 0 else -10.0
            adjusted_difference_score = avg_similarity - temperature
            expected_sim = self.get_expected_similarity(temperature)
            normalized_similarity = avg_similarity / expected_sim if expected_sim > 0 else 0.0
            if avg_similarity < similarity_threshold:
                if avg_confidence < confidence_threshold:
                    detection_result = DetectionResult.LOW_CONFIDENCE
                    risk_level = RiskLevel.HIGH
                else:
                    detection_result = DetectionResult.UNCERTAIN
                    risk_level = RiskLevel.MEDIUM
            else:
                detection_result = DetectionResult.CONFIDENT
                risk_level = RiskLevel.LOW
            result = {
                'question': question,
                'detection_result': detection_result,
                'risk_level': risk_level,
                'avg_similarity': avg_similarity,
                'avg_confidence': avg_confidence,
                'adjusted_difference_score': adjusted_difference_score,
                'normalized_similarity': normalized_similarity,
                'expected_similarity': expected_sim,
                'individual_similarities': similarities,
                'individual_confidences': confidences,
                'responses': responses,
                'temperatures': temperatures,
                'temperature': temperature,
                'success': True
            }
        except Exception as e:
            print(f"!! Error during detection: {e}")
            if "truth value of an array with more than one element is ambiguous" in str(e):
                try:
                    print("[DEBUG] similarities type:", type(similarities))
                    print("[DEBUG] similarities value:", similarities)
                except Exception as e2:
                    print("[DEBUG] similarities not available:", e2)
                try:
                    print("[DEBUG] confidences type:", type(confidences))
                    print("[DEBUG] confidences value:", confidences)
                except Exception as e2:
                    print("[DEBUG] confidences not available:", e2)
            result = {
                'question': question,
                'detection_result': DetectionResult.ERROR,
                'risk_level': RiskLevel.UNKNOWN,
                'avg_similarity': 0.0,
                'avg_confidence': 0.0,
                'adjusted_difference_score': 0.0,
                'normalized_similarity': 0.0,
                'expected_similarity': 1.0,
                'individual_similarities': [],
                'individual_confidences': [],
                'responses': [],
                'temperatures': [temperature] * num_samples,
                'temperature': temperature,
                'success': False,
                'error': str(e)
            }
        self._print_results(result)
        return result

    def log_section(self, title):
        if self.verbosity > 0:
            self.log(f"\n{'='*60}\n{title}", level=1)
            if self.verbosity == 2:
                self.log(f"{'='*60}", level=2)

    def _progress_bar(self, total, desc="thinking..."):
        if self.verbosity == 1:
            for _ in tqdm(range(total), desc=desc, ncols=70, bar_format='{l_bar}{bar}| {n_fmt}/{total_fmt}'):
                time.sleep(0.1)

    def _print_results(self, result):
        if not result['success']:
            self.log(f"\u2717 Detection failed: {result.get('error', 'Unknown error')}", level=1)
            return
        if self.verbosity == 2:
            self.log(f"\nResults (Model Consistency/Confidence, not Factuality):", level=2)
            self.log(f"  Detection: {result['detection_result'].value}", level=2)
            self.log(f"  Risk Level: {result['risk_level'].value}", level=2)
            self.log(f"  Avg Semantic Similarity: {result['avg_similarity']:.3f}", level=2)
            self.log(f"  Avg Model Confidence: {result['avg_confidence']:.3f}", level=2)
            self.log(f"  Adjusted Difference Score: {result['adjusted_difference_score']:.3f}", level=2)
            self.log(f"  Temperature-Normalized Similarity: {result['normalized_similarity']:.3f}", level=2)
            if result['individual_similarities']:
                self.log(f"  Individual Similarities: {[f'{s:.3f}' for s in result['individual_similarities']]}", level=2)
            if result['individual_confidences']:
                self.log(f"  Individual Confidences: {[f'{c:.3f}' for c in result['individual_confidences']]}", level=2)
            self.log(f"\nGenerated Responses:", level=2)
            for i, (resp, temp, conf) in enumerate(zip(result['responses'], result['temperatures'], result['individual_confidences'])):
                self.log(f"  {i+1}. (temp={temp:.2f}, conf={conf:.3f}): {resp[:80]}...", level=2)
        elif self.verbosity == 1:
            self._progress_bar(GENERATION_ITERATIONS)
        final_answer = self._final_answer(result)
        self.log(f"\nFinal Answer: {final_answer}", level=1)
        if self.verbosity == 1:
            self.log(f"{'='*60}", level=1)

    def _final_answer(self, result):
        similarity = result['avg_similarity']
        try:
            best_answer, _ = self.select_consensus_response_kmeans(result['responses'], k=2)
        except Exception as e:
            best_answer = f"(Error selecting consensus answer: {e})"
        if result['detection_result'] == DetectionResult.LOW_CONFIDENCE:
            return "I'm not confident in this answer."
        elif similarity > 0.95:
            return f"I'm very confident the answer is: {best_answer}"
        elif similarity > 0.85:
            return f"I think the answer is: {best_answer}"
        elif similarity > 0.7:
            return f"I'm fairly sure, but please check: {best_answer}"
        elif similarity > 0.5:
            return f"I'm uncertain about this, so make sure to check, but I think the answer is: {best_answer}"
        else:
            return "I don't know the answer to this."

    def batch_test(self, questions, **kwargs):
        results = []
        for question in questions:
            result = self.ponder(question, **kwargs)
            results.append(result)
        
        self.log_section("BATCH TEST SUMMARY")
        
        successful_results = [r for r in results if r['success']]
        
        for i, result in enumerate(results):
            status = "\u2713" if result['success'] else "\u2717"
            self.log(f"{status} {i+1}. {result['detection_result'].value} - {result['question'][:TRUNCATION_LENGTH]}...", level=1)
        
        if successful_results:
            self.log(f"\nSuccess rate: {len(successful_results)}/{len(results)} ({len(successful_results)/len(results)*100:.1f}%)", level=1)
        
        return results
    
    def cleanup(self):
        if hasattr(self, 'model'):
            del self.model
        if hasattr(self, 'tokenizer'):
            del self.tokenizer
        if hasattr(self, 'cross_encoder'):
            del self.cross_encoder
        
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        
        self.log("✓ Resources cleaned up", level=1)

    def pretty_print_result(self, result):
        self.log(f"\n{'-'*60}", level=1)
        self.log(f"Question: {result['question']}", level=1)
        self.log(f"Detection: {result['detection_result'].value if hasattr(result['detection_result'], 'value') else result['detection_result']}", level=1)
        self.log(f"Risk Level: {result['risk_level'].value if hasattr(result['risk_level'], 'value') else result['risk_level']}", level=1)
        self.log(f"Avg Similarity: {result['avg_similarity']:.3f}", level=1)
        self.log(f"Avg Confidence: {result['avg_confidence']:.3f}", level=1)
        if result.get('temperature'):
            self.log(f"Temperature: {result['temperature']}", level=1)
        if result.get('individual_similarities'):
            self.log(f"Individual Similarities: {[f'{s:.3f}' for s in result['individual_similarities']]}", level=1)
        if result.get('individual_confidences'):
            self.log(f"Individual Confidences: {[f'{c:.3f}' for c in result['individual_confidences']]}", level=1)
        self.log(f"Responses:", level=1)
        for i, resp in enumerate(result.get('responses', [])):
            temp = result['temperatures'][i] if 'temperatures' in result and i < len(result['temperatures']) else None
            conf = result['individual_confidences'][i] if 'individual_confidences' in result and i < len(result['individual_confidences']) else None
            self.log(f"  {i+1}. (temp={temp if temp is not None else '?'} conf={conf if conf is not None else '?'})", level=1)
            self.log(f"     {resp}", level=1)
        self.log(f"{'-'*60}\n", level=1)

    def select_consensus_response_kmeans(self, responses, k=2):
        """
        Cluster responses using K-means and select the response closest to the centroid
        of the largest cluster (consensus).
        """
        if len(responses) == 0:
            return "", []
        if len(responses) == 1:
            return responses[0], responses

        # Embed responses
        embeddings = self.similarity_model.encode(responses)
        # Run K-means
        k = min(k, len(responses))  # can't have more clusters than responses
        kmeans = KMeans(n_clusters=k, random_state=42, n_init='auto')
        labels = kmeans.fit_predict(embeddings)
        # Find the largest cluster
        unique, counts = np.unique(labels, return_counts=True)
        largest_cluster = unique[np.argmax(counts)]
        cluster_indices = [i for i, label in enumerate(labels) if label == largest_cluster]
        cluster_embeddings = embeddings[cluster_indices]
        cluster_responses = [responses[i] for i in cluster_indices]
        # Find the response closest to the centroid
        centroid = kmeans.cluster_centers_[largest_cluster]
        dists = np.linalg.norm(cluster_embeddings - centroid, axis=1)
        best_idx = np.argmin(dists)
        best_response = cluster_responses[best_idx]
        return best_response, cluster_responses

def get_token_probs(model, tokenizer, prompt, generated_ids):
    # prompt: str, generated_ids: tensor of token ids (the generated answer)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    # prepare decoder input ids (shifted right)
    decoder_input_ids = generated_ids[:, :-1]
    labels = generated_ids[:, 1:]

    with torch.no_grad():
        outputs = model(
            input_ids=input_ids,
            decoder_input_ids=decoder_input_ids,
        )
        logits = outputs.logits  # shape: [1, seq_len, vocab_size]
        probs = torch.softmax(logits, dim=-1)
        # for each position, get the probability of the actual next token
        token_probs = []
        for i in range(labels.shape[1]):
            token_id = labels[0, i].item()
            prob = probs[0, i, token_id].item()
            token_probs.append(prob)
    return list(token_probs)  # Ensure it's always a Python list

def main():
    
    confidence_detector = None
    try:
        print("Initializing Self-Confidence...")
        confidence_detector = SelfConfidenceDetector(
            model_name="google/flan-t5-large",
        )
        confidence_detector.log_section("LLM SELF-CONFIDENCE")
        
        #########################################################
        # these are just the test questions - a little mix of fact and fiction :)

        factual_questions = [
            "What is the capital of France?",
            "What is 2 + 2?",
            "Who wrote the novel 'To Kill a Mockingbird'?",
            "What is the chemical formula for water?",
        ]

        fictional_questions = [
            "Tell me about the life of Dr. Zelinda Farthingbottom, the famous 19th-century astronomer.",
            "What are the main ingredients in the traditional dish 'Flibbernaught Stew' from medieval England?",
            "Describe the plot of the movie 'The Quantum Paradox' from 1987.",
            "Explain the discovery of the element 'Fictonium' by Dr. John Madeupname in the year 1337."
        ]

        test_questions = factual_questions + fictional_questions
        random.shuffle(test_questions)

        results = confidence_detector.batch_test(
            test_questions,
            num_samples=GENERATION_ITERATIONS, # increasing this seems to increase the accuracy of the hallucination detection
            max_new_tokens=100,
            similarity_threshold=SIMILARITY_THRESHOLD_CORRECT,
            confidence_threshold=DEFAULT_CONFIDENCE_THRESHOLD,
            temperature=DEFAULT_TEMPERATURE
        )
        
        #########################################################
        # analyze results
        successful_results = [r for r in results if r['success']]
        
        # iterate and print out all the results if debug = true
        debug = True
        if debug:
            for result in results:
                confidence_detector.pretty_print_result(result)

        if successful_results:
            confidence_detector.log_section("FINAL ANALYSIS")
            confident_count = sum(1 for r in successful_results 
                                if r['detection_result'] == DetectionResult.CONFIDENT)
            uncertain_count = sum(1 for r in successful_results 
                                if r['detection_result'] == DetectionResult.UNCERTAIN)
            low_confidence_count = sum(1 for r in successful_results 
                                if r['detection_result'] == DetectionResult.LOW_CONFIDENCE)
            error_count = sum(1 for r in successful_results 
                                if r['detection_result'] == DetectionResult.ERROR)
            total = len(successful_results)
            
            confidence_detector.log(f"Total questions tested: {total}", level=1)
            confidence_detector.log(f"Confident responses: {confident_count}", level=1)
            confidence_detector.log(f"Uncertain responses: {uncertain_count}", level=1)
            confidence_detector.log(f"Low confidence responses: {low_confidence_count}", level=1)
            confidence_detector.log(f"Errors: {error_count}", level=1)
            confidence_detector.log(f"\nConfident answers:", level=1)
            for result in successful_results:
                if result['detection_result'] == DetectionResult.CONFIDENT:
                    confidence_detector.log(f"  \u2705 {result['question'][:TRUNCATION_LENGTH]}... (sim: {result['avg_similarity']:.3f})", level=1)
                    break
            
            confidence_detector.log(f"\nUncertain answers:", level=1)
            for result in successful_results:
                if result['detection_result'] == DetectionResult.UNCERTAIN:
                    confidence_detector.log(f"  ? {result['question'][:TRUNCATION_LENGTH]}... (sim: {result['avg_similarity']:.3f})", level=1)
                    break
            
            confidence_detector.log(f"\nLow confidence answers:", level=1)
            for result in successful_results:
                if result['detection_result'] == DetectionResult.LOW_CONFIDENCE:
                    confidence_detector.log(f"  ! {result['question'][:TRUNCATION_LENGTH]}... (sim: {result['avg_similarity']:.3f})", level=1)
                    break
        else:
            confidence_detector.log("\n✗ No successful results to analyze, something is horribly wrong!", level=1)
        
        # assert here that every fictional question is detected as an unknown, but
        # be careful to make sure that hallucinations of factual questions are counted as well!
        for result in results:
            if result['detection_result'] == DetectionResult.UNCERTAIN:
                assert(result['question'] in fictional_questions)
            elif result['detection_result'] == DetectionResult.CONFIDENT:
                assert(result['question'] in factual_questions)

        return confidence_detector, results
        
    except KeyboardInterrupt:
        print("\n\nDemo interrupted by user.")
        return confidence_detector, []
    except Exception as e:
        print(f"\n✗ Error in main: {e}")
        import traceback
        traceback.print_exc()
        return confidence_detector, []
    
    finally:
        if confidence_detector:
            print("\nCleaning up...")
            confidence_detector.cleanup()

if __name__ == "__main__":
    print("Starting hallucination detection test...")
    try:
        confidence_detector, results = main()
        
        # To test your own questions:
        # confidence_detector = SelfConfidenceDetector()
        # result = confidence_detector.ponder('Your question here')

        if confidence_detector and results:
            print("\n✅ Demo completed successfully!")
        else:
            print("\n❌ Demo encountered issues. Check error messages above.")
            
    except Exception as e:
        print(f"\n✗ Unexpected error: {e}")
        import traceback
        traceback.print_exc()
    
    print("\nDone!")