Welcome to part two of what has evolved into a three-part research series focused on putting the “engineering” back into prompt engineering by bringing meaningful observability metrics to the table.



In our last post, we explored an approach to estimate the importance of individual tokens in LLM prompts. An interesting revelation was the role the perceived ambiguity of the prompt played in the alignment between our estimation and the “ground truth” integrated gradients approach. We spent some time trying to quantify this and got some pretty interesting results that aligned well with human intuition.

In this post, we present two measures of model uncertainty in producing responses for prompts: structural uncertainty and conceptual uncertainty. Structural uncertainty is quantified using normalized entropy to measure the variability in the probabilities of different tokens being chosen at each position in the generated text. In essence, it captures how unsure the model is at each decision point as it generates tokens in a response. Conceptual uncertainty is captured by aggregating the cosine distances between embeddings of partial responses and the actual response, giving insight into the model’s internal cohesion in generating semantically consistent text. Just like last time, this is a jumping-off point. The aim of this research is to make our interactions with foundation models more transparent and predictable, and there’s still plenty more work to be done.

tl;dr:

Introduces two measures to quantify uncertainties in language model responses (Structural and Conceptual uncertainties)

These measures help assess the predictability of a prompt, and can help identify when to fine-tune vs. continue prompt engineering

This work also sets the stage for objective model comparisons on specific tasks - making it easier to choose the most suitable language model for a given use case

Why care about uncertainty?

In a nutshell: predictability.

If you’re building a system that uses a prompt template to wrap some additional data (e.g., RAG) - how confident are you that the model will always respond in the way you want? Do you know what shape of data input would cause an increase in weird responses?

By better understanding uncertainty in a model-agnostic way, we can build more resilient applications on top of LLMs. As a fun side effect, we also think this approach can give practitioners a way to benchmark when it may be time to fine-tune vs. continue to prompt engineer.

Lastly, if we’re able to calculate interpretable metrics that reflect prompt and response alignment - we’re several steps closer to being able to compare models in an apples-to-apples way for specific tasks.

Intuition

When we talk about "model uncertainty", we're really diving into how sure or unsure a model is about its response to a prompt. The more ways a model thinks it could answer, the more uncertain it is.

Imagine asking someone their favorite fruit. If they instantly say "apples", they're pretty certain. But if they hem and haw, thinking of oranges, bananas, and cherries, before finally arriving at “apples”, their answer becomes more uncertain.

Our original goal was to calculate a single metric that would quantify this uncertainty, which felt fairly trivial when we had access to the logprobs of other sampled tokens at a position. Perplexity is frequently used for this purpose, but it’s a theoretically unbounded measure and is often hard to reason about across prompts/responses. Instead, we turned to entropy, which can be normalized so that the result falls between 0 and 1 and tells a very similar story. Simply put: we wanted to use normalized entropy to measure how spread out the model’s probability is across the candidate tokens at each position. If the model leans heavily towards one answer, the entropy is close to zero, but if it’s torn between multiple options it climbs closer to one.
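To make the normalized entropy idea concrete, here is a minimal sketch rather than our exact implementation. It assumes you have the logprobs of the candidate tokens at each position (for example, the top-k alternatives an API returns), renormalizes them so they sum to 1, and divides the Shannon entropy by the log of the number of candidates. The function names and the choice to average across positions are assumptions made for illustration.

```python
import math


def normalized_entropy(logprobs):
    """Normalized Shannon entropy (0 to 1) over the candidate tokens at one position.

    Assumption: `logprobs` holds natural-log probabilities for the candidates
    we can see (e.g. top-k), which we renormalize to sum to 1.
    """
    if len(logprobs) < 2:
        return 0.0  # a single candidate leaves nothing to be uncertain about
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Dividing by log(k) maps the result to [0, 1]: 0 = one dominant token,
    # 1 = all k candidates equally likely.
    return entropy / math.log(len(probs))


def structural_uncertainty(per_position_logprobs):
    """Average the per-position normalized entropies (averaging is an assumption)."""
    scores = [normalized_entropy(lps) for lps in per_position_logprobs]
    return sum(scores) / len(scores) if scores else 0.0


# One confident position and one position torn between four options.
confident = [math.log(p) for p in (0.97, 0.01, 0.01, 0.01)]
torn = [math.log(0.25)] * 4
print(normalized_entropy(confident), normalized_entropy(torn))
```

Normalizing by the number of candidates is what keeps the score comparable across positions and prompts, even when the number of returned alternatives differs.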

However, we ran into some interesting cases where entropy was high simply because the model was choosing between several very similar tokens. It would’ve had practically no impact on the overall response if the model chose one token or another, and the straight entropy calculation didn’t capture this nuance. We realized then that we needed a second measure to not only assess how uncertain the model was about which token to pick, but how “spread out” the potential responses could’ve been had those other tokens been picked.

As we learned from our research into estimating token importances, simply comparing token-level embeddings isn’t enough to extract meaningful information about the change in trajectory of a response, so instead we create embeddings over each partial response and compare those to the embedding of the final response to get a sense of how those meanings diverge.

To summarize:

Structural uncertainty: We use normalized entropy to calculate how uncertain the model was in each token selection. If the model leans heavily towards one answer, the entropy is low. But if it's torn between multiple options, entropy spikes. The normalization step ensures we're comparing things consistently across different prompts.

Conceptual uncertainty: For each sampled token in the response, we create a 'partial' version of the potential response up to that token. Each of these partial responses is transformed into an embedding. We then measure the distance between this partial response and the model's final, complete response. This tells us how the model's thinking evolves as it builds up its answer.
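The conceptual uncertainty steps summarized above can be sketched roughly as follows. This is an illustration under stated assumptions, not our production code: `embed` is any function that maps text to a vector (in practice a sentence-embedding model), `toy_embed` is a bag-of-words stand-in so the example runs on its own, and taking the plain mean of the cosine distances is one possible aggregation among several.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def conceptual_uncertainty(response_tokens, candidates_per_position, embed):
    """Mean cosine distance between each partial response and the final response.

    For position i and each candidate token sampled there, the partial response
    is the chosen tokens before i plus that candidate.
    """
    final_vec = embed("".join(response_tokens))
    distances = []
    for i, candidates in enumerate(candidates_per_position):
        prefix = "".join(response_tokens[:i])
        for token in candidates:
            distances.append(cosine_distance(embed(prefix + token), final_vec))
    return sum(distances) / len(distances) if distances else 0.0


# Hypothetical stand-in embedding: word counts over a tiny fixed vocabulary.
VOCAB = ["apples", "oranges", "bananas", "cherries"]


def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


# A one-token response where the model also considered "oranges".
score = conceptual_uncertainty(["apples"], [["apples", "oranges"]], toy_embed)
print(score)
```

Note that cosine distance ranges from 0 to 2, so this raw mean is not bounded to 0-1 the way normalized entropy is; any rescaling would be a further design choice.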

Interpreting these metrics becomes pretty straightforward:

If structural uncertainty is low but conceptual uncertainty is high, the model is clear about the tokens it selects but varies significantly in the overall messages it generates. This could imply that the model understands the syntax well but struggles with maintaining a consistent message.

Conversely, high structural uncertainty and low conceptual uncertainty could indicate that the model is unsure at the token level but consistent in the overall message. Here, the model knows what it wants to say but struggles with how to say it precisely.

If both are high or both are low, it may suggest that the token-level uncertainty and overall message uncertainty are strongly correlated for the specific task, either both being well-defined or both lacking clarity.



Interesting results

We built a little demo and ran several prompts to see if the metrics aligned with our intuition. The results were extremely interesting. We started out simple:

“Who was the first president of the USA?”