Last updated: Nov 05, 2025
Nov 2025
Assess

Using an LLM as a judge — to evaluate the output of another system, usually an LLM-based generator — has garnered attention for its potential to deliver scalable, automated evaluation in generative AI. However, we’re moving this blip from Trial to Assess to reflect newly recognized complexities and risks.

While this technique offers speed and scale, it often fails as a reliable proxy for human judgment. Evaluations are prone to position bias, verbosity bias and low robustness. A more serious issue is scaling contamination: when LLM as a judge is used in training pipelines for reward modeling, it can introduce self-enhancement bias — where a model family favors its own outputs — and preference leakage, blurring the boundary between training and testing. These flaws have led to overfitted results that inflate performance metrics without real-world validity, and recent research has begun to investigate these failure modes more rigorously. To counter them, we are exploring improved techniques, such as using LLMs as a jury (employing multiple models for consensus) or chain-of-thought reasoning during evaluation. While these methods aim to increase reliability, they also increase cost and complexity. We advise teams to treat this technique with caution — ensuring human verification, transparency and ethical oversight before incorporating LLM judges into critical workflows. The approach remains powerful but less mature than once believed.
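As a rough illustration of the jury variant, the sketch below polls several judge models and takes a majority vote. It assumes the OpenAI Python SDK (v1+); the juror model names, the PASS/FAIL rubric and the conservative tie-breaking rule are illustrative choices, not a prescribed implementation, and in practice jurors should span different model families to reduce self-enhancement bias.

```python
# Minimal LLM-as-a-jury sketch. Assumes the OpenAI Python SDK (v1+)
# and OPENAI_API_KEY in the environment; model names are illustrative.
from collections import Counter
from openai import OpenAI

client = OpenAI()

# Juror models are assumptions; ideally draw them from different families.
JUROR_MODELS = ["gpt-4o", "gpt-4o-mini", "gpt-4.1-mini"]

JUDGE_PROMPT = """You are evaluating a chatbot answer for correctness and helpfulness.
Question: {question}
Answer: {answer}
Reply with exactly one word: PASS or FAIL."""


def jury_verdict(question: str, answer: str) -> str:
    """Ask each juror model for a PASS/FAIL vote and return the majority verdict."""
    votes = []
    for model in JUROR_MODELS:
        response = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": JUDGE_PROMPT.format(question=question, answer=answer),
            }],
            temperature=0,  # reduce run-to-run variance in judgments
        )
        votes.append(response.choices[0].message.content.strip().upper())
    # Majority vote across jurors; anything short of a majority counts as FAIL.
    tally = Counter(votes)
    return "PASS" if tally["PASS"] > len(votes) / 2 else "FAIL"


if __name__ == "__main__":
    print(jury_verdict("What is the capital of France?", "Paris."))
```

Requiring a one-word verdict keeps the votes machine-parseable; a chain-of-thought variant would instead ask each juror to reason step by step before voting, trading additional token cost for (potentially) more reliable judgments.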

Oct 2024
Trial

Many systems we build share two key characteristics: they can answer questions about a large data set, and it's next to impossible to follow how they arrived at those answers. Despite this opacity, we still want to assess and improve the quality of the responses. With the LLM as a judge pattern, we use an LLM to evaluate the responses of another system, which in turn might itself be based on an LLM. We've seen this pattern used to evaluate the relevance of search results in a product catalog and to assess whether an LLM-based chatbot was guiding its users in a sensible direction. Naturally, the evaluator system must be set up and calibrated carefully. Done well, it can drive significant efficiency gains, which in turn translate to lower costs. This is an ongoing area of research, with the current state summarized in this article.
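To make the pattern concrete, here is a minimal single-judge sketch for the product-catalog example, again assuming the OpenAI Python SDK; the prompt wording, the 1–5 scale and the model name are assumptions for illustration, and a judge like this should be calibrated against human-labeled examples before its scores are trusted.

```python
# Minimal single-judge sketch for scoring search-result relevance.
# Assumes the OpenAI Python SDK (v1+) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Prompt wording and the 1-5 scale are illustrative assumptions.
RELEVANCE_PROMPT = """Rate how relevant this product is to the search query,
on a scale from 1 (irrelevant) to 5 (perfect match).
Query: {query}
Product: {product}
Reply with the number only."""


def judge_relevance(query: str, product: str, model: str = "gpt-4o-mini") -> int:
    """Return the judge model's 1-5 relevance score for a single search result."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": RELEVANCE_PROMPT.format(query=query, product=product),
        }],
        temperature=0,
    )
    # Parsing is deliberately strict: a non-numeric reply should fail loudly
    # rather than silently produce a bogus score.
    return int(response.choices[0].message.content.strip())


if __name__ == "__main__":
    print(judge_relevance("wireless noise-cancelling headphones",
                          "Sony WH-1000XM5 Wireless Headphones"))
```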

Published: Oct 23, 2024
