
LLM as a judge

Updated: Nov 05, 2025

Nov 2025
Assess

Using an LLM as a judge — to evaluate the output of another system, usually an LLM-based generator — has garnered attention for its potential to deliver scalable, automated evaluation in generative AI. However, we’re moving this blip from Trial to Assess to reflect newly recognized complexities and risks.

While this technique offers speed and scale, it often fails as a reliable proxy for human judgment. Evaluations are prone to position bias, verbosity bias and low robustness. A more serious issue is scaling contamination: when LLM as a judge is used in training pipelines for reward modeling, it can introduce self-enhancement bias — where a model family favors its own outputs — and preference leakage, blurring the boundary between training and testing. These flaws have led to overfitted results that inflate performance metrics without real-world validity. Several research studies have investigated this pattern more rigorously. To counter these flaws, we are exploring improved techniques, such as using LLMs as a jury (employing multiple models for consensus) or chain-of-thought reasoning during evaluation. While these methods aim to increase reliability, they also increase cost and complexity. We advise teams to treat this technique with caution — ensuring human verification, transparency and ethical oversight before incorporating LLM judges into critical workflows. The approach remains powerful but less mature than once believed.
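The jury idea mentioned above can be sketched as a simple majority vote over independent judge verdicts. This is a minimal illustration, not the Radar's prescribed implementation: the `judge` callables are hypothetical stand-ins for real model calls to different providers; only the aggregation logic is concrete.

```python
# Minimal sketch of "LLMs as a jury": collect pass/fail verdicts from
# several independent judge models and aggregate by majority vote.
from collections import Counter

def jury_verdict(votes: list[str]) -> str:
    """Majority vote over 'pass'/'fail' verdicts; ties count as 'fail'."""
    counts = Counter(votes)
    return "pass" if counts["pass"] > counts["fail"] else "fail"

def evaluate_with_jury(question: str, answer: str, judges) -> str:
    # Each judge is a callable (question, answer) -> "pass" | "fail".
    # In practice each would wrap a call to a different model or provider,
    # which also mitigates the self-enhancement bias of a single judge.
    votes = [judge(question, answer) for judge in judges]
    return jury_verdict(votes)
```

Treating a tie as a failure keeps the jury conservative: an answer only passes when a clear majority of judges agree.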

Oct 2024
Trial

Many of the systems we build have two key characteristics: they can provide an answer based on questions about a large data set, and it's next to impossible to trace how that answer was derived. Despite this opacity, we still want to assess and improve the quality of their responses. With the LLM as a judge pattern, we use one LLM to evaluate the responses of another system, which may itself be LLM-based. We've seen this pattern used to score the relevance of search results in a product catalog and to assess whether an LLM-based chatbot is guiding its users in a sensible direction. Naturally, the evaluator system must be set up and calibrated carefully. It can drive significant gains in efficiency, which in turn lowers cost. This is an ongoing area of research, whose current state is summarized in this article.
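One concrete aspect of the careful calibration mentioned above is checking the judge for position bias. The sketch below, a minimal illustration rather than a prescribed implementation, presents two candidate answers in both orders and only accepts a preference when the two verdicts agree; the `judge` callable is a hypothetical stand-in for a real LLM call.

```python
def debiased_preference(judge, question, ans1, ans2):
    # `judge` is a hypothetical callable (question, answer_a, answer_b)
    # returning "A" or "B"; in practice it would wrap an LLM call.
    first = judge(question, ans1, ans2)   # ans1 shown in position A
    second = judge(question, ans2, ans1)  # ans1 shown in position B
    if first == "A" and second == "B":
        return "ans1"
    if first == "B" and second == "A":
        return "ans2"
    return "tie"  # inconsistent verdicts suggest position bias
```

A judge that always prefers whichever answer appears first will produce contradictory verdicts across the two orderings, and the check degrades its preference to a tie instead of recording a spurious win.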

Published: Oct 23, 2024
