Updated: Nov 05, 2025
Nov 2025
Trial

DeepEval is an open-source, Python-based evaluation framework for assessing LLM performance. It can be used to evaluate retrieval-augmented generation (RAG) and other applications built with frameworks such as LlamaIndex or LangChain, as well as to baseline and benchmark models. DeepEval goes beyond word-matching scores, assessing accuracy, relevance and consistency to provide more reliable evaluation in real-world scenarios. It includes metrics such as hallucination detection, answer relevancy and hyperparameter optimization, and supports GEval for creating custom, use case–specific metrics. Our teams are using DeepEval to fine-tune agentic outputs using the LLM-as-a-judge technique. It integrates with pytest and CI/CD pipelines, making it easy to adopt and valuable for continuous evaluation. For teams building LLM-based applications in regulated environments, Inspect AI, developed by the UK AI Safety Institute, offers an alternative with a stronger focus on auditing and compliance.
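
As a concrete illustration of the GEval and pytest integration described above, the following is a minimal sketch that defines a custom, use case–specific correctness metric and wires it into a pytest test. It is based on DeepEval's documented API at the time of writing; exact class names, parameters and thresholds may vary between versions, and the metric criteria and example inputs are purely illustrative.

# Minimal sketch: a custom G-Eval metric judged by an LLM, asserted in a pytest test.
# Based on DeepEval's documented API; signatures may differ between versions, and the
# metric name, criteria and example inputs below are illustrative.
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Custom, use case-specific metric: an LLM acts as the judge and scores the
# actual output against the stated criteria.
correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually consistent with the expected output.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    threshold=0.7,
)

def test_refund_policy_answer():
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="You can return them within 30 days for a full refund.",
        expected_output="We offer a 30-day, no-cost full refund on all purchases.",
    )
    # Fails the pytest run (and hence the CI pipeline) if the judge's score is below the threshold.
    assert_test(test_case, [correctness])

Because the test is ordinary pytest code, it can run in any CI/CD pipeline that already executes the project's test suite.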

Oct 2024
Assess

DeepEval is an open-source, Python-based evaluation framework for assessing the performance of large language models (LLMs). You can use it to evaluate retrieval-augmented generation (RAG) and other kinds of applications built with popular frameworks such as LlamaIndex or LangChain, as well as to baseline and benchmark different models against your needs. DeepEval provides a comprehensive suite of metrics and features for assessing LLM performance, including hallucination detection, answer relevancy and hyperparameter optimization. It integrates with pytest, and combined with its assertions you can easily plug a test suite into a continuous integration (CI) pipeline. If you're working with LLMs, we recommend trying DeepEval to improve your testing process and ensure the reliability of your applications.
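
To make the built-in metrics mentioned above more tangible, here is a minimal sketch that scores a single test case with DeepEval's answer relevancy and hallucination metrics directly, outside pytest. It follows the library's documented API at the time of writing; names, parameters and thresholds may differ across versions, and the question, answer and context are illustrative only.

# Minimal sketch: scoring one test case with DeepEval's built-in metrics, outside pytest.
# Based on the library's documented API; details may vary by version, and the inputs are illustrative.
from deepeval.metrics import AnswerRelevancyMetric, HallucinationMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the refund window?",
    actual_output="Purchases can be refunded within 30 days.",
    # The hallucination metric checks the answer against this supplied context.
    context=["Our policy allows full refunds within 30 days of purchase."],
)

for metric in (AnswerRelevancyMetric(threshold=0.7), HallucinationMetric(threshold=0.5)):
    metric.measure(test_case)  # an LLM judge scores the test case
    print(type(metric).__name__, metric.score, metric.reason)

The same test case could instead be passed to assert_test inside a pytest suite, which is how the CI integration described above is typically set up.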

Published: Oct 23, 2024
