Technology Radar
Docling is an open-source Python and TypeScript library for converting unstructured documents into clean, machine-readable outputs. Using a computer vision–based approach to layout and semantic understanding, it processes complex inputs, including PDFs and scanned documents, into structured formats such as JSON and Markdown. That makes it a strong fit for retrieval-augmented generation (RAG) pipelines and for producing structured outputs from LLMs. It differs from vision-first retrieval approaches such as ColPali, which feed page images directly into a vision-language model.
Docling provides an open-source, self-hostable alternative to proprietary cloud-managed services such as Azure Document Intelligence, Amazon Textract and Google Document AI, while integrating well with frameworks such as LangGraph. In our experience, it performs well in production-scale extraction workloads across digital and scanned PDFs, including very large files containing text, tables and images. It delivers a strong quality-to-cost balance for downstream agentic RAG workflows. Based on these results, we’re moving Docling to Trial.
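Docling is an open-source Python and TypeScript library for advanced document processing of unstructured data. It addresses the often-overlooked “last mile” problem of converting real-world documents, such as PDFs and PowerPoint files, into clean, machine-readable formats. Unlike traditional extractors, Docling uses a computer vision–based approach to parse a document’s layout and semantic structure, which makes its output particularly valuable for retrieval-augmented generation (RAG) pipelines. It converts complex documents into structured formats such as JSON or Markdown and supports techniques such as structured output from LLMs. This differs from ColPali, which feeds page images directly into a vision-language model for retrieval. Docling’s open-source nature and Python-based core (built on custom Pydantic data models) give teams a flexible, self-hosted alternative to proprietary cloud tools such as Azure Document Intelligence, Amazon Textract and Google Document AI. Backed by IBM Research, the project is developing rapidly and offers a plug-and-play architecture that integrates with other frameworks such as LangGraph, making it well worth evaluating for teams building production-grade AI data pipelines.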
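The conversion step described above can be sketched in a few lines of Python. This is a minimal illustration, not a recommended pipeline: it assumes the docling package is installed and uses its DocumentConverter class to export a document as Markdown. The split_by_heading helper is hypothetical (it is not part of Docling) and only shows one simple way the Markdown output might be grouped into chunks before indexing for RAG.

```python
import re
import sys


def convert_to_markdown(source: str) -> str:
    """Convert a document (local path or URL) to Markdown using Docling.

    Assumption: the `docling` package is installed. The import is deferred
    so the helper below can be used without it.
    """
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)
    return result.document.export_to_markdown()


def split_by_heading(markdown: str) -> list[tuple[str, str]]:
    """Hypothetical helper (not part of Docling): one chunk per heading.

    Returns (heading, body) pairs; text before the first heading gets an
    empty heading.
    """
    chunks = []
    heading, body = "", []
    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            # Close the previous chunk if it has a heading or any text.
            if heading or "".join(body).strip():
                chunks.append((heading, "\n".join(body).strip()))
            heading, body = match.group(1).strip(), []
        else:
            body.append(line)
    if heading or "".join(body).strip():
        chunks.append((heading, "\n".join(body).strip()))
    return chunks


if __name__ == "__main__":
    # Usage: python sketch.py <path-or-url-to-document>
    for title, text in split_by_heading(convert_to_markdown(sys.argv[1])):
        print(f"[{title or 'preamble'}] {len(text)} characters")
```

Splitting on headings is only one option. Teams with production workloads would likely evaluate chunking strategies against their own retrieval quality and cost targets.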