PDF documents hold a large share of enterprise data. Generative AI tools have long been able to ingest and analyze PDF files, but accuracy, speed and cost have been less than ideal. New technology from Databricks could change that.
This week, the company detailed its “ai_parse_document” technology, now integrated with Databricks’ Agent Bricks platform. The technology addresses a critical bottleneck in enterprise AI adoption: roughly 80% of enterprise knowledge remains locked in PDF files, reports and diagrams that AI systems struggle to process and understand accurately.
“There is a common assumption that parsing PDF files is a solved problem, but in fact it is not,” Erich Elsen, chief scientist at Databricks, told VentureBeat. “The challenge is not only the lack of structure in the documents; it is also that enterprise PDFs are inherently complex. They combine digital content, scanned pages and photos of physical documents, as well as tables, charts and irregular layouts, and most existing tools cannot accurately capture this information.”
The hidden complexity of document parsing
While optical character recognition (OCR) has been around for a long time, Elsen says extracting useful, structured data from real-world corporate documents remains fundamentally unsolved.
Key elements such as tables with merged cells, figure captions, and spatial relationships between document elements are routinely missed or misread by existing tools, which makes downstream AI applications, retrieval-augmented generation (RAG) systems, and business intelligence dashboards unreliable.
A typical enterprise workaround is to stitch together several imperfect tools: one service for layout detection, another for OCR, a third for table extraction, and additional APIs for figure analysis. This approach requires months of custom data engineering and ongoing maintenance as document formats evolve.
“To compensate, teams had to combine multiple imperfect tools or build extensive custom pipelines, spending months on data engineering instead of innovation,” Elsen said. “ai_parse_document solves this problem by extracting complete, structured data from real documents – so organizations can finally trust and query unstructured data directly in Databricks.”
Technical approach: End-to-end training and pipeline integration
There are many PDF parsing services on the market today, including AWS Textract, Google Document AI, and Azure Document Intelligence, among others. Elsen argued that rather than simply reading text, ai_parse_document uses a system of modern AI components trained end-to-end to extract structured context with state-of-the-art quality.
The function goes beyond basic extraction and captures:
- Tables preserved exactly as they appear, including merged cells and nested structures
- Figures and diagrams with AI-generated captions and descriptions
- Spatial metadata and bounding boxes that enable precise location of elements
- Optional output images for multimodal search applications
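To make the idea of spatial metadata concrete, here is a minimal Python sketch of how parsed elements with bounding boxes might be represented and filtered by location. The `ParsedElement` record, its field names, and the coordinate convention are illustrative assumptions, not the actual ai_parse_document output schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page units


# Hypothetical element record; field names are illustrative only and do
# not reflect the real ai_parse_document output schema.
@dataclass
class ParsedElement:
    kind: str      # e.g. "text", "table", "figure"
    page: int      # zero-based page index
    bbox: Box      # bounding box of the element on its page
    content: str   # extracted text, table markup, or generated caption


def overlaps(a: Box, b: Box) -> bool:
    """Return True when two boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def elements_in_region(
    elements: List[ParsedElement], page: int, region: Box
) -> List[ParsedElement]:
    """Select elements on one page whose bounding box intersects a region."""
    return [e for e in elements if e.page == page and overlaps(e.bbox, region)]


# Made-up sample data standing in for one parsed document.
elements = [
    ParsedElement("text", 0, (50, 40, 550, 90), "Quarterly report"),
    ParsedElement("table", 0, (50, 120, 550, 400), "<table>...</table>"),
    ParsedElement("figure", 1, (80, 100, 500, 380), "Revenue by region"),
]

# Everything that touches the top strip of the first page.
top_of_page = elements_in_region(elements, page=0, region=(0, 0, 600, 100))
print([e.kind for e in top_of_page])
```

Keeping the bounding box alongside each element is what lets a downstream application point back to the exact place in the source page, for example to highlight the table a RAG answer was drawn from.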
All results are stored directly in Databricks Unity Catalog as Delta tables, which means parsed documents become structured data you can query without leaving the Databricks environment. This is a key difference from cloud services that require data to be exported for processing.
“Through data-centric training and optimized inference, we achieved 3-5x lower costs while matching or exceeding leading systems such as Textract, Document AI and Azure Document Intelligence,” Elsen said.
Early adoption by businesses in manufacturing and industrial sectors
Several large enterprises have already deployed ai_parse_document in production, and use cases include optimizing data analytics workflows, democratizing document processing, and developing RAG applications.
For example, Elsen noted that Rockwell Automation uses ai_parse_document to reduce the configuration burden on its data scientists.
“What once required significant configuration to support complex solutions is now streamlined, so their teams can spend more time innovating and less time managing infrastructure,” he said.
Meanwhile, TE Connectivity uses ai_parse_document to democratize unstructured data processing.
“Previously, extracting tables, text and metadata from documents required complex, code-intensive workflows,” Elsen said. “Databricks condenses all of this into a single SQL function, making advanced document processing accessible to every data team, not just data scientists.”
Another early adopter is Emerson Electric, which uses ai_parse_document for RAG. Elsen explained that by parsing documents in parallel directly into Delta tables, Emerson can develop RAG applications quickly and easily, all inside its existing Databricks environment.
The platform integration play
While Databricks has a long history with open source software, the ai_parse_document technology is a proprietary component of the Databricks platform.
Unlike standalone document parsing APIs, ai_parse_document is deeply integrated with Databricks’ Agent Bricks platform, a collection of AI and orchestration capabilities for building production AI agents.
The feature works with the broader Databricks data infrastructure, including:
- Spark Declarative Pipelines: Provide automatic incremental processing, which means new documents arriving in SharePoint, S3 or Azure Data Lake Storage are parsed automatically without manual orchestration.
- Unity Catalog: Manages permissions, audit trails and data provenance for parsed content in exactly the same way as for structured data.
- Vector Search: Indexes parsed document elements, including text, tables and captioned figures, for multimodal RAG applications.
- AI function chaining: Allows developers to pass ai_parse_document output to ai_extract (entity extraction), ai_classify (document categorization), and ai_summarize (content summary) in a single SQL query.
- Multi-agent supervisor: Coordinates document processing agents with other specialized agents for complex workflows.
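As a rough illustration of the chaining described above, the following Python sketch assembles a single SQL statement that feeds ai_parse_document output into ai_classify, ai_extract and ai_summarize. The table name `main.docs.raw_pdfs`, the `path` and `content` columns, and the label lists are hypothetical, and the `CAST(... AS STRING)` step used to turn the parsed result into text is an assumption made for brevity; a real pipeline would likely select specific fields from the parsed output instead.

```python
from typing import Sequence


def sql_array(values: Sequence[str]) -> str:
    """Render Python strings as a SQL ARRAY literal, escaping single quotes."""
    quoted = ", ".join("'" + v.replace("'", "''") + "'" for v in values)
    return f"ARRAY({quoted})"


def build_chained_query(
    source_table: str, labels: Sequence[str], entities: Sequence[str]
) -> str:
    """Build one SQL statement that parses, classifies, extracts and summarizes.

    Assumptions: source_table is a trusted identifier (it is not escaped)
    with a string `path` column and a binary `content` column, and casting
    the parsed result to STRING is an acceptable stand-in for picking out
    its text fields.
    """
    return (
        "WITH parsed AS (\n"
        "  SELECT path, CAST(ai_parse_document(content) AS STRING) AS doc_text\n"
        f"  FROM {source_table}\n"
        ")\n"
        "SELECT\n"
        "  path,\n"
        f"  ai_classify(doc_text, {sql_array(labels)}) AS category,\n"
        f"  ai_extract(doc_text, {sql_array(entities)}) AS entities,\n"
        "  ai_summarize(doc_text) AS summary\n"
        "FROM parsed"
    )


# Hypothetical table and label names, used only to show the query shape.
query = build_chained_query(
    "main.docs.raw_pdfs",
    labels=["invoice", "contract", "datasheet"],
    entities=["supplier", "date"],
)
print(query)
```

The common table expression parses each document once, so the three downstream functions reuse the same parsed text rather than triggering repeated parsing.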
“Parsing is just the beginning and is rarely an end in itself,” Elsen said. “The goal is to enable customers to combine our ai_functions, such as ai_extract and ai_classify, together with ai_parse_document to transform their documents into actionable data and insights. We also aim to ensure that the document corpus is seamlessly transformed into a knowledge base for use in RAG or other information retrieval agents.”
What this means for enterprise AI strategy
For enterprises building AI agent systems, it is critical to understand how those systems actually ingest and interpret PDF documents.
Databricks’ approach sheds new light on a problem that many may have considered solved. It challenges existing expectations with a new architecture that could benefit many kinds of workflows. However, this is a platform-specific feature, so organizations that are not yet using Databricks will need to evaluate it carefully.
For technical decision makers evaluating AI agent platforms, the key takeaway is that document parsing is moving from a specialized external service to an integrated platform capability.
