A new document format wants to fix how enterprises feed files to AI

DocLang, backed by IBM and NVIDIA under the Linux Foundation, proposes an XML format tuned for LLM tokenizers, raising the question of whether the document format is really the bottleneck.

Alessandro Benigni

PUBLISHED JUN 17, 2026

A coalition of enterprise software vendors has decided the problem with AI document processing is the documents themselves. The LF AI & Data Foundation announced DocLang, an open standard for encoding enterprise files in a format designed from the ground up for LLM consumption, with backing from IBM, NVIDIA, Red Hat, ABBYY, HumanSignal, and Forgis.

The premise is straightforward: PDFs, Markdown, HTML, and LaTeX were built to render text for human readers, not to feed structured meaning to a tokenizer. When a language model ingests a PDF through an OCR pipeline, the baseline cost is roughly 1,200 input tokens per page, and that number climbs steeply with tables, charts, and multi-column layouts. DocLang proposes a limited XML vocabulary that maps document elements to LLM tokens on a near one-to-one basis, keeping the conversion lossless while cutting token overhead. In ABBYY’s own benchmark on IBM’s 2025 annual report, the DocLang version required 5,310 input tokens versus 8,421 for the PDF, with lower latency and fewer parsing errors.
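The two token counts quoted from ABBYY's benchmark reduce to a simple ratio. A minimal Python sketch of that arithmetic follows; the function name is illustrative and is not part of any DocLang or ABBYY tooling.

```python
def token_savings(baseline_tokens: int, doclang_tokens: int) -> tuple[float, float]:
    """Return (reduction factor, percent reduction) between two token counts."""
    if baseline_tokens <= 0 or doclang_tokens <= 0:
        raise ValueError("token counts must be positive")
    factor = baseline_tokens / doclang_tokens
    percent = (1 - doclang_tokens / baseline_tokens) * 100
    return factor, percent


# Figures as reported for the IBM 2025 annual report benchmark.
factor, percent = token_savings(8_421, 5_310)
print(f"{factor:.2f}x fewer input tokens, a {percent:.1f}% reduction")
```

For this one document, that works out to about 1.59x, or roughly 37% fewer input tokens.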

Jon Knisley, AI Value and Enablement Lead at ABBYY, told The Register that ambiguous document structure forces models into guesswork, which burns tokens on layout inference rather than meaning extraction. ABBYY claims 4x to 30x lower processing costs in internal benchmarks, depending on model and document complexity. Governance is another stated advantage: provenance metadata that gets stripped when documents move between systems would remain attached in the DocLang representation.

The Register, which first covered the announcement, noted the project’s origins in IBM’s 2024 open-source Docling toolkit, a PDF-to-structured-data converter that DocLang extends into a proper interchange standard.

Here is where the framing deserves scrutiny. The DocLang consortium is dominated by companies with a direct commercial interest in document AI pipelines. ABBYY sells AI document processing. IBM built the underlying toolkit. The benchmarks cited are ABBYY’s own. Independent validation is absent from the announcement, and Knisley acknowledged the standard is early.

The deeper structural question is whether format is actually the primary bottleneck. Modern multimodal frontier models can read PDFs natively with reasonable accuracy. The retrieval problems enterprises encounter in document AI pipelines often stem from chunking strategies, embedding quality, retrieval architecture, and context window management rather than tokenizer-level inefficiency in the source format. Reformatting every enterprise document archive to a new standard is a substantial undertaking. If a retrieval-augmented system is failing because its vector index is poorly constructed, DocLang solves a different problem than the one on fire.

That said, the format-as-bottleneck argument is not baseless for specific use cases. A cleaner source format would provide measurable benefit in three kinds of environment: high-volume document ingestion pipelines where token cost compounds at scale, legacy OCR workflows processing scanned PDFs with dense tables, and regulated industries where metadata provenance matters. The 30x savings figure almost certainly reflects a case where the input PDF is structurally terrible. A more typical document likely sits in the 1.5x to 2x range, which is consistent with ABBYY’s own annual-report benchmark: 8,421 tokens against 5,310 is a reduction of roughly 1.6x.
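A rough cost model shows why the size of the savings factor matters more than the headline number. The sketch below uses the 1,200-tokens-per-page baseline cited earlier; the monthly page volume and the price per million input tokens are hypothetical placeholders, not figures from the announcement. It also assumes cost scales linearly with input tokens and ignores output tokens and the cost of the conversion itself.

```python
def monthly_input_cost(pages: int, tokens_per_page: float,
                       usd_per_million_tokens: float) -> float:
    """Estimated monthly spend on input tokens, in US dollars."""
    return pages * tokens_per_page * usd_per_million_tokens / 1_000_000


def savings_by_factor(baseline_usd: float, factors: list[float]) -> dict[float, float]:
    """Map each assumed reduction factor to the dollars saved per month."""
    return {f: baseline_usd - baseline_usd / f for f in factors}


BASELINE_TOKENS_PER_PAGE = 1_200   # baseline cited in the article
PAGES_PER_MONTH = 5_000_000        # hypothetical volume
USD_PER_MILLION_TOKENS = 3.00      # hypothetical price

baseline = monthly_input_cost(PAGES_PER_MONTH, BASELINE_TOKENS_PER_PAGE,
                              USD_PER_MILLION_TOKENS)
print(f"baseline: ${baseline:,.0f} per month")
for factor, saved in savings_by_factor(baseline, [1.5, 2.0, 4.0, 30.0]).items():
    print(f"{factor:>4}x reduction saves ${saved:,.0f} per month")
```

With these placeholder inputs the baseline is $18,000 a month. A 1.5x reduction saves $6,000 of that and a 30x reduction saves $17,400, so whether a migration pays off depends heavily on where a given corpus actually lands in that range.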

The open-standard positioning under the Linux Foundation is the strongest signal here. Vendor-neutral governance reduces the risk that DocLang becomes a proprietary lock-in mechanism. Whether enterprise IT departments will invest in tooling to convert existing document archives to a new format depends on how much friction the conversion pipeline adds and how many AI platforms adopt native DocLang reading. Neither has been demonstrated at production scale.

If you are running high-volume document ingestion at enterprise scale and token costs are a meaningful line item, the ABBYY benchmark tool is worth running against your own document corpus before committing to any migration investment.

Based on reporting by Thomas Claburn for The Register, published June 15, 2026.
