On May 11th, the field of intelligent document processing reached a notable milestone with the official launch of the "IDP Leaderboard", the first unified benchmark for vision-language models in this domain. The benchmark evaluates mainstream models on six core tasks: OCR (optical character recognition), key information extraction, visual question answering, table extraction, classification, and long document processing, drawing on 16 datasets and 9,229 documents to provide a quantifiable reference for industry development.


The test results show that Gemini 2.5 Flash outperformed all competitors in overall capability but unexpectedly underperformed on the OCR and classification tasks, trailing its predecessor, Gemini 2.0 Flash, by 1.84% and 0.05%, respectively. Industry analysts suggest this may reflect Google prioritizing multimodal reasoning during model iteration while neglecting optimization of basic text recognition.

Meanwhile, OpenAI's GPT-4o-mini performed impressively in chart and diagram comprehension, excelling in visual question-answering tasks such as ChartQA. However, its high per-request token cost remains a significant limiting factor in practical applications, and discussion in the developer community has centered on how to balance performance against cost.


It is worth noting that long document processing and table extraction remain the "Achilles' heel" of current vision-language models. Even the best-performing models scored only 69.08% on the LongDocBench task and at most 66.64% on table extraction (measured by the GriTS metric), highlighting the limitations of AI in handling complex layouts and long contexts.

The IDP Leaderboard employs diverse, challenging datasets covering handwritten text, printed text, accented text, structured and unstructured tables, and complex documents of up to 21 pages. Evaluation metrics are chosen to fit each task: edit distance based accuracy for OCR, KIE, VQA, and long document processing; exact match accuracy for classification; and the GriTS metric for table extraction, ensuring a comprehensive and fair evaluation.
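To make these metric choices concrete, the sketch below shows one common way an edit distance based accuracy and an exact match accuracy can be computed. The function names and normalization scheme are illustrative assumptions, not the leaderboard's official scoring; the actual evaluation code is available in the GitHub repository linked below.

```python
# Illustrative sketch only: the IDP Leaderboard's real scoring lives in its
# GitHub repository; names and normalization here are assumptions.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]

def edit_distance_accuracy(prediction: str, reference: str) -> float:
    """1 minus normalized edit distance: 1.0 means a perfect transcription."""
    if not prediction and not reference:
        return 1.0
    distance = levenshtein(prediction, reference)
    return 1.0 - distance / max(len(prediction), len(reference))

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Share of predictions that match the reference label exactly."""
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)

# A near-perfect OCR transcription scores close to 1.0.
print(edit_distance_accuracy("Invoice No. 4821", "Invoice No. 4821"))      # 1.0
print(edit_distance_accuracy("Invo1ce No. 482", "Invoice No. 4821"))       # 0.875
print(exact_match_accuracy(["invoice", "receipt"], ["invoice", "letter"])) # 0.5
```

Normalizing the edit distance by the longer of the two strings keeps the score in the 0-to-1 range regardless of document length, which is one reason this family of metrics suits OCR and long document tasks better than exact match.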

The benchmark plans to update its datasets regularly and to add more models (such as the Claude series) to maintain its relevance and authority. Developers can access the datasets and evaluation code on GitHub (https://github.com/nanonets/idp-leaderboard) and participate in community discussions.

The release of this intelligent document processing benchmark marks the entry of multimodal AI into a stage of quantifiable evaluation in the document processing field. Although Gemini 2.5 Flash demonstrated strong capabilities, the tests also exposed the challenges facing current technology. As the datasets expand and models are further optimized, intelligent document processing is expected to deliver greater value in areas such as enterprise automation, archival digitization, and intelligent search, providing stronger technical support for digital transformation.