Baidu recently released and open-sourced PaddleOCR-VL, its in-house multimodal document parsing model. The model ranks first overall on the OmniDocBench v1.5 document parsing leaderboard with a score of 92.6, performing strongly across four core capabilities: text, tables, formulas, and reading order.

The core model has only 0.9B parameters, which keeps it lightweight and efficient. It can accurately recognize complex elements such as printed text, handwritten Chinese characters, tables, formulas, and charts at low computational cost. The model supports 109 languages, including Chinese, English, French, Japanese, Russian, Arabic, and Spanish, and suits intelligent document processing tasks such as government and enterprise document management, knowledge retrieval, archive digitization, and research information extraction.


PaddleOCR-VL-0.9B is derived from ERNIE 4.5 (Wenxin 4.5). It improves both accuracy and efficiency by pairing a NaViT-style dynamic-resolution visual encoder with the ERNIE-4.5-0.3B language model. On OmniDocBench v1.5 the model reports a text edit distance of 0.035, a CDM of 91.43 for formula recognition, a TEDS of 93.52 for tables, and a reading order edit distance of 0.043. For the two edit-distance metrics lower is better; for CDM and TEDS higher is better. These results indicate stable, reliable recognition in difficult scenarios such as complex documents, handwritten manuscripts, and historical archives.
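To make the text edit distance figure more concrete, here is a minimal sketch of a normalized edit distance between a predicted string and a reference string. The normalization used here (dividing the Levenshtein distance by the longer string's length) is an assumption for illustration; the benchmark's own scoring script may normalize or aggregate differently, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(pred: str, ref: str) -> float:
    """Assumed normalization: distance divided by the longer string's length.

    Returns a value in [0, 1]; 0 means the strings are identical.
    """
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 0.0
    return levenshtein(pred, ref) / longest
```

Under this definition, a score around 0.035 would correspond to roughly 3 to 4 character-level edits per 100 characters of reference text.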


In terms of inference speed, PaddleOCR-VL processes 1,881 tokens per second on a single A100 GPU, a significant improvement over other mainstream models: it is 14.2% faster than MinerU2.5 and 253.01% faster than dots.ocr. These figures set a new bar for OCR throughput.
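The percentages above can be read as throughput ratios. The sketch below assumes that "X% faster" means PaddleOCR-VL's throughput equals the baseline's throughput multiplied by (1 + X/100); under that assumption it back-computes the implied baseline speeds. The derived numbers are illustrative arithmetic only, not measurements from the release, and the function name is hypothetical.

```python
def implied_baseline(throughput: float, percent_faster: float) -> float:
    """Baseline tokens/s implied by 'throughput is percent_faster % faster'.

    Assumes throughput = baseline * (1 + percent_faster / 100).
    """
    return throughput / (1 + percent_faster / 100)


paddleocr_vl = 1881.0  # tokens per second on one A100, as reported

# Implied baselines under the assumption above (not measured values).
mineru = implied_baseline(paddleocr_vl, 14.2)
dots_ocr = implied_baseline(paddleocr_vl, 253.01)

print(f"MinerU2.5 (implied): {mineru:.0f} tokens/s")
print(f"dots.ocr (implied):  {dots_ocr:.0f} tokens/s")
print(f"Speed ratio vs dots.ocr: {paddleocr_vl / dots_ocr:.2f}x")
```

Read this way, "253.01% faster" corresponds to roughly 3.5 times the throughput of dots.ocr, not 2.5 times.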


Unlike traditional OCR, PaddleOCR-VL interprets complex layouts much as a human reader would. It accurately extracts varied content such as financial tables, mathematical formulas, and class notes, and automatically restores an order that matches human reading habits, so the extracted information stays accurate and logically coherent. Its two-stage architecture first detects the layout and predicts the reading order, then recognizes elements such as text, tables, and formulas and emits them as structured output, which significantly improves the stability and efficiency of recognition.
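The two-stage flow described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not PaddleOCR-VL's actual API: `Region`, `parse_page`, and the stand-in detector and recognizers are all hypothetical names, and real stages would operate on image crops instead of strings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Region:
    kind: str   # element type, e.g. "text", "table", "formula"
    order: int  # reading-order index predicted in stage 1
    crop: str   # stand-in for the cropped image of this region


def parse_page(
    page: str,
    detect_layout: Callable[[str], List[Region]],
    recognizers: Dict[str, Callable[[str], str]],
) -> List[dict]:
    """Stage 1: layout + reading order. Stage 2: per-element recognition."""
    regions = detect_layout(page)
    results = []
    # Emit elements in the predicted reading order, not detection order.
    for region in sorted(regions, key=lambda r: r.order):
        recognize = recognizers[region.kind]
        results.append({
            "kind": region.kind,
            "order": region.order,
            "content": recognize(region.crop),
        })
    return results


# Hypothetical stand-ins so the sketch runs end to end.
def fake_detector(page: str) -> List[Region]:
    return [
        Region("table", 1, "crop-b"),
        Region("text", 0, "crop-a"),
        Region("formula", 2, "crop-c"),
    ]


fake_recognizers = {
    "text": lambda crop: f"text from {crop}",
    "table": lambda crop: f"<table> from {crop}",
    "formula": lambda crop: f"$...$ from {crop}",
}

output = parse_page("page-1", fake_detector, fake_recognizers)
for item in output:
    print(item["order"], item["kind"], item["content"])
```

Separating layout analysis from element recognition means each recognizer only sees one kind of content, which is one plausible reason such a design helps stability.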