DeepSeek has announced the release of its next-generation document recognition model, DeepSeek-OCR2. The model makes significant advances in visual encoder design, aiming to address traditional models' inability to capture logical structure when processing complex document layouts.

The core highlight of DeepSeek-OCR2 is its in-house DeepEncoder V2 encoder. Unlike traditional visual models, which process images in a fixed grid order from left to right and top to bottom, the new model introduces the concept of a "visual causal flow": it dynamically adjusts the processing order based on image semantics, intelligently sorting visual content before recognizing text. This brings the machine's reading logic closer to how humans understand tables, formulas, and complex documents.
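To make the idea concrete, here is a minimal sketch of semantics-driven reordering versus fixed raster order. It is purely illustrative: names like `Patch` and `score` are hypothetical stand-ins for whatever priority signal DeepEncoder V2 actually learns, not DeepSeek's API.

```python
from dataclasses import dataclass

@dataclass
class Patch:
    row: int
    col: int
    score: float  # hypothetical semantic priority predicted by the encoder

def raster_order(patches):
    # Traditional fixed grid order: top-to-bottom, then left-to-right.
    return sorted(patches, key=lambda p: (p.row, p.col))

def causal_flow_order(patches):
    # Dynamic order: visit semantically important regions first.
    return sorted(patches, key=lambda p: -p.score)

patches = [Patch(0, 0, 0.2), Patch(0, 1, 0.9), Patch(1, 0, 0.5)]
print([(p.row, p.col) for p in causal_flow_order(patches)])
# → [(0, 1), (1, 0), (0, 0)]
```

The point of the contrast is that raster order ignores content entirely, while a learned ordering can, for example, read a table header before the cells beneath it.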
In terms of architecture, the model retains an efficient encoder-decoder framework: after semantic modeling and reordering by DeepEncoder V2, the image is decoded by a mixture-of-experts (MoE) language model. Experimental data shows that on the OmniDocBench v1.5 benchmark, DeepSeek-OCR2 achieved an overall score of 91.09%, an improvement of 3.73% over the previous version. In reading-order accuracy in particular, its edit distance decreased significantly, indicating a stronger ability to restore content structure.
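For readers unfamiliar with MoE decoding, the following is a minimal sketch of the general top-k routing idea such decoders use: a gate scores each expert, only the k best experts run, and their outputs are combined by the renormalized gate weights. All shapes and names here are illustrative assumptions, not DeepSeek's actual configuration.

```python
import numpy as np

def moe_layer(x, expert_weights, gate_weights, top_k=2):
    """Minimal top-k mixture-of-experts layer: route the token `x` to the
    k highest-scoring experts and sum their gate-weighted outputs."""
    logits = x @ gate_weights                 # one score per expert
    top = np.argsort(logits)[-top_k:]         # indices of the top-k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                  # renormalize over chosen experts
    return sum(w * (x @ expert_weights[i]) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
dim, n_experts = 8, 4
x = rng.standard_normal(dim)
experts = rng.standard_normal((n_experts, dim, dim))
gate = rng.standard_normal((dim, n_experts))
y = moe_layer(x, experts, gate)
print(y.shape)  # → (8,)
```

Because only k of the experts execute per token, total parameter count can grow without a proportional increase in per-token compute, which is the usual motivation for MoE decoders.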
Additionally, DeepSeek-OCR2 demonstrates stronger stability in practical applications. In tests on PDF batch processing and online log data, the repetition rate in recognition output decreased significantly, meaning the model delivers higher-quality, more logically structured output while maintaining low resource consumption.
Key points:
Dynamic semantic sorting: Through "visual causal flow" technology, DeepSeek-OCR2 breaks with the traditional fixed grid recognition order and reads content dynamically based on semantics.
Significant performance gains: On an authoritative benchmark, the new model's recognition score improved by 3.73%, and reading-order accuracy was markedly enhanced.
Efficient MoE architecture: The model uses an MoE decoder, achieving higher recognition accuracy and reliability without a proportional increase in computational load.


