Leading Chinese AI company DeepSeek has recently released a new visual encoder, DeepSeek OCR2, which it reports as a significant advance in document processing and image recognition. Rather than reading an image in a fixed, flat order as traditional visual models do, the model simulates the flexible scanning pattern of human vision.

DeepSeek researchers pointed out that human eyes shift focus flexibly according to content when observing objects. To replicate this behavior, DeepSeek OCR2 introduces a new architecture that replaces the traditional CLIP component with a lightweight language model architecture. This architecture uses "causal flow tokens" to reorganize and integrate visual information according to context, so the model "observes" an image in an order driven by the meaning of the content rather than by a fixed grid.
This approach improves understanding and also greatly improves efficiency. For the same image processing tasks, DeepSeek OCR2 requires only 256 to 1,120 visual tokens, more than 80% fewer than the 6,000 or more tokens typically consumed by similar systems. This high compression rate gives the model significant cost and speed advantages when processing long documents.
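
As a quick check on the "more than 80%" figure, the short Python sketch below computes the reduction at both ends of the reported token range. The function name `token_reduction` is a hypothetical helper introduced only for illustration, and the 6,000-token baseline is taken as the lower bound quoted above, so actual savings against heavier systems would be larger.

```python
def token_reduction(model_tokens: int, baseline_tokens: int) -> float:
    """Return the fraction of visual tokens saved relative to a baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 1.0 - model_tokens / baseline_tokens


# Assumption: 6,000 is the lower bound the article quotes for similar systems.
BASELINE_TOKENS = 6000

# The two ends of the 256 to 1,120 token range reported for DeepSeek OCR2.
for tokens in (256, 1120):
    saved = token_reduction(tokens, BASELINE_TOKENS)
    print(f"{tokens} tokens: {saved:.1%} fewer than {BASELINE_TOKENS}")
```

Even at the upper end of the range (1,120 tokens), the saving against a 6,000-token baseline is a little over 81%, which is consistent with the figure in the article; at the lower end (256 tokens) it is close to 96%.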

On the OmniDocBench benchmark, the model set a new record with a score of 91.09%, surpassing Gemini3Pro in document parsing performance. DeepSeek has made the model's code and weights publicly available. The research team believes that this architecture is an important step toward unified multimodal processing and could eventually enable deep integration of text, voice, and images within a single framework.
Key Points:
🚀 Outstanding Efficiency: DeepSeek OCR2 significantly reduces the visual tokens required for a single image, cutting token consumption by more than 80% compared to similar systems.
📑 Superior Performance: On the OmniDocBench benchmark, the model performed exceptionally well in document parsing and reading order recognition, with accuracy exceeding that of Gemini3Pro.
🧠 Architectural Innovation: By introducing "causal flow tokens" to reorganize visual information, the model moves from mechanical, fixed-order scanning to content-driven understanding.



