Five days after the release of DeepSeek-V4, which had a significant impact on the industry, DeepSeek officially launched a grayscale test of its multimodal image recognition feature, marking the entry of its multimodal capabilities into a practical deployment phase. The update adds an "Image Recognition Mode" entry to the input bar of the mobile and web versions, prominently labeled "Image Understanding Function in Internal Testing," completing a crucial shift from pure text and code to visual interaction.
Test data show that DeepSeek performs exceptionally well in basic visual understanding and scene description, generating highly accurate descriptions when identifying complex figures, environmental composition, and photographic detail. With "Thinking Mode" enabled, the model demonstrates deep logical reasoning, accurately deducing the artistic style and historical background of cultural relics from their visual characteristics. Its ability to extract text from images and judge scenes has also reached mainstream industry standards.
However, there is still room for improvement under extreme visual challenges. Tests show that the module's recognition rate is limited when processing interference images, such as fragmented or color-inverted pictures. In element-counting and complex graphical logic reasoning tasks, the model makes self-play-style reasoning attempts, but its accuracy and response efficiency still fall short. Moreover, its coverage of very recent product information remains constrained by the update cycle of its knowledge base.
Industry analysis indicates that this feature currently functions more like a visual understanding module attached to the main model, intended to validate the multimodal pipeline through grayscale testing. As DeepSeek rapidly iterates its visual patches, the competitive focus of domestic large models in native multimodality is shifting from "parameter scale" to "omni-scenario perception." This internal test not only fills a core functional gap for DeepSeek but also signals that its native multimodal flagship features are entering the final preparation stage.
