The Baidu Intelligent Cloud Qianfan team has officially launched Qianfan-VL, a new series of visual understanding models, and fully open-sourced it. The series comes in three sizes (3B, 8B, and 70B) and is aimed at enterprise-level multimodal applications. After deep optimization, the models demonstrate strong visual understanding capabilities.
Beyond solid general capabilities, Qianfan-VL has been specifically improved for high-frequency industry needs such as optical character recognition (OCR) and educational scenarios, so that it performs better in practical use. The models are built on open-source models, and all computation across the entire workflow was carried out on Baidu's self-developed Kunlun Chip P800, whose computing power allows the models to process complex data and algorithms efficiently.
The new series has three notable features. First, it comes in multiple sizes: the 3B, 8B, and 70B variants let enterprises and developers of different scales choose one that fits their application. Second, the 8B and 70B models support thinking and reasoning, which is invoked through special tokens and lets them handle complex chart understanding, visual reasoning, and math problem-solving tasks. Third, the series performs especially well in OCR and document understanding: it accurately recognizes handwritten text and complex layouts, and it can also extract structured information.
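As a rough illustration of how the size choice and the token-gated reasoning mode might fit together on the caller's side, here is a minimal Python sketch. The article names neither the actual special tokens nor any API, so `REASONING_TOKEN`, `build_prompt`, and the `Prompt` type are hypothetical placeholders rather than part of Qianfan-VL. The only facts taken from the article are the three sizes and that reasoning is available on the 8B and 70B models.

```python
from dataclasses import dataclass

# Hypothetical placeholder: the article does not name the real special token.
REASONING_TOKEN = "<reasoning-token-placeholder>"

# Sizes listed in the article; only 8B and 70B are described as reasoning-capable.
REASONING_CAPABLE = {"3B": False, "8B": True, "70B": True}


@dataclass
class Prompt:
    model_size: str  # which variant the prompt is intended for
    text: str        # prompt text, with the reasoning token appended if requested


def build_prompt(model_size: str, question: str, want_reasoning: bool = False) -> Prompt:
    """Build a prompt, appending the (assumed) reasoning token only where supported."""
    if model_size not in REASONING_CAPABLE:
        raise ValueError(f"unknown model size: {model_size!r}")
    if want_reasoning and not REASONING_CAPABLE[model_size]:
        raise ValueError(f"the {model_size} model is not described as reasoning-capable")
    text = f"{question} {REASONING_TOKEN}" if want_reasoning else question
    return Prompt(model_size=model_size, text=text)
```

Under these assumptions, `build_prompt("8B", "What trend does this chart show?", want_reasoning=True)` would yield a prompt ending in the placeholder token, while the same request against `"3B"` raises an error instead of silently dropping the reasoning mode.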
In benchmark tests, the Qianfan-VL series showed strong general capabilities as well as excellent performance on specific tasks. From visual understanding to question answering in specialized fields, the models achieved impressive accuracy across the tests. In OCR and document understanding in particular, their full-scenario recognition and complex document analysis capabilities give enterprise applications a high-precision solution.
Qianfan-VL's mathematical problem-solving capabilities are also worth mentioning. The 8B and 70B models perform especially well on complex reasoning tasks by combining visual information with external knowledge. In practical applications, they can extract key information and perform data analysis, helping enterprises make intelligent decisions.
The launch of Qianfan-VL marks a major breakthrough for Baidu in the field of visual understanding. We look forward to seeing it applied across industries, where it may set off a new wave of multimodal applications.