Baidu has reached a new milestone in the AIGC field, officially open-sourcing its latest visual understanding model, Qianfan-VL. The series comes in three sizes, 3B, 8B, and 70B parameters, suited to different application scenarios. Notably, training of the Qianfan-VL series relied entirely on Baidu's self-developed Kunlunxin P800 chip, demonstrating the strength of domestic chips in the field of artificial intelligence.

Qianfan-VL is a multimodal large model, capable of understanding both images and text; for example, it can analyze the data and trends in complex charts. Its core strengths are OCR (optical character recognition) and optimization for education scenarios. A user need only photograph an ID card, and the model automatically recognizes the name and ID number, delivering full-scenario text recognition: printed text, handwriting, and complex mathematical formulas can all be recognized, extracted, and converted into structured data.

In the education field, Qianfan-VL is positioned as a "super student": it can solve problems from a photo, perform geometric reasoning, and analyze functions. According to test results, the 70B version of Qianfan-VL scored 98.76 on the ScienceQA science question-answering benchmark, far exceeding competitors. It also stood out on the Chinese multimodal benchmark CCBench with a score of 80.98, demonstrating strong comprehension in the Chinese context.

The Kunlunxin P800 chip that powered Qianfan-VL's training has excellent power control, drawing 150W to 160W, which gives it clear advantages in energy consumption and heat dissipation in large-scale clusters. The P800's architecture separates the computing unit from the communication unit to improve chip utilization, and its "communication-computation integration" technology overlaps data transfer with computation, significantly improving model training performance.

The underlying architecture of Qianfan-VL integrates multiple industry-leading techniques and adopts an innovative "four-stage training pipeline," ensuring the model builds both a solid general-knowledge foundation and domain expertise during training. The entire Qianfan-VL series has been open-sourced on GitHub and Hugging Face for free use by enterprises and developers, and Baidu Intelligent Cloud's Qianfan platform also offers online trial and deployment services.

GitHub:

https://github.com/baidubce/Qianfan-VL

Hugging Face:

https://huggingface.co/baidu/Qianfan-VL-70B
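For readers who want to try the open-sourced checkpoints, here is a minimal Python sketch. It assumes the models load through the standard Hugging Face `transformers` interface with `trust_remote_code=True`, as many open-source vision-language models do, and that the 3B and 8B repos follow the same naming pattern as the 70B link above; verify both assumptions against the official model cards before use.

```python
# Sketch: locate and load a Qianfan-VL checkpoint from Hugging Face.
# Only the 70B repo id (baidu/Qianfan-VL-70B) is confirmed by the article;
# the 3B/8B ids and the transformers loading path are assumptions.

def repo_for(size: str) -> str:
    """Map a released model size to its (assumed) Hugging Face repo id."""
    sizes = {"3B", "8B", "70B"}  # the three released variants
    if size not in sizes:
        raise ValueError(f"unknown size {size!r}; expected one of {sorted(sizes)}")
    return f"baidu/Qianfan-VL-{size}"

def load_model(size: str = "8B"):
    """Lazily load tokenizer and model; larger sizes need substantial GPU memory."""
    from transformers import AutoModel, AutoTokenizer  # heavy import kept local
    repo = repo_for(size)
    tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
    model = AutoModel.from_pretrained(repo, trust_remote_code=True)
    return tokenizer, model

if __name__ == "__main__":
    print(repo_for("8B"))  # baidu/Qianfan-VL-8B
```

Keeping the `transformers` import inside `load_model` lets the lightweight repo-id helper be used (for example, to pass an id to the `huggingface_hub` CLI) without pulling in the full library.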

Key points:

🌟 Baidu's Qianfan-VL series model is officially open-sourced, including three versions: 3B, 8B, and 70B, suitable for different scenarios.  

🧠 The model has powerful multimodal capabilities, capable of recognizing text and images simultaneously, especially excelling in OCR and education fields.  

💡 The Kunlunxin P800 chip supports the model training, with low power consumption and high utilization efficiency, optimizing large-scale computing performance.