Zhipu AI has officially open-sourced its latest general-purpose vision model, GLM-4.1V-Thinking. Built on the GLM-4V architecture, it adds a chain-of-thought reasoning mechanism that significantly strengthens its handling of complex cognitive tasks. The model accepts multi-modal input, including images, videos, and documents, and performs well across scenarios such as long-video understanding, image question answering, academic problem solving, text recognition, document interpretation, visual grounding, GUI agents, and code generation, covering the application needs of a wide range of industries.
GLM-4.1V-9B-Thinking delivered strong results across 28 authoritative evaluations, achieving the best scores among models at the 10B scale on 23 of them and matching or surpassing the 72B-parameter Qwen2.5-VL on 18. The evaluations include benchmarks such as MMStar, MMMU-Pro, ChartQAPro, and OSWorld. With 9 billion parameters and efficient inference, the model runs on a single RTX 3090 GPU and is released under a license permitting free commercial use, greatly lowering the barrier for developers.
Zhipu AI stated that GLM-4.1V-Thinking strengthens cross-domain reasoning through reinforcement learning with curriculum sampling, showing deep, step-by-step thinking and problem-solving on complex tasks. The model is now available on HuggingFace for developers worldwide to try for free. Industry observers expect the release to accelerate the adoption of multi-modal AI in education, research, and business, marking another milestone on Zhipu AI's path toward general artificial intelligence.
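For developers who want to try the open weights on a single GPU, a minimal inference sketch using the Hugging Face transformers library might look like the following. The repository ID (THUDM/GLM-4.1V-9B-Thinking), the specific auto classes, and the chat-template message format are assumptions based on common transformers conventions for image-text-to-text models, not details confirmed by this announcement.

```python
# Minimal sketch: single-image question answering with GLM-4.1V-9B-Thinking.
# Assumptions: the HuggingFace repo ID and the standard AutoProcessor /
# AutoModelForImageTextToText chat-template flow apply to this model.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "THUDM/GLM-4.1V-9B-Thinking"  # assumed HuggingFace repo ID

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 keeps the 9B model within a 24 GB card
    device_map="auto",
    trust_remote_code=True,
)

# Build a chat-style multimodal prompt: one image plus a text question.
image = Image.open("chart.png")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What trend does this chart show?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

# Generate; a "thinking" model typically emits its reasoning before the answer.
output_ids = model.generate(**inputs, max_new_tokens=512)
new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))
```

In practice, one would consult the model card on HuggingFace for the exact prompt format and recommended generation settings, since chain-of-thought models often use dedicated markers to separate reasoning from the final answer.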