Zhipu has officially launched the open-source GLM-4.6V multimodal model series, comprising the base GLM-4.6V (106B total parameters, 12B activated) and the lightweight GLM-4.6V-Flash (9B). The new models extend the context window to 128k tokens and achieve SOTA visual understanding accuracy at their parameter scale. For the first time, Function Call capability is natively built into the vision model, enabling a complete "visual perception → executable action" workflow. API pricing is 50% lower than GLM-4.5V, at 1 yuan per million input tokens and 3 yuan per million output tokens; GLM-4.6V-Flash is completely free and comes integrated with the GLM Coding Plan and dedicated MCP tools, letting developers commercialize at zero cost.

Technical Highlights: 128k Multimodal Context + Native Visual Function Call

128k Multimodal Context: A single turn can take in 30 high-resolution images plus 80,000 words of text, and the model achieves SOTA results on long-video understanding benchmarks such as Video-MME and MMBench-Video.

Native Function Call: Visual signals are mapped directly to executable APIs without an additional projector, cutting latency by 37% and raising the success rate by 18% (see the sketch after this list).

Unified Encoding: Images, videos, and text share a single Transformer, with dynamic routing during inference, reducing GPU memory usage by 30%.
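To make the "visual perception → executable action" claim concrete, below is a minimal sketch of what a native Function Call request could look like, written against the existing zhipuai SDK conventions. The model identifier glm-4.6v, the add_calendar_event tool, and the assumption that the tools parameter works with the vision model are illustrative placeholders, not documented values; consult the official API docs for the exact interface.

```python
# Hedged sketch: an image input plus a tool definition, assuming GLM-4.6V keeps the
# OpenAI-compatible interface of the zhipuai SDK. "glm-4.6v" and the
# add_calendar_event tool are illustrative assumptions, not documented values.
from zhipuai import ZhipuAI

client = ZhipuAI(api_key="YOUR_API_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "add_calendar_event",  # hypothetical downstream API
        "description": "Create a calendar event from details found in an image",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "date": {"type": "string", "description": "ISO 8601 date"},
            },
            "required": ["title", "date"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-4.6v",  # assumed model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/poster.jpg"}},
            {"type": "text", "text": "Add this event to my calendar."},
        ],
    }],
    tools=tools,
)

# With native Function Call, the vision model can reply with a structured tool call
# instead of free text describing the image.
message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name, call.function.arguments)
```

The point is the shape of the exchange: the image goes in as a content part, and the answer can come back as a callable function rather than prose, with no vision-to-text-to-prompt detour.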

Pricing and Licensing: Lightweight Version Free, Base Version Halved

GLM-4.6V-Flash (9B): Free to call, with open weights and a commercial-use license; suited to edge devices and SaaS integration.

GLM-4.6V (106B-A12B): Input at 1 yuan per million tokens and output at 3 yuan per million tokens, approximately 1/4 of GPT-4V's price.

50% Price Reduction: Overall reduction of 50% compared to GLM-4.5V, with a free 1 million token trial quota included.

Developer Tools: MCP + Coding Plan, One-Click Access

Dedicated MCP (Model Context Protocol) tool: Hook GLM-4.6V into VS Code and Cursor with roughly 10 lines of code, enabling "UI selection → automatic front-end code generation."

GLM Coding Plan: Offers 50+ scenario templates (websites, mini-programs, scripts) that turn visual requirements into executable code with automatic deployment.

Online Playground: Supports drag-and-drop image uploads, real-time debugging of Function Calls, and one-click export of Python/Node.js call snippets.
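For illustration, here is roughly what an exported Python call snippet might look like, applied to the UI-to-code scenario above. This is a sketch that assumes GLM-4.6V is reachable through the existing zhipuai SDK with the same multimodal message format as GLM-4V; the model name and the mockup URL are placeholders.

```python
# Hedged sketch of an exported call snippet: one multimodal request, no tools.
# Assumes the zhipuai SDK message format used for GLM-4V carries over; the model
# name "glm-4.6v" and the mockup URL are placeholders.
from zhipuai import ZhipuAI

client = ZhipuAI(api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="glm-4.6v",  # assumed model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/mockup.png"}},
            {"type": "text", "text": "Generate responsive HTML/CSS for this UI mockup."},
        ],
    }],
)

print(response.choices[0].message.content)
```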

Benchmark Results: SOTA with the Same Parameters, Leading in Long Video Understanding

| Benchmark             | GLM-4.6V | GPT-4V | Gemini 1.5 Pro |
| --------------------- | -------- | ------ | -------------- |
| Video-MME             | 74.8     | 69.1   | 72.9           |
| MMBench-Video         | 82.1     | 78.4   | 80.6           |
| LongVideoBench (128k) | 65.3     | 58.2   | 62.1           |

Commercial Scenarios and Cases

Movie Previews: Directors upload character images and storyboards and automatically generate a 30-second preview video, with main-subject consistency above 96%.

Industrial Inspection: Capture equipment panel images → automatically identify abnormal areas → call a maintenance API to create work orders (a sketch of this loop follows the list).

Educational Materials: Teachers select textbook illustrations → generate 3D animations and voice explanations, and export to PPT with one click.
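A rough sketch of how the inspection loop could be wired up is shown below. Everything downstream of the model is hypothetical: create_work_order stands in for a real maintenance system, and the model identifier, tool schema, and tool-message format are assumptions based on the zhipuai SDK's existing function-calling conventions.

```python
# Hedged sketch of the inspection loop: send a panel image with a tool definition,
# execute the returned tool call against a stand-in maintenance API, then feed the
# result back for a final confirmation. All identifiers below are illustrative.
import json
from zhipuai import ZhipuAI

client = ZhipuAI(api_key="YOUR_API_KEY")

def create_work_order(component: str, severity: str) -> dict:
    # Placeholder: a real deployment would call the plant's maintenance system here.
    return {"order_id": "WO-0001", "component": component, "severity": severity}

tools = [{
    "type": "function",
    "function": {
        "name": "create_work_order",
        "description": "Open a maintenance work order for an anomaly on an equipment panel",
        "parameters": {
            "type": "object",
            "properties": {
                "component": {"type": "string"},
                "severity": {"type": "string", "enum": ["low", "medium", "high"]},
            },
            "required": ["component", "severity"],
        },
    },
}]

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/panel.jpg"}},
        {"type": "text", "text": "Inspect this panel and open a work order for any fault you find."},
    ],
}]

first = client.chat.completions.create(model="glm-4.6v", messages=messages, tools=tools)
reply = first.choices[0].message

if reply.tool_calls:
    call = reply.tool_calls[0]
    args = json.loads(call.function.arguments)
    result = create_work_order(**args)

    # Echo the assistant turn and return the tool result so the model can confirm
    # the action in natural language (OpenAI-style tool-message convention assumed).
    messages.append(reply.model_dump())
    messages.append({"role": "tool", "content": json.dumps(result), "tool_call_id": call.id})
    final = client.chat.completions.create(model="glm-4.6v", messages=messages, tools=tools)
    print(final.choices[0].message.content)
```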

Open Roadmap

Starting today: Weights, inference code, and MCP tools are open-sourced on GitHub and Hugging Face (search for GLM-4.6V).

Q1 2025: Release a 1M-token context version and an on-device INT4 quantized model that can run on laptop CPUs.

Q2 2025: Launch the "Visual Agent Store," letting developers list custom Function Calls and earn revenue on a per-call basis.

Industry Insights

While multimodal models are still largely at the stage of "understanding visuals," Zhipu has folded "understanding visuals + taking action" into one model: native Function Call integration lets images trigger APIs directly, skipping the roundabout vision → text → prompt chain. The free 9B version lowers the entry barrier, and halving the price of the 106B base model is a bid to quickly capture the visual-agent ecosystem. With 128k long-video understanding now practical, vertical scenarios such as film, industry, and education are likely to see large-scale adoption first. AIbase will continue to track its progress on on-device quantization and the Agent Store.