On April 2, Zhipu released its new coding model.
Core Breakthrough: Understand Visuals and Write Code
As a native multi-modal coding foundation model, it brings three core upgrades:
Native Multi-modal Perception: It can deeply understand images, videos, design drafts, and complex document layouts, and supports visual tool calls such as frame capture, screenshots, and web browsing (see the request sketch after this list).
Extended Context: The context window has been expanded to 200K tokens, allowing agents to comfortably handle large projects and long technical documents.
Performance Leap: On core benchmarks for multi-modal coding and GUI agents (Graphical User Interface agents), the model achieves leading results at a smaller size, while logical reasoning in pure-text scenarios does not regress.
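
To make the visual-input idea concrete, here is a minimal sketch of sending a design screenshot to a multimodal chat endpoint. The endpoint URL, model id, and the OpenAI-style content-array schema are all assumptions for illustration, not the provider's confirmed API; substitute the values from the actual documentation.

```python
import base64
import os
import requests

# Hypothetical endpoint and model id -- placeholders, not real values.
API_URL = "https://api.example.com/v1/chat/completions"
MODEL = "multimodal-coder"


def ask_about_image(image_path: str, question: str) -> str:
    """Send a local image plus a text prompt, assuming the widely used
    OpenAI-style chat-completions schema for multimodal input."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")

    payload = {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": question},
            ],
        }],
    }
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
        json=payload,
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


# Example: turn a design draft directly into front-end code.
print(ask_about_image("design_draft.png",
                      "Generate the HTML/CSS for this design draft."))
```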
Typical Scenarios: From "Sketch" to "Final Product" in Seconds
The addition of native visual understanding unlocks several typical scenarios:
Front-end Replication: Just send a sketch, a design draft screenshot, or a screen recording, and the model can understand the layout, color scheme, and interaction logic, generating a complete and functional front-end project that accurately replicates visual details.
GUI Autonomous Exploration: Combined with agent frameworks, the model can observe the screen, decide the next action, and operate the interface on its own to complete tasks (see the loop sketch after this list).
Interactive Editing: It supports adding, removing, or modifying modules, text, or layouts directly through conversation, enabling visual code iteration.
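
The autonomous-exploration scenario above boils down to an observe-decide-act loop. The sketch below shows only that loop shape; the function names, the Action schema, and the stubbed model reply are hypothetical placeholders, not the product's actual interface.

```python
import json
from dataclasses import dataclass


@dataclass
class Action:
    kind: str         # e.g. "click", "type", "done" (assumed action set)
    target: str = ""  # element description or text to type


def capture_screenshot() -> bytes:
    """Stub: grab the current screen (a real agent would use an
    automation backend here)."""
    return b"<png bytes>"


def query_model(screenshot: bytes, goal: str, history: list) -> Action:
    """Stub: send the screenshot + goal to the model and parse its reply.
    The JSON action format is an assumption for illustration."""
    reply = '{"kind": "done", "target": ""}'  # placeholder model response
    return Action(**json.loads(reply))


def execute(action: Action) -> None:
    """Stub: perform the click/keystroke with an automation backend."""
    print(f"executing {action.kind} on {action.target!r}")


def explore(goal: str, max_steps: int = 20) -> None:
    """Observe the screen, ask the model for the next action, act,
    and repeat until the model signals completion."""
    history: list = []
    for _ in range(max_steps):
        action = query_model(capture_screenshot(), goal, history)
        if action.kind == "done":
            return
        execute(action)
        history.append(action)


explore("Open the settings page and enable dark mode")
```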
Empowering "Lobster": AutoClaw Undergoes Visual Evolution
After integrating this model into Zhipu's self-developed agent AutoClaw ("Lobster"), the agent gained a visual upgrade:
Deep Interpretation of Charts: Lobster can now directly read candlestick (K-line) charts, valuation-range charts, and broker research reports.
Efficient Output: It collects data from four sources in parallel within 60 seconds, automatically generating professional analytical reports or PPTs rich in charts and text (a concurrency sketch follows).
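
As a rough illustration of "four sources in parallel within 60 seconds," here is a minimal fan-out sketch using Python's asyncio. The source names and the stub fetcher are hypothetical stand-ins, not Lobster's actual pipeline; the point is only that the four fetches run concurrently under one hard deadline.

```python
import asyncio


async def fetch(source: str) -> tuple[str, str]:
    """Stub fetcher: replace with a real HTTP or database call per source."""
    await asyncio.sleep(0.1)  # simulate network latency
    return source, f"data from {source}"


async def collect_all() -> dict[str, str]:
    # Hypothetical source names, for illustration only.
    sources = ["market_quotes", "k_line_history",
               "broker_reports", "news_feed"]
    # gather() runs the four fetches concurrently; wait_for enforces a
    # single 60-second budget across all of them.
    results = await asyncio.wait_for(
        asyncio.gather(*(fetch(s) for s in sources)),
        timeout=60,
    )
    return dict(results)


print(asyncio.run(collect_all()))
```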
Industry Insight: Programming Is No Longer "Feeling in the Dark"
With the release of this model, programming no longer has to rely on text-only guesswork: a model that can see designs, screens, and documents can build and iterate against what users actually experience.