Zhipu has officially open-sourced its next-generation image generation model, GLM-Image. The model's biggest breakthrough is that it is the first SOTA (state-of-the-art) multimodal model to complete the entire workflow, from data preprocessing to large-scale training, on a domestic-chip computing platform.

GLM-Image adopts an innovative "autoregressive + diffusion decoder" hybrid architecture, achieving deep integration between image generation and language modeling. This design lets the model excel at "knowledge-intensive" generation tasks, accurately following global instructions while rendering local details, and addresses long-standing challenges in AI image generation such as poster layout, slide (PPT) creation, and complex scientific illustrations.


GLM-Image supports both text-to-image and image-to-image generation within a single model.

  • Text-to-image: Generate high-detail images based on text descriptions, performing especially well in information-dense scenarios.
  • Image-to-image: Supports various tasks including image editing, style transfer, multi-subject consistency, and identity-preserving generation of people and objects.

In terms of technical benchmarks, GLM-Image demonstrates strong Chinese understanding and rendering. It ranks first among open-source models on multiple complex visual-text generation leaderboards, excelling in particular at the challenging task of generating Chinese characters. Additionally, the model natively supports arbitrary-aspect-ratio image generation at resolutions from 1024 to 2048 pixels per side without additional training, adapting to various resolutions automatically.
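The "arbitrary aspect ratio within 1024–2048" behavior can be illustrated with a small helper that fits a requested aspect ratio into that range. This is a sketch under stated assumptions: the function name, the clamping strategy, and the snap-to-multiple-of-64 rule (a common requirement for diffusion decoders) are illustrative, not part of the released model.

```python
def fit_resolution(aspect_ratio: float, min_side: int = 1024,
                   max_side: int = 2048, multiple: int = 64) -> tuple[int, int]:
    """Pick a (width, height) close to the requested aspect ratio whose
    sides stay within [min_side, max_side], snapped to a multiple.
    The snapping rule is an assumption for illustration only."""
    if aspect_ratio >= 1.0:
        # Landscape or square: maximize width, derive height.
        width = float(max_side)
        height = width / aspect_ratio
        if height < min_side:  # too wide: clamp the ratio
            height = float(min_side)
            width = min(height * aspect_ratio, max_side)
    else:
        # Portrait: maximize height, derive width.
        height = float(max_side)
        width = height * aspect_ratio
        if width < min_side:  # too tall: clamp the ratio
            width = float(min_side)
            height = min(width / aspect_ratio, max_side)

    def snap(v: float) -> int:
        return max(min_side, min(max_side, round(v / multiple) * multiple))

    return snap(width), snap(height)
```

For example, a 16:9 request yields 2048×1152, while a square request yields 2048×2048; extreme ratios are clamped so both sides stay in range.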

Currently, GLM-Image has been fully open-sourced on platforms such as GitHub and Hugging Face. To lower the usage barrier, the API call price is as low as 0.1 yuan per image. Zhipu stated that they will also launch a new version optimized for speed in the future, further improving commercial cost-effectiveness.
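As a sketch of what an API call at that price point might look like, the snippet below assembles a JSON request body for a text-to-image call. The field names, size format, and model identifier are all assumptions for illustration; consult the official API documentation for the real schema.

```python
import json


def build_glm_image_request(prompt: str, width: int = 1024,
                            height: int = 1024,
                            model: str = "glm-image") -> str:
    """Build a JSON body for a hypothetical text-to-image API call.
    Every field name here is illustrative, not a documented API detail."""
    payload = {
        "model": model,                 # assumed model identifier
        "prompt": prompt,
        "size": f"{width}x{height}",    # assumed "WxH" size format
    }
    return json.dumps(payload, ensure_ascii=False)


# Example: a poster-style request at a 1:2 portrait resolution.
body = build_glm_image_request("A poster with dense Chinese text", 1024, 2048)
```

The body would then be POSTed to the provider's image-generation endpoint with the usual authorization header.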


  • GitHub: https://github.com/zai-org/GLM-Image

  • Hugging Face: https://huggingface.co/zai-org/GLM-Image

Key points:

  • 🇨🇳 Fully domestic, self-developed stack: the complete training workflow ran on Huawei Ascend Atlas 800T A2 hardware and the MindSpore framework, verifying the feasibility of training top-tier models on domestic computing power.

  • 🎨 Breakthrough in text-image fusion: with its hybrid architecture, the model ranked first among open-source models on LongText-Bench and other long-text rendering benchmarks, significantly improving the accuracy of Chinese-character and complex text-in-image generation.

  • 💰 Highly cost-effective open source: the model supports adaptive image generation across resolutions and is offered to creators at an extremely low API price, aiming to popularize domestic generative AI technology.