Qwen-Image is a 20-billion-parameter multimodal diffusion transformer (MMDiT) and the first image generation foundation model in the Qwen series; it has been released as open source. The model makes notable advances in complex text rendering and precise image editing, and it performs strongly on multiple public benchmarks for image generation and editing.

Qwen-Image stands out for its text rendering, which supports multi-line layouts, paragraph-level text, and fine-grained detail, with high-fidelity output in both English and Chinese. For example, when rendering an anime scene in the style of Studio Ghibli, the model accurately presents shop signs, character poses, and expressions, and even the small text on wine barrels remains legible. Similarly, when rendering Chinese couplets, Qwen-Image accurately draws the left and right scrolls and the horizontal scroll above them, and it integrates calligraphic styling into the lettering.


Qwen-Image also performs well on English text. Whether the text appears in a bookstore window or in a complex infographic, the model generates the content accurately and integrates it into the overall composition, so the result is both visually coherent and informative. It also stays accurate and legible when the text is smaller or denser, for example when generating a long passage on a sheet of paper held in a hand, or a complete handwritten paragraph on a glass plate.

Beyond text rendering, Qwen-Image is also strong at image editing. Through an enhanced multi-task training paradigm, the model keeps the edited image consistent with the original while supporting operations such as style transfer, object addition and removal, detail enhancement, and human pose adjustment. This lets non-specialist users approach professional-level editing results and lowers the technical barrier to visual content creation.

Qwen-Image was evaluated on multiple public benchmarks. On general image generation benchmarks such as GenEval, DPG, and OneIG-Bench, and on image editing benchmarks such as GEdit, ImgEdit, and GSO, it achieves state-of-the-art performance. The gap is largest in Chinese text rendering, where Qwen-Image surpasses existing state-of-the-art models by a wide margin.

Qwen-Image is currently available as open source on ModelScope, Hugging Face, and GitHub, together with a detailed technical report and an online demo. Users can also visit Qwen Chat (chat.qwen.ai) and select the "image generation" feature to try the model directly.

ModelScope: https://modelscope.cn/models/Qwen/Qwen-Image

Hugging Face: https://huggingface.co/Qwen/Qwen-Image

GitHub: https://github.com/QwenLM/Qwen-Image

Technical report: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/Qwen_Image.pdf

Demo: https://modelscope.cn/aigc/imageGeneration?tab=advanced
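
For readers who want to try the open weights locally rather than through the demo, the following is a minimal sketch and not an official recipe. It assumes the Hugging Face repository listed above can be loaded through the `diffusers` library's generic `DiffusionPipeline.from_pretrained` entry point, that `torch` and `diffusers` are installed, and that enough GPU memory is available for a model of this size. The helper `build_text_prompt`, the function `generate`, the default step count, and the output file name are hypothetical choices made for illustration.

```python
MODEL_ID = "Qwen/Qwen-Image"  # repository name taken from the Hugging Face link above


def build_text_prompt(scene: str, text: str) -> str:
    """Hypothetical helper: put the text to be rendered in quotes inside the prompt."""
    return f'{scene}, with a sign that reads "{text}"'


def generate(prompt: str, out_path: str = "qwen_image_sample.png", steps: int = 50) -> str:
    """Load the pipeline, render one image for the prompt, and save it to out_path."""
    # Heavy imports stay local so this file can be imported without the GPU stack.
    import torch
    from diffusers import DiffusionPipeline

    use_cuda = torch.cuda.is_available()
    pipe = DiffusionPipeline.from_pretrained(
        MODEL_ID,
        # Assumption: bfloat16 on GPU to reduce memory; float32 fallback on CPU.
        torch_dtype=torch.bfloat16 if use_cuda else torch.float32,
    )
    pipe = pipe.to("cuda" if use_cuda else "cpu")

    # Pipelines in diffusers return an object whose .images field is a list of PIL images.
    image = pipe(prompt=prompt, num_inference_steps=steps).images[0]
    image.save(out_path)
    return out_path
```

Calling `generate(build_text_prompt("A Ghibli-style street with a small bookstore", "Qwen Books"))` would download the weights on first use and write a PNG file; the recommended settings may differ, so the model card on Hugging Face is the place to check them.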