A Tencent research team has released a new multimodal AI model, X-Omni, which delivers notable advances in image generation and understanding, particularly in long-text rendering, addressing the accuracy problems traditional AI models have when generating text inside images.
AI image generation models have long struggled with text rendering. Traditional discrete autoregressive models build an image by generating pixels or tokens one at a time, so errors accumulate and show up as spelling mistakes, missing characters, or distorted glyphs. Many research teams have therefore turned to diffusion models or hybrid architectures, believing that purely autoregressive methods cannot handle high-quality text rendering.
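To make the failure mode concrete, here is a minimal sketch of the generic sampling loop such discrete autoregressive models use. `model` stands for any decoder-only transformer that returns next-token logits; nothing here is X-Omni's actual code.

```python
import torch

def sample_image_tokens(model, prompt_ids, num_image_tokens, temperature=1.0):
    """Generic autoregressive sampling loop: each new image token is drawn
    conditioned on everything generated so far, so an early mistake stays in
    the context and biases every later token (the cumulative-error problem)."""
    tokens = prompt_ids.clone()                           # start from the text prompt
    for _ in range(num_image_tokens):
        logits = model(tokens)[:, -1, :]                  # next-token logits only
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=-1)  # becomes permanent context
    return tokens[:, prompt_ids.shape[-1]:]               # return just the image tokens
```

Because each sampled token is frozen into the context, there is no later step at which a wrongly rendered character can be revised, which is why long text strings are especially fragile.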
X-Omni adopts a reinforcement learning framework to optimize generation quality. The system combines a multidimensional reward signal drawn from the aesthetic quality evaluator HPSv2, the comprehensive reward model Unified Reward, the vision-language model Qwen2.5-VL-32B, and the text recognition tools GOT-OCR2.0 and PaddleOCR. Together these components score the model's generated images and feed that signal back during reinforcement learning, significantly improving the stability and accuracy of its outputs.
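The article does not spell out how the signals are combined, but the mechanism amounts to aggregating several scorers into one scalar reward per generated image. The sketch below is a hedged illustration: the scorer wrappers, dictionary keys, weights, and aggregation rule are all assumptions, not the paper's specification.

```python
import difflib
from typing import Callable, Dict

def text_similarity(a: str, b: str) -> float:
    """Normalized string similarity between OCR output and the requested text."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def composite_reward(image, prompt, target_text,
                     scorers: Dict[str, Callable], weights: Dict[str, float]) -> float:
    """Hypothetical weighted combination of the reward signals named above:
    'hpsv2'   -> HPSv2 aesthetic score,
    'unified' -> Unified Reward overall-quality score,
    'qwen_vl' -> Qwen2.5-VL-32B prompt-following judgment,
    'ocr'     -> GOT-OCR2.0 / PaddleOCR text extracted from the image."""
    reward = weights["hpsv2"] * scorers["hpsv2"](image, prompt)
    reward += weights["unified"] * scorers["unified"](image, prompt)
    reward += weights["qwen_vl"] * scorers["qwen_vl"](image, prompt)
    # Text-rendering term: OCR the generated image and compare the result
    # with the text the prompt asked to render.
    reward += weights["ocr"] * text_similarity(scorers["ocr"](image), target_text)
    return reward
```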
The core advantage of X-Omni lies in achieving unified modeling of image generation and understanding. Traditional methods usually handle these two tasks separately, requiring different model architectures and training strategies. X-Omni uses the semantic image tokenizer SigLIP-VQ to convert visual information into semantic tokens that language models can process, allowing the same model to both generate high-quality images and accurately understand their content.
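Conceptually, unification means both tasks flow through a single token stream and a single transformer. The sketch below assumes a hypothetical `vq_tokenizer` object with `encode`/`decode` methods standing in for SigLIP-VQ, and a decoder-only `model` with a `generate` method; these names are illustrative, not X-Omni's real API.

```python
import torch

def understand(model, vq_tokenizer, image, question_ids):
    """Understanding: encode the image into discrete semantic tokens (the role
    SigLIP-VQ plays in X-Omni) and let the same language model answer a
    question conditioned on them."""
    image_tokens = vq_tokenizer.encode(image)               # [B, N] discrete token ids
    context = torch.cat([image_tokens, question_ids], dim=-1)
    return model.generate(context)                          # answer token ids

def generate_image(model, vq_tokenizer, prompt_ids, num_image_tokens):
    """Generation: the same model emits semantic image tokens from a text
    prompt, and the tokenizer's decoder maps them back to pixels."""
    image_tokens = model.generate(prompt_ids, max_new_tokens=num_image_tokens)
    return vq_tokenizer.decode(image_tokens)                # pixel image
```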
In evaluations, X-Omni performs strongly across multiple benchmarks. On text rendering tasks it maintains high accuracy in both English and Chinese, and it surpasses several mainstream models, including GPT-4o, on long-text rendering. On text-to-image generation it follows complex instructions precisely and produces high-quality images that match the prompt. On image understanding benchmarks such as OCRBench, it also outperforms specialized vision understanding models such as LLaVA-OneVision.
Notably, X-Omni maintains high-quality generation without classifier-free guidance. Classifier-free guidance is a widely used inference technique that improves a model's adherence to the prompt, but it adds computational overhead. That X-Omni performs well without this external aid suggests its visual and language components are already tightly aligned.
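For context, standard classifier-free guidance combines two forward passes per step, one conditioned on the prompt and one unconditioned, which is where the extra cost comes from; X-Omni reports it can skip this. The function below is a generic illustration of the technique, not code from the paper.

```python
def cfg_logits(cond_logits, uncond_logits, guidance_scale: float = 1.5):
    """Standard classifier-free guidance applied to next-token logits:
    extrapolate from the unconditional prediction toward the conditional one.
    A scale of 1.0 is equivalent to no guidance; any larger scale requires a
    second (unconditional) forward pass at every sampling step."""
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)
```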
From an architectural perspective, X-Omni's results demonstrate the potential of discrete autoregressive models for multimodal tasks. By combining reinforcement learning optimization with a unified semantic representation, the model overcomes the limitations of traditional autoregressive methods and points to a new technical path for multimodal AI.
The release of X-Omni marks a new stage for AI in image generation and understanding. The model not only improves on technical benchmarks but, more importantly, validates the feasibility of unified multimodal modeling, laying a foundation for more capable and efficient AI systems. As the technology matures, users will be able to create visual works containing complex text more conveniently through natural language, improving the efficiency and quality of AI-assisted content creation.
Paper address: https://arxiv.org/pdf/2507.22058