The MMLab team at The Chinese University of Hong Kong, working with researchers from Beihang University, Shanghai Jiao Tong University, and other institutions, has introduced a system for generating and editing structured images. The launch marks a significant step forward for AI in generating charts and data visualizations. Existing generation models such as FLUX.1 and GPT-Image perform well on natural images, but they often make errors on structured images such as charts and formulas, where the accuracy and logical consistency of the rendered data are not guaranteed.
According to the team's analysis, generating and editing structured images requires three core capabilities: accurate text rendering, complex layout planning, and multimodal reasoning. These capabilities are crucial for education, research, and office work. Current methods fall short of these requirements because existing datasets focus mainly on natural images and lack strictly aligned structured samples.
To break through this bottleneck, the research team worked on three fronts: data, models, and evaluation. For data, they built a dataset of 1.3 million code-aligned structured samples. Each image is generated from executable drawing code, and each sample carries detailed reasoning-chain annotations. For the model, the team designed a lightweight vision-language model (VLM) integration scheme that supports generating both structured and natural images. For evaluation, they introduced a new benchmark, StructBench, and a metric, StructScore, to check the accuracy of generated images.
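To make the idea of a "code-aligned" sample concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the team's actual pipeline: the names `DRAWING_CODE`, `draw`, and `build_sample` are hypothetical, the record layout is invented for this example, and a hand-written SVG renderer stands in for whatever plotting library the authors used. The point it shows is that the image is produced by executing stored drawing code, so the code, the data, and the rendered image stay exactly in sync.

```python
import json

# Hypothetical drawing program, stored as text so it can be kept with the image.
# Executing it defines draw(), which renders a bar chart as an SVG string.
DRAWING_CODE = '''
def draw(values, labels, width=320, height=200):
    bar_w = width / len(values)
    top = max(values)
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">']
    for i, (v, label) in enumerate(zip(values, labels)):
        h = (height - 20) * v / top  # leave 20 px at the bottom for labels
        parts.append(f'<rect x="{i * bar_w + 4:.1f}" y="{height - 20 - h:.1f}" '
                     f'width="{bar_w - 8:.1f}" height="{h:.1f}"/>')
        parts.append(f'<text x="{i * bar_w + 4:.1f}" y="{height - 4}">{label}</text>')
    parts.append("</svg>")
    return "\\n".join(parts)
'''


def build_sample(code, values, labels, instruction):
    """Run the drawing code and bundle code, inputs, image, and text (assumed layout)."""
    namespace = {}
    exec(code, namespace)  # executing the stored program defines draw()
    image = namespace["draw"](values, labels)
    return {
        "code": code,
        "args": {"values": values, "labels": labels},
        "image_svg": image,
        "instruction": instruction,
    }


sample = build_sample(
    DRAWING_CODE,
    [3, 7, 5],
    ["A", "B", "C"],
    "Bar chart comparing categories A, B, and C.",
)
print(json.dumps(sample["args"]))
print(sample["image_svg"].count("<rect"), "bars rendered")
```

Because the image is derived from the code rather than scraped, an edit can be expressed as a change to the code or its arguments, and the correct target image can be re-rendered by running it again.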
With these contributions, the research team improved AI's ability to understand and generate structured images and showed significant advantages in comparisons with multiple open-source models. The system fills a gap in structured visual generation and provides technical support for the further development of multimodal AI. The team expects the tool to find wide use in education, research, and office settings, helping AI become an effective productivity tool.
Paper link: https://arxiv.org/pdf/2510.05091