Recently, GPT Image2 caused a sensation on social media with its strikingly impressive generation results. As the project gained popularity, the low-profile team behind it gradually came into the spotlight. According to public information, the core team consists of only 13 people, who completely rewrote the underlying architecture in just four months. Although research lead Chen Boyuan did not reveal the specific technical approach, he described the new model as a "GPT for the image field," signaling a significant leap in generality.
As the team's key figure, Chen Boyuan has had a rather legendary career. During his doctoral studies he proposed innovative paradigms such as "Diffusion Forcing," and at Google he participated in developing the instruction-tuning technology later adopted by Gemini 2.0. Interestingly, he did not even know Python when he joined a science camp in high school. After joining OpenAI, he not only took charge of all training work for the GPT image model but was also a core member of the Sora video-generation team. In one demonstration, he showcased the model's strong text-rendering capabilities by generating posters with accurately rendered Chinese, Korean, and Bengali text.

Beyond text rendering, GPT Image2 has also reached a new level in world knowledge and instruction following. This module, led by Dr. Jianfeng Wang of the University of Science and Technology of China, addresses a long-standing pain point of image-generation AI: earlier models would, for example, always draw clocks showing 10:10, whereas the new model can accurately interpret any specified time as well as complex spatial-layout instructions. He said the model is closing the gap between users' creative intent and the final output.
On the productivity-tool side, Yuguang Yang of Zhejiang University's Zhuyuan College demonstrated converting lengthy papers into high-fidelity slide decks and infographics with one click. This capability rests on the team's deep integration of multimodal understanding, a Mixture of Experts (MoE) architecture, and long-range guidance technology.
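For readers unfamiliar with the term, the sketch below illustrates the general idea behind MoE routing: a lightweight gate scores a set of expert sub-networks and only the top-k experts run for a given input, keeping compute sparse. This is a generic toy illustration under assumed shapes and a made-up gate; the article does not disclose GPT Image2's actual architecture, and every name and number here is hypothetical.

```python
# Toy illustration of Mixture-of-Experts (MoE) top-k routing.
# Assumption: nothing here reflects GPT Image2's real design.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x, experts, gate_weights, k=2):
    """Route input x to the top-k experts by gate score, mix their outputs."""
    # Gate: one score per expert (here a simple dot product with x).
    scores = [sum(w * xi for w, xi in zip(ws, x)) for ws in gate_weights]
    probs = softmax(scores)
    # Keep only the k highest-scoring experts (sparse activation).
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    # Combine the selected experts' outputs, weighted by renormalized gate.
    out = [0.0] * len(x)
    for i in top:
        y = experts[i](x)
        for d in range(len(x)):
            out[d] += (probs[i] / norm) * y[d]
    return out, top

# Usage: four "experts" that merely scale the input by different factors.
experts = [lambda x, s=s: [s * xi for xi in x] for s in (0.5, 1.0, 2.0, 3.0)]
gate_weights = [[0.1, 0.0], [0.0, 0.2], [0.3, 0.1], [0.0, -0.5]]
y, chosen = moe_forward([1.0, 2.0], experts, gate_weights, k=2)
```

The key design point is that only `k` of the experts execute per input, so model capacity (total experts) can grow without a proportional increase in per-input compute.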
From the initial DALL-E to today's GPT Image2