At the recently concluded International Conference on Machine Learning (ICML), Kuaishou, in partnership with Shanghai Jiao Tong University, introduced Orthus, a multimodal model for both generation and understanding. Built on an autoregressive Transformer architecture, the model converts seamlessly between text and images, demonstrates strong generation capabilities, and has now been officially open-sourced.
The most notable features of Orthus are its computational efficiency and strong learning ability. The authors report that, with minimal computational resources, Orthus surpasses existing unified understanding-and-generation models such as Chameleon and Show-o on multiple image-understanding benchmarks. On the GenEval benchmark for text-to-image generation, Orthus also performs strongly, outscoring SDXL, a diffusion model designed specifically for that task.
Beyond text-image interaction, the model also shows promise in applications such as image editing and web-page generation. Architecturally, Orthus uses an autoregressive Transformer as its backbone, equipped with modality-specific generation heads: one for producing text tokens and one for producing image features. This design decouples the modeling of fine-grained image detail from the modeling of text, letting the backbone focus on the complex cross-modal relationships between text and images.
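To make that division of labor concrete, here is a minimal, hedged sketch of such a two-headed design in PyTorch. All class names, dimensions, and the plain-MLP stand-in for the image head are illustrative assumptions, not the released Orthus code.

```python
import torch
import torch.nn as nn

class OrthusStyleModel(nn.Module):
    """Illustrative two-headed model: shared backbone, per-modality heads."""

    def __init__(self, vocab_size=32000, d_model=1024, n_layers=12, n_heads=16):
        super().__init__()
        # Shared autoregressive Transformer backbone over a unified sequence
        # of text-token embeddings and continuous image-patch features.
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True, norm_first=True
        )
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # Text head: ordinary language-model logits over the vocabulary.
        self.text_head = nn.Linear(d_model, vocab_size)
        # Image head: predicts a continuous visual feature from the hidden
        # state (in the paper this role is played by a diffusion-style head;
        # a plain MLP stands in for it in this sketch).
        self.image_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )

    def forward(self, embeddings, causal_mask):
        hidden = self.backbone(embeddings, mask=causal_mask)
        return self.text_head(hidden), self.image_head(hidden)

# Usage on a pre-embedded mixed-modal sequence of length 16:
model = OrthusStyleModel()
seq = torch.randn(2, 16, 1024)
mask = nn.Transformer.generate_square_subsequent_mask(16)
text_logits, image_feats = model(seq, mask)
```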
Concretely, Orthus comprises several core components: a text tokenizer, a visual autoencoder, and two modality-specific embedding modules. These map text and image inputs into a unified representation space, which makes the backbone more efficient at modeling inter-modal dependencies. During inference, the model autoregressively generates either the next text token or the next image feature, switching between the two according to special modality tokens, which gives it considerable flexibility.
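That switching behavior can be sketched as a single decoding step, reusing the OrthusStyleModel sketch above. The special-token id and the mode flag below are hypothetical stand-ins for however the real model marks image spans.

```python
import torch

BOI_TOKEN = 32001  # hypothetical "begin-of-image" marker id

@torch.no_grad()
def generate_step(model, embeddings, causal_mask, in_image_mode):
    """One autoregressive step: emit a text token or an image feature."""
    text_logits, image_feats = model(embeddings, causal_mask)
    if in_image_mode:
        # Inside an image span: return the continuous feature predicted at
        # the last position; the visual autoencoder later decodes the
        # accumulated features back into pixels.
        return image_feats[:, -1]
    # Otherwise decode the next text token like a standard language model.
    next_token = text_logits[:, -1].argmax(dim=-1)
    # Entering image mode would be triggered when next_token == BOI_TOKEN.
    return next_token
```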
Through these design choices, Orthus avoids the mismatch between end-to-end diffusion modeling and the autoregressive mechanism, while also sidestepping the information loss incurred by discretizing images into tokens. The model can be viewed as a successful extension of Kaiming He's MAR work on image generation to the multimodal domain.
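Since the article positions Orthus as a multimodal extension of MAR, a rough sketch of the core MAR idea may help: a small diffusion head denoises a continuous feature conditioned on the backbone's hidden state, so no image-token quantization is needed. The Euler-style update and the concatenation-based conditioning below are simplifying assumptions, not the actual sampler.

```python
import torch

@torch.no_grad()
def sample_image_feature(diffusion_head, condition, n_steps=50, dim=1024):
    """Denoise a continuous image feature conditioned on a backbone state."""
    x = torch.randn(condition.shape[0], dim)  # start from pure noise
    for t in reversed(range(1, n_steps + 1)):
        t_embed = torch.full((condition.shape[0], 1), t / n_steps)
        # The head predicts the noise component given the noisy feature,
        # the (scalar) timestep, and the conditioning hidden state.
        eps = diffusion_head(torch.cat([x, t_embed, condition], dim=-1))
        x = x - eps / n_steps  # crude Euler-style denoising update
    return x  # continuous feature for the visual autoencoder to decode

# Example with a toy head whose input is [feature | timestep | condition]:
head = torch.nn.Linear(1024 + 1 + 1024, 1024)
feature = sample_image_feature(head, condition=torch.randn(2, 1024))
```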
The collaboration between Kuaishou and Shanghai Jiao Tong University opens new possibilities for multimodal generative models and merits attention from both industry and academia.