Emu3.5 is a native multimodal model developed by the Beijing Academy of Artificial Intelligence (BAAI). It jointly predicts the next state across vision and language, enabling coherent world modeling and generation. After end-to-end pre-training and large-scale reinforcement-learning post-training, it demonstrates strong performance on multimodal tasks.
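The key idea behind joint next-state prediction is that text tokens and discretized vision tokens live in one shared vocabulary, so a single autoregressive model can emit the next token regardless of modality. The toy sketch below illustrates only that formulation; the vocabulary sizes, the `vision_token` helper, and the mean-pooling stand-in for the transformer are all hypothetical and not part of the actual Emu3.5 implementation.

```python
import numpy as np

# Hypothetical sizes: a text vocabulary plus a visual codebook
# (e.g. from a VQ-style image tokenizer) merged into one token space.
TEXT_VOCAB = 1000
VISION_VOCAB = 8192
VOCAB = TEXT_VOCAB + VISION_VOCAB

def vision_token(code: int) -> int:
    """Map a visual codebook index into the shared token space."""
    return TEXT_VOCAB + code

# An interleaved sequence: text tokens, then image tokens, then text again.
sequence = [12, 57, 3, vision_token(401), vision_token(77), 9]

rng = np.random.default_rng(0)
embed = rng.normal(size=(VOCAB, 16))      # toy embedding table
out_proj = rng.normal(size=(16, VOCAB))   # toy output head

def next_token_logits(seq):
    """Mean-pool embeddings as a stand-in for a transformer, then project."""
    h = embed[seq].mean(axis=0)
    return h @ out_proj

logits = next_token_logits(sequence)
pred = int(np.argmax(logits))
# Because the output space is shared, the prediction may be text OR vision,
# which is what lets one model continue a sequence in either modality.
modality = "text" if pred < TEXT_VOCAB else "vision"
```

In the real model, the mean-pooling placeholder would be a full transformer decoder, but the shared-vocabulary prediction step is the part that makes generation "native" across modalities.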
Multimodal
Transformers