On December 4, the Beijing Zhiyuan Institute of Artificial Intelligence officially released its new multimodal large model Emu3.5, hailed as "AI that truly understands the physical world." Unlike previous image, video, and text models that operated independently, Emu3.5 achieves unified world-level modeling for the first time, evolving AI from "being able to draw and write" to truly "understanding the world."


Image source note: The image is AI-generated, and the image licensing service provider is Midjourney.

The fatal weakness of traditional AI: no understanding of physics or causality  

Most image generation models in the past could produce realistic images but lacked a deep understanding of real-world laws: they have no sense that objects do not fly up without reason, and gravity, collisions, and motion paths remain complete "black boxes" to them. Even top video generation models often suffer from abrupt changes in motion and breaks in logic; the fundamental reason is that they learn only "surface pixels," not "the rules of the world."

Core breakthrough of Emu3.5: Predicting "what the world will be next"  

Emu3.5 has completely changed this situation. The research team encoded images, text, and video into the same token sequence, and the model learns only one pure task: NSP (Next State Prediction), predicting the next state of the world.

Put simply (a minimal code sketch of this objective follows the list below):  

- Whether the input is an image, a piece of text, or a video frame, Emu3.5 treats it as a different expression of the "current state of the world."  

- The model always has exactly one task: predict "what the world will look like next."  

- The next moment could be text → the dialogue continues automatically;  

- The next moment could be an image → a plausible next action is generated;  

- The next moment may include both visual and language changes → the complete evolution of the world is simulated.
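
To make the idea concrete, here is a minimal sketch in PyTorch of what "one model, one next-token objective over a unified stream" can look like. The class name, vocabulary sizes, and architecture below are illustrative assumptions for this sketch, not Emu3.5's actual design.

```python
# Minimal sketch of next-state prediction over a unified token stream.
# All sizes and names here (TEXT_VOCAB, IMAGE_VOCAB, NextStatePredictor)
# are hypothetical, chosen only to illustrate the training objective.
import torch
import torch.nn as nn

TEXT_VOCAB = 1000                        # assumed toy text vocabulary size
IMAGE_VOCAB = 512                        # assumed toy visual codebook size (e.g. from a VQ tokenizer)
VOCAB = TEXT_VOCAB + IMAGE_VOCAB + 2     # shared vocabulary; +2 reserves ids for image boundary markers

class NextStatePredictor(nn.Module):
    """Decoder-only transformer that predicts the next token of the unified stream."""
    def __init__(self, vocab=VOCAB, dim=256, heads=4, layers=4, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(max_len, dim)
        block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):                        # tokens: (batch, seq)
        b, s = tokens.shape
        x = self.tok(tokens) + self.pos(torch.arange(s, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(s).to(tokens.device)
        h = self.backbone(x, mask=mask)               # causal self-attention over the whole stream
        return self.head(h)                           # next-token logits

# One training step: text and image tokens share a single next-token cross-entropy loss.
model = NextStatePredictor()
stream = torch.randint(0, VOCAB, (2, 128))            # stand-in for interleaved text/image codes
logits = model(stream[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), stream[:, 1:].reshape(-1))
loss.backward()
```

The point of the sketch is that nothing in the loss distinguishes modalities: whatever comes next in the stream, a text id or a visual code, is predicted the same way.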

Unified tokenization: images, text, and video fully integrated  

The biggest technical highlight of Emu3.5 is that it unifies all modalities into the same set of "building blocks of the world." The model no longer distinguishes whether something is "an image," "a sentence," or "a frame of video"; all information is discretized into token sequences. Through training on massive data, the model learns cross-modal causal relationships and physical common sense, truly acquiring "world-level understanding."
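
As an illustration of what "discretized into the same token sequence" can mean in practice, the toy function below maps text ids and visual codebook indices into one shared vocabulary by offsetting the visual codes and wrapping them in boundary markers. All names, sizes, and special tokens are hypothetical and match the sketch above, not Emu3.5's real tokenizer.

```python
# Toy illustration of a unified token stream, assuming a text tokenizer that returns
# ids in [0, TEXT_VOCAB) and a visual tokenizer (e.g. a VQ codebook) that returns
# ids in [0, IMAGE_VOCAB). Offsets and special tokens are hypothetical.
TEXT_VOCAB, IMAGE_VOCAB = 1000, 512
BOI, EOI = TEXT_VOCAB + IMAGE_VOCAB, TEXT_VOCAB + IMAGE_VOCAB + 1   # image boundary markers

def to_unified_stream(text_ids, image_codes):
    """Interleave text ids and image codebook indices into one discrete sequence."""
    image_tokens = [TEXT_VOCAB + c for c in image_codes]   # shift image codes past the text id range
    return text_ids + [BOI] + image_tokens + [EOI]

stream = to_unified_stream([12, 87, 5], [3, 499, 17, 240])
print(stream)   # one sequence the model consumes uniformly, regardless of modality
```

Once everything lives in a single discrete sequence like this, the same next-state predictor sketched above can consume and continue it without ever knowing which modality it is reading.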

From "pixel transporter" to "world simulator"  

Industry experts comment that Emu3.5 is a milestone in the transition of multimodal large models from the "generation era" to the "world model era." In the future, models built on Emu3.5 could not only generate more natural long videos and support interactive image editing, but also be applied directly to advanced scenarios such as embodied intelligence for robots, autonomous driving simulation, and prediction of the physical world.

AIbase's exclusive comments  

While major tech companies compete on parameters, resolution, and video length, Beijing Zhiyuan brings the core question back to "whether AI really understands the world." Emu3.5 uses the simplest possible objective, predicting the next token, to unify all modalities, yet achieves the most profound leap in capability: from "looking right" to "being right." With this original paradigm, the Chinese team once again points to a new direction for global AI.

The true world model has already arrived.