The Skywork AI team recently released a technical report announcing a major breakthrough in interactive world models. Its newly developed Matrix-Game3.0 system achieves, for the first time, real-time video generation at 40 frames per second (FPS) at 720p resolution, and tackles the long-standing lack of "long-term memory" in AI video generation.

Key Breakthrough: Solving the "Amnesia" Problem in AI Video
AI video generation models have long suffered from spatial-structure confusion and style drift over long interaction sequences because they lack an effective memory. Matrix-Game3.0 breaks through this bottleneck by introducing a camera-aware memory retrieval mechanism.
The system retrieves historical frames based on the current camera pose, and a unified self-attention architecture jointly models long-term memory, recent history, and the frames currently being predicted in the same space. Experiments show that even during complex interactions lasting several minutes, the system maintains high spatiotemporal consistency: when users revisit a scene, its details closely match the originally generated visuals.
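The report's retrieval mechanism is not reproduced here, but the core idea of indexing stored frames by camera pose can be sketched as follows. The distance function, its rotation weighting, and the 7-dimensional pose layout (3D position plus unit quaternion) are illustrative assumptions, not the paper's formulation:

```python
import numpy as np

def pose_distance(p, q, w_rot=1.0):
    """Distance between two camera poses, each given as
    (x, y, z, qx, qy, qz, qw): translation gap plus weighted rotation angle.
    (Illustrative metric, not the one from the report.)"""
    pos_d = np.linalg.norm(p[:3] - q[:3])
    # angle between unit quaternions: theta = 2 * arccos(|<p, q>|)
    dot = np.clip(abs(np.dot(p[3:], q[3:])), -1.0, 1.0)
    rot_d = 2.0 * np.arccos(dot)
    return pos_d + w_rot * rot_d

def retrieve_memory(current_pose, memory_poses, k=4):
    """Return indices of the k stored frames whose camera pose is
    closest to the current one."""
    d = np.array([pose_distance(current_pose, q) for q in memory_poses])
    return np.argsort(d)[:k]
```

The retrieved memory frames, the recent history, and the frames being predicted would then be concatenated as tokens for joint self-attention, which is how the unified architecture described above keeps revisited scenes consistent.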
Industrial-Scale Data Engine: Injecting Massive AAA Game Data
To help AI deeply understand the physical logic of the real world, the development team built a large-scale "data factory":
Virtual-Real Synchronized Generation: Using Unreal Engine 5 (UE5), the team developed the Unreal-Gen platform, which fully automatically generates film-quality interactive videos covering over 100 million character combinations.
Automated Collection from AAA Games: The system supports large-scale automatic recording of high-quality interactive data from top titles such as Grand Theft Auto V and Cyberpunk 2077.
Multi-Source Real-World Scene Supplementation: The dataset also integrates over 10,000 real-world 4K sequences covering indoor, urban, and aerial footage.
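The report's storage format is not given here; purely as an illustration, one recorded sample from such a pipeline might pair a frame with the control input and camera pose that produced it (all field names below are hypothetical, not from the Matrix-Game3.0 report):

```python
from dataclasses import dataclass

@dataclass
class InteractionSample:
    """Hypothetical record for one step of a recorded rollout.
    Field names are illustrative, not taken from the report."""
    frame_path: str     # path to one 720p RGB frame on disk
    action: dict        # control input, e.g. {"move": "forward", "yaw": 2.5}
    camera_pose: tuple  # (x, y, z, qx, qy, qz, qw)
    source: str         # e.g. "ue5", "gta5", "cyberpunk2077", "real_4k"

sample = InteractionSample("frames/000123.png", {"move": "forward"},
                           (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0), "ue5")
```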

Performance Optimization: Achieving Ultra-Low Latency Through "Lightweighting"
To meet the ultra-low-latency demands of real-time interaction, Matrix-Game3.0's inference architecture was deeply optimized. The team adopted a multi-segment autoregressive distillation strategy combined with VAE-decoder pruning (a pruning rate of up to 75%), increasing decoding speed more than fivefold. INT8 quantization and related techniques further cut computational cost, keeping the system running smoothly even at a 5B-parameter scale.
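The report's quantization recipe is not detailed here; as a minimal sketch of the generic technique the name refers to, symmetric per-tensor INT8 weight quantization maps each weight tensor to 8-bit integers plus one float scale:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: w ~ scale * q, q in [-127, 127]."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original float weights."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.01], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)   # close to w, at a quarter of the storage
```

Storing `q` instead of `w` shrinks the weights fourfold versus float32, and integer matrix multiplies are cheaper on most accelerators, which is where the reported compute savings come from.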
Future Vision: Towards an Infinite Generation Digital Universe
Beyond the 5B version, the team also demonstrated a 28B-parameter MoE (Mixture-of-Experts) model. As model size increases, it shows markedly stronger dynamic simulation, scene transitions, and generalization.
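The report does not describe the 28B model's routing internals; as a generic illustration of what "MoE" implies, a top-k routed layer scores each token against every expert but runs only the k best, so parameter count grows much faster than per-token compute:

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route input x to the top-k experts by gate score; return the
    softmax-weighted sum of the selected experts' outputs.
    (Generic sketch, not the Matrix-Game3.0 architecture.)"""
    logits = gate_w @ x                    # one gate score per expert
    top = np.argsort(logits)[-k:]          # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                           # renormalized softmax over top-k
    return sum(wi * experts[i](x) for wi, i in zip(w, top))
```

With, say, 28B total parameters spread across experts, each token only touches the k routed experts, which is why MoE models can scale capacity without a proportional inference-cost increase.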
Industry experts point out that the release of Matrix-Game3.0 provides a key technological foundation for robot training, XR (extended reality), and next-generation immersive entertainment, marking a new stage in which AI evolves from generating short clips to building interactive worlds in real time.
Paper URL: https://arxiv.org/pdf/2604.08995



