Tencent has officially released HunyuanWorld-Voyager, a video diffusion framework that generates world-consistent 3D point clouds from a single input image, letting users explore scenes immersively along custom camera paths.
According to Tencent, this is the first ultra-long-range world model with native 3D reconstruction capabilities, redefining spatial intelligence for AI-driven VR, gaming, and simulation. The model not only generates accurately aligned depth and RGB video, but its output can also be used directly for high-quality 3D reconstruction without post-processing.
Direct 3D Output: Exports generated RGB-depth video to standard 3D point cloud formats without tools like COLMAP, enabling immediate 3D applications (see the sketch after this list).
Innovative 3D Memory: Introduces a scalable world cache mechanism that keeps geometry consistent along any camera trajectory.
Top Performance: Ranks first on Stanford's WorldScore benchmark and performs strongly on video generation and 3D reconstruction benchmarks.
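To make "direct 3D output" concrete, the sketch below unprojects one aligned RGB-depth frame into a colored point cloud and writes a PLY file that any standard 3D viewer can open. It is a minimal illustration assuming a simple pinhole camera; the intrinsics, function names, and toy data are placeholders, not Voyager's actual export code.

```python
import numpy as np

def unproject_rgbd(rgb, depth, fx, fy, cx, cy):
    """Unproject one aligned RGB-D frame into a colored point cloud.

    rgb:   (H, W, 3) uint8 image
    depth: (H, W) metric depth in meters (illustrative assumption)
    fx, fy, cx, cy: pinhole camera intrinsics
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx          # back-project pixels into camera space
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    cols = rgb.reshape(-1, 3)
    mask = pts[:, 2] > 0           # drop pixels with no valid depth
    return pts[mask], cols[mask]

def save_ply(path, pts, cols):
    """Write an ASCII PLY file readable by standard 3D tools."""
    with open(path, "w") as f:
        f.write("ply\nformat ascii 1.0\n")
        f.write(f"element vertex {len(pts)}\n")
        f.write("property float x\nproperty float y\nproperty float z\n")
        f.write("property uchar red\nproperty uchar green\nproperty uchar blue\n")
        f.write("end_header\n")
        for (x, y, z), (r, g, b) in zip(pts, cols):
            f.write(f"{x:.4f} {y:.4f} {z:.4f} {r} {g} {b}\n")

# Toy example: a flat 4x4 gray frame at 2 m depth
rgb = np.full((4, 4, 3), 200, dtype=np.uint8)
depth = np.full((4, 4), 2.0)
pts, cols = unproject_rgbd(rgb, depth, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
save_ply("frame.ply", pts, cols)
```

Because the model emits metric depth aligned to each RGB frame, an unprojection step like this is all that separates the video output from a usable point cloud; no structure-from-motion pass is needed.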
The architecture of HunyuanWorld-Voyager comprises two key components. The first is "World-Consistent Video Diffusion," a unified architecture that generates accurately aligned RGB and depth video sequences conditioned on existing world observations, ensuring global scene consistency. The second is "Long-Range World Exploration," which uses an efficient world cache with point cloud culling and autoregressive inference to support iterative scene expansion, achieving smooth video sampling through context-aware consistency techniques.
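The cache-and-extend idea can be sketched in a few lines. The class below keeps generated points in a voxel hash and culls new points that land in already-occupied voxels; the voxel size, data shapes, and the generation loop outlined in the comments are illustrative assumptions, not Voyager's published implementation.

```python
import numpy as np

class WorldCache:
    """Hypothetical point-cloud world cache with voxel-based culling.

    The idea: keep every generated 3D point in a shared cache and cull
    new points that duplicate regions the cache already covers, so the
    scene geometry stays consistent as the camera keeps moving.
    """

    def __init__(self, voxel_size=0.05):
        self.voxel_size = voxel_size
        self.voxels = {}  # voxel index -> (xyz point, rgb color)

    def add_points(self, pts, cols):
        """Insert points, skipping voxels that are already occupied."""
        keys = np.floor(pts / self.voxel_size).astype(np.int64)
        kept = 0
        for key, p, c in zip(map(tuple, keys), pts, cols):
            if key not in self.voxels:   # the point-cloud culling step
                self.voxels[key] = (p, c)
                kept += 1
        return kept                      # number of genuinely new points

# Autoregressive exploration, in outline:
#   for each segment of the user's camera trajectory:
#     1. project the cached points into the new view as conditioning
#     2. run the RGB-D video diffusion model on that partial view
#     3. unproject the generated depth into 3D and add_points() it
cache = WorldCache(voxel_size=0.1)
pts = np.random.rand(1000, 3) * 5.0           # stand-in for generated geometry
cols = np.random.randint(0, 256, (1000, 3))
print(cache.add_points(pts, cols), "new points cached")
print(cache.add_points(pts, cols), "added on re-insert (all culled)")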
To train HunyuanWorld-Voyager, the research team built a scalable data construction engine: an automated video reconstruction pipeline that estimates camera poses and metric depth for any input video, eliminating manual annotation and enabling the construction of large-scale, diverse training data. Using this pipeline, the team combined real-world footage with Unreal Engine-rendered videos to build a dataset of more than 100,000 video clips.
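In outline, such a pipeline runs every raw clip through pose and depth estimators and keeps only clips whose reconstruction is reliable. The sketch below uses hypothetical placeholder estimators, since the article does not name the specific models the engine uses.

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical placeholder estimators: the article does not name the
# pose tracker or depth model the real pipeline uses.

def estimate_poses(frames):
    """Stand-in for an SfM-style camera tracker: identity poses, fixed score."""
    return [np.eye(4) for _ in frames], 0.9

def estimate_metric_depth(frames, poses):
    """Stand-in for a metric depth estimator aligned to the pose scale."""
    return [np.ones(f.shape[:2]) for f in frames]

@dataclass
class TrainingClip:
    frames: list       # RGB frames
    poses: list        # per-frame 4x4 camera poses
    depths: list       # metric depth maps
    confidence: float  # reconstruction reliability score

def build_dataset(raw_clips, min_confidence=0.8):
    """Automated annotation: estimate poses and depth, keep reliable clips.

    No manual labeling anywhere, which is what makes this kind of
    pipeline scalable to 100,000+ clips.
    """
    dataset = []
    for frames in raw_clips:
        poses, conf = estimate_poses(frames)
        if conf < min_confidence:        # drop clips the tracker can't handle
            continue
        depths = estimate_metric_depth(frames, poses)
        dataset.append(TrainingClip(frames, poses, depths, conf))
    return dataset

# Toy run on two fake 2-frame clips of 8x8 RGB images
clips = [[np.zeros((8, 8, 3), dtype=np.uint8)] * 2 for _ in range(2)]
print(len(build_dataset(clips)), "clips annotated")
```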
In experimental evaluations, HunyuanWorld-Voyager delivered excellent video generation quality. Compared against four open-source camera-controllable video generation methods, it outperformed them on metrics such as PSNR, SSIM, and LPIPS. For scene reconstruction, videos generated by HunyuanWorld-Voyager also yielded better geometric consistency.
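For readers unfamiliar with the metrics, higher PSNR and SSIM and lower LPIPS all indicate generated frames closer to the reference. Here is a generic per-frame computation using scikit-image and the lpips package; it is a sketch of the metrics themselves, not the paper's evaluation harness.

```python
import numpy as np
import torch
import lpips                                    # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_metrics(ref, gen, lpips_model):
    """PSNR / SSIM / LPIPS for one pair of uint8 HxWx3 frames.

    Higher PSNR and SSIM are better; lower LPIPS is better.
    """
    psnr = peak_signal_noise_ratio(ref, gen, data_range=255)
    ssim = structural_similarity(ref, gen, channel_axis=2, data_range=255)
    # LPIPS expects NCHW float tensors scaled to [-1, 1]
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = lpips_model(to_t(ref), to_t(gen)).item()
    return psnr, ssim, lp

lpips_model = lpips.LPIPS(net="alex")           # standard AlexNet-based variant
ref = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
noise = np.random.randint(-5, 6, ref.shape)
gen = np.clip(ref.astype(int) + noise, 0, 255).astype(np.uint8)
print(frame_metrics(ref, gen, lpips_model))
```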
Additionally, HunyuanWorld-Voyager achieved the highest score on the static WorldScore benchmark, demonstrating its strength in camera motion control and spatial consistency. This result not only showcases the potential of the HunyuanWorld models but also paves the way for future 3D scene generation technology.
Key Points:
🌍 HunyuanWorld-Voyager generates world-consistent 3D point clouds from a single input image, enabling immersive user exploration.
🎥 The model simultaneously generates precisely aligned depth information and RGB videos, suitable for high-quality 3D reconstruction.
🏆 In multiple tests, HunyuanWorld-Voyager outperformed other models in both video generation quality and scene reconstruction effectiveness.