NVIDIA, in collaboration with the University of Toronto, the Vector Institute, and the University of Texas at Austin, recently released **ViPE (Video Pose Engine)**. ViPE targets a central challenge in 3D geometric perception: extracting 3D information efficiently and accurately from complex, in-the-wild video.
Technology Core and Applications
3D geometric perception underpins technologies such as autonomous driving, virtual reality (VR), and augmented reality (AR). ViPE quickly estimates camera intrinsics, camera motion, and high-precision depth maps from raw video, giving these spatial AI systems a reliable data foundation.
ViPE is highly adaptable. It handles a range of footage, including dynamic selfie videos, cinematic shots, and dashcam recordings, and it supports pinhole, wide-angle, and 360° panoramic camera models.
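To make the difference between these camera models concrete, here is a minimal sketch of how a 3D point maps to a pixel under a pinhole model and under a 360° equirectangular model. It is an illustration only, not ViPE's code: the function names, the axis convention (+z forward, +x right, +y down), and the example intrinsics are assumptions chosen for this example.

```python
import math


def project_pinhole(point, fx, fy, cx, cy):
    """Project a camera-frame 3D point to pixels with a pinhole model.

    fx, fy are focal lengths in pixels; cx, cy is the principal point.
    Assumed convention: +z points forward, so z must be positive.
    """
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera")
    return (fx * x / z + cx, fy * y / z + cy)


def project_equirectangular(point, width, height):
    """Project a camera-frame 3D point onto a 360° equirectangular image.

    Only the viewing direction matters here, so points behind the camera
    are valid too. Longitude spans the image width, latitude the height.
    """
    x, y, z = point
    lon = math.atan2(x, z)                 # [-pi, pi], 0 along +z
    lat = math.atan2(y, math.hypot(x, z))  # [-pi/2, pi/2], positive downward
    u = (lon / (2 * math.pi) + 0.5) * width
    v = (lat / math.pi + 0.5) * height
    return (u, v)


if __name__ == "__main__":
    p = (1.0, 0.0, 2.0)
    print(project_pinhole(p, fx=500, fy=500, cx=320, cy=240))
    print(project_equirectangular(p, width=1024, height=512))
```

The pinhole model needs four intrinsic parameters and only sees points in front of the camera, while the equirectangular model has no focal length and covers every direction. That is one reason a pipeline that accepts both has to treat the camera model as an explicit choice.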
Working Principle and Performance
To keep ViPE accurate, the research team uses a hybrid approach that combines several constraints:
Bundle Adjustment: Dense bundle adjustment is run over keyframes to jointly estimate camera intrinsics, camera poses, and depth maps.
Dense Flow and Sparse Point Constraints: Dense flow constraints from the DROID-SLAM network and sparse point constraints from the cuVSLAM library are combined to provide robustness and sub-pixel accuracy.
Depth Regularization: A monocular metric depth network is used to resolve scale ambiguity and consistency issues, producing high-resolution, temporally consistent depth.
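The scale ambiguity mentioned above can be shown with a small sketch. In a purely geometric reconstruction, multiplying every depth and the camera translation by the same factor leaves all reprojected pixels unchanged, so image measurements alone cannot fix the scene's metric scale. A metric depth prior breaks the tie. The code below is a simplified illustration under stated assumptions, not ViPE's formulation: it assumes a pure translation between two frames (no rotation), and the helper names `reproject` and `fit_scale` are hypothetical. ViPE applies the depth prior as a regularizer within its optimization, whereas this sketch fits a single scale factor after the fact.

```python
def reproject(pixel, depth, intrinsics, translation):
    """Lift a pixel to 3D using its depth, move the camera, and reproject.

    intrinsics is (fx, fy, cx, cy). The camera motion is assumed to be a
    pure translation (identity rotation) to keep the example short.
    """
    fx, fy, cx, cy = intrinsics
    u, v = pixel
    # Back-project the pixel into the first camera's frame.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    z = depth
    # Express the point in the second camera's frame.
    tx, ty, tz = translation
    xn, yn, zn = x + tx, y + ty, z + tz
    # Pinhole projection into the second image.
    return (fx * xn / zn + cx, fy * yn / zn + cy)


def fit_scale(up_to_scale_depths, metric_depths):
    """Least-squares scale s minimizing sum((s * d - m) ** 2)."""
    num = sum(d * m for d, m in zip(up_to_scale_depths, metric_depths))
    den = sum(d * d for d in up_to_scale_depths)
    return num / den


if __name__ == "__main__":
    K = (500.0, 500.0, 320.0, 240.0)
    # Same pixel, geometry scaled by 3: the reprojection is identical.
    print(reproject((400.0, 300.0), 2.0, K, (0.1, 0.0, 0.0)))
    print(reproject((400.0, 300.0), 6.0, K, (0.3, 0.0, 0.0)))
    # A metric depth prior pins down the missing factor.
    print(fit_scale([1.0, 2.0, 3.0], [2.5, 5.0, 7.5]))
```

Because both reprojections agree, a reprojection-only cost has no preferred scale. The metric depth values are what make the recovered trajectory scale-consistent.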
Test results show that ViPE outperforms existing methods (such as MegaSAM, VGGT, and MASt3R-SLAM) on multiple benchmarks. Beyond its accuracy in estimating poses and camera intrinsics, it runs stably at 3 to 5 frames per second on a single GPU and produces scale-consistent trajectories.
To support further research in spatial AI, the team also released a large dataset of approximately 96 million annotated frames. With ViPE and this dataset, 3D geometric perception takes a notable step forward, and future spatial AI applications gain a stronger foundation.
Project page: https://research.nvidia.com/labs/toronto-ai/vipe/