The NVIDIA research team has released OmniVinci, a multimodal understanding model that scores 19.05 points higher than existing top models on key multimodal understanding benchmarks. More impressively, OmniVinci was trained on only about one-sixth of the data those models used, demonstrating strong data efficiency alongside its performance.

OmniVinci aims to be a comprehensive AI system that understands vision, audio, and text simultaneously, letting machines perceive and reason about complex environments through multiple senses, much as humans do. To achieve this, the NVIDIA team combined innovative architectural designs with careful data curation strategies, integrating signals from the different modalities into a unified multimodal latent space that supports cross-modal understanding and reasoning.
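To make the idea of a unified multimodal latent space concrete, here is a minimal PyTorch-style sketch: each modality's encoder output is linearly projected into a shared embedding space and concatenated into a single token stream that a language model can attend over. The module names, dimensions, and concatenation scheme below are illustrative assumptions, not OmniVinci's actual implementation.

```python
import torch
import torch.nn as nn

class UnifiedMultimodalFusion(nn.Module):
    """Illustrative sketch: project per-modality features into one shared
    latent space and concatenate them into a single multimodal token
    sequence. All dimensions and names here are assumptions."""

    def __init__(self, vision_dim=1024, audio_dim=768, latent_dim=4096):
        super().__init__()
        # One projection per modality into the shared latent space.
        self.vision_proj = nn.Linear(vision_dim, latent_dim)
        self.audio_proj = nn.Linear(audio_dim, latent_dim)

    def forward(self, vision_feats, audio_feats):
        # vision_feats: (batch, n_frames, vision_dim)
        # audio_feats:  (batch, n_clips,  audio_dim)
        v = self.vision_proj(vision_feats)
        a = self.audio_proj(audio_feats)
        # One token sequence in the shared latent space for the LLM.
        return torch.cat([v, a], dim=1)

fusion = UnifiedMultimodalFusion()
tokens = fusion(torch.randn(2, 16, 1024), torch.randn(2, 8, 768))
print(tokens.shape)  # torch.Size([2, 24, 4096])
```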


On the DailyOmni benchmark, OmniVinci outperformed Qwen2.5-Omni by the 19.05-point margin cited above; it also scored 1.7 points higher on the MMAR audio understanding test and 3.9 points higher on the Video-MME visual understanding test. It was trained on just 0.2 trillion tokens versus the 1.2 trillion used by Qwen2.5-Omni, meaning OmniVinci reached these results with one-sixth the training data.

The core innovation of the model lies in its multimodal alignment mechanism, which comprises three techniques: the OmniAlignNet module, Temporal Embedding Grouping (TEG), and Constrained Rotary Time Embedding (CRTE). OmniAlignNet exploits the complementarity between visual and audio signals to strengthen the learning and alignment of their embeddings. TEG encodes relative temporal relationships by grouping visual and audio tokens according to their timestamps. CRTE further addresses temporal alignment, ensuring the model can capture the absolute time at which events occur.
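The paper's exact formulations are not reproduced here, but the two temporal mechanisms can be illustrated with a minimal sketch: a TEG-style pass that orders visual and audio tokens into shared time windows, and a CRTE-style rotary rotation driven by each token's absolute timestamp. The window size, frequency schedule, clamping rule, and within-group ordering are all assumptions for illustration.

```python
import torch

def temporal_embedding_grouping(vision_tokens, audio_tokens, window=1.0):
    """TEG-style sketch: interleave visual and audio tokens so that tokens
    from the same time window sit together in the sequence.

    vision_tokens / audio_tokens: lists of (timestamp_sec, embedding) pairs.
    The within-group ordering (vision before audio) is an assumption."""
    tagged = ([(t, 0, e) for t, e in vision_tokens] +
              [(t, 1, e) for t, e in audio_tokens])
    # Sort by time window first, then modality, then exact timestamp.
    tagged.sort(key=lambda x: (int(x[0] // window), x[1], x[0]))
    timestamps = torch.tensor([t for t, _, _ in tagged])
    return torch.stack([e for _, _, e in tagged]), timestamps

def constrained_rotary_time_embedding(tokens, timestamps, max_time=60.0):
    """CRTE-style sketch: rotate each token's feature pairs by an angle
    proportional to its absolute timestamp, with timestamps clamped to a
    constrained range. Frequency schedule and clamp rule are assumptions."""
    d = tokens.shape[-1]
    inv_freq = 1.0 / (10000 ** (torch.arange(0, d, 2).float() / d))
    angles = timestamps.clamp(max=max_time).unsqueeze(-1) * inv_freq
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = tokens[..., 0::2], tokens[..., 1::2]
    out = torch.empty_like(tokens)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Toy usage: 3 video frames and 2 audio clips with absolute timestamps.
vis = [(0.2, torch.randn(8)), (1.1, torch.randn(8)), (2.3, torch.randn(8))]
aud = [(0.5, torch.randn(8)), (1.9, torch.randn(8))]
seq, ts = temporal_embedding_grouping(vis, aud)
seq = constrained_rotary_time_embedding(seq, ts)
print(seq.shape)  # torch.Size([5, 8])
```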


The research team adopted a two-stage training approach: modality-specific training first, followed by omni-modal joint training, gradually building up the model's omni-modal understanding. In the implicit omni-modal learning stage, the researchers used existing video question-answering datasets to further improve the model's joint audio-visual understanding.
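A minimal sketch of that two-stage schedule follows, assuming a Hugging-Face-style model whose forward pass returns an object with a .loss attribute; the loader layout, learning rate, and epoch counts are placeholders, not the paper's actual recipe.

```python
import torch

def two_stage_training(model, modality_loaders, joint_loader, epochs=(1, 1)):
    """Sketch of the two-stage recipe described above. Assumes model(**batch)
    returns an object with a .loss attribute; all hyperparameters and the
    loader structure are placeholder assumptions."""
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # Stage 1: modality-specific training (e.g. vision-text, audio-text).
    for _ in range(epochs[0]):
        for loader in modality_loaders.values():
            for batch in loader:
                loss = model(**batch).loss
                opt.zero_grad()
                loss.backward()
                opt.step()

    # Stage 2: omni-modal joint training on data combining all modalities,
    # including implicit omni-modal learning from video-QA datasets.
    for _ in range(epochs[1]):
        for batch in joint_loader:
            loss = model(**batch).loss
            opt.zero_grad()
            loss.backward()
            opt.step()
```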

The release of OmniVinci marks a significant breakthrough for NVIDIA in multimodal AI and is expected to drive the technology's adoption across a range of applications, helping to build smarter systems and services. Its open-source release also gives researchers and developers worldwide new opportunities to explore and innovate with AI in practical settings.