NVIDIA recently released its Cosmos-Reason1 series of models, which aim to strengthen AI capabilities in physical common sense and embodied reasoning. Artificial intelligence has made significant progress in language processing, mathematics, and code generation, but extending these capabilities to physical environments remains a major challenge.
Physical AI (PAI) differs from traditional AI in that it relies on sensory inputs such as video and takes real-world physical laws into account when generating responses. Its application areas include robotics and autonomous vehicles, which require common-sense reasoning and a deep understanding of space, time, and physical laws.
However, existing AI models remain weak at connecting with the physical world: they often lack an intuitive grasp of concepts such as gravity or spatial relationships, which hurts their performance on embodied tasks. Training directly in the physical world is also costly and risky, which has slowed the development of PAI.
To address these issues, NVIDIA offers the Cosmos-Reason1 series in two versions, Cosmos-Reason1-7B and Cosmos-Reason1-56B. Both are trained in two stages: Physical AI supervised fine-tuning, followed by reinforcement learning.
The research team introduced a dual ontology system. The first is a hierarchical ontology that organizes physical common sense into three categories: space, time, and fundamental physics. The second maps the reasoning capabilities of embodied agents such as humans, robotic arms, and humanoid robots.
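To make the two ontologies concrete, here is a minimal Python sketch of how they could be represented as data. It is an illustration only, not NVIDIA's implementation: the names `CommonSenseCategory`, `EmbodiedCell`, and `embodied_grid` are hypothetical, and the capability labels are an assumption borrowed from the tasks this article mentions, not the ontology's actual entries.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class CommonSenseCategory(Enum):
    """Top-level categories of the physical common sense ontology (from the article)."""
    SPACE = "space"
    TIME = "time"
    FUNDAMENTAL_PHYSICS = "fundamental physics"


# Agent types named in the article as examples of embodied agents.
AGENTS = ("human", "robotic arm", "humanoid robot")

# Assumption: illustrative capability labels taken from the tasks the article lists.
CAPABILITIES = (
    "predict next action",
    "verify task completion",
    "assess physical feasibility",
)


@dataclass(frozen=True)
class EmbodiedCell:
    """One (agent, capability) pair in the embodied reasoning ontology."""
    agent: str
    capability: str


def embodied_grid():
    """Enumerate every agent/capability combination as a flat list."""
    return [EmbodiedCell(a, c) for a, c in product(AGENTS, CAPABILITIES)]


if __name__ == "__main__":
    for category in CommonSenseCategory:
        print("common sense:", category.value)
    for cell in embodied_grid():
        print("embodied:", cell.agent, "/", cell.capability)
```

Representing the second ontology as a grid of pairs reflects the article's description of it as a mapping between agents and reasoning capabilities; a real taxonomy would likely carry more categories and subcategories than shown here.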
The model architecture pairs a decoder-only large language model with a vision encoder, so that video input can be processed and reasoned over jointly with text. To evaluate the models, the team constructed three benchmarks for physical common sense, covering 604 questions and 426 videos, and six benchmarks for embodied reasoning, covering 610 questions and 600 videos.
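As a quick check on the scale of the evaluation suite, the figures above can be tallied in a few lines of Python. This is only a sketch: the `BenchmarkGroup` type and `totals` helper are hypothetical names introduced for illustration, and the numbers are the ones quoted in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkGroup:
    """Aggregate figures for one family of benchmarks, as quoted in the article."""
    name: str
    benchmarks: int
    questions: int
    videos: int


GROUPS = [
    BenchmarkGroup("physical common sense", benchmarks=3, questions=604, videos=426),
    BenchmarkGroup("embodied reasoning", benchmarks=6, questions=610, videos=600),
]


def totals(groups):
    """Return (benchmarks, questions, videos) summed over all groups."""
    return (
        sum(g.benchmarks for g in groups),
        sum(g.questions for g in groups),
        sum(g.videos for g in groups),
    )


if __name__ == "__main__":
    n_benchmarks, n_questions, n_videos = totals(GROUPS)
    print(f"{n_benchmarks} benchmarks, {n_questions} questions, {n_videos} videos")
    # prints "9 benchmarks, 1214 questions, 1026 videos"
```

In total, the suite spans nine benchmarks with 1,214 questions over 1,026 videos.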
After training, the Cosmos-Reason1 models performed strongly on both the physical common sense and the embodied reasoning benchmarks. The reinforcement learning stage in particular brought significant gains in predicting next actions, verifying task completion, and assessing physical feasibility.
With the release of the Cosmos-Reason1 series, NVIDIA provides a new option for physical reasoning tasks, with promising applications in robotics and autonomous driving.
Access: https://github.com/nvidia-cosmos/cosmos-reason1
Key points:
🌟 NVIDIA releases the Cosmos-Reason1 series models to enhance AI capabilities in physical reasoning.
🤖 The models are built around a dual ontology system and pair a decoder-only language model with a vision encoder to reason jointly over video and text.
📈 The Cosmos-Reason1 models perform strongly on physical common sense and embodied reasoning benchmarks.