In the field of artificial intelligence, vision-language models (VLMs) have made rapid progress in recent years, especially in 2D visual understanding, and researchers have increasingly turned their attention to 3D scene understanding. However, because high-quality spatial data is scarce and most existing models rely on a static-viewpoint assumption, current 3D VLMs often struggle to reason effectively and generalize. To address these challenges, a research team recently released a new foundation model called 3D-R1.

The core innovation of 3D-R1 lies in improving the reasoning and generalization of 3D scene understanding through three components: a high-quality synthetic dataset, reinforcement learning, and dynamic view selection. Using existing 3D vision-language (3D-VL) datasets and a data engine built on Gemini 2.5 Pro, the researchers constructed a high-quality synthetic dataset called Scene-30K, which provides strong initialization data for 3D-R1.
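The article does not spell out how the data engine works internally, but the general pattern of such pipelines is to prompt a strong model per scene and filter the outputs. Below is a minimal sketch under that assumption; the model name string, the prompt template, the `<think>`/`<answer>` tag format, and the rule-based filter are all illustrative stand-ins, not the paper's actual pipeline.

```python
import json
import google.generativeai as genai  # pip install google-generativeai

genai.configure(api_key="YOUR_API_KEY")
# Model identifier is an assumption; substitute whatever your account exposes.
model = genai.GenerativeModel("gemini-2.5-pro")

PROMPT_TEMPLATE = (
    "You are given a textual description of a 3D indoor scene:\n{scene}\n\n"
    "Write one question about the scene, then answer it with explicit "
    "step-by-step reasoning inside <think>...</think> tags and the final "
    "answer inside <answer>...</answer> tags."
)

def synthesize_samples(scene_descriptions, out_path="scene30k_subset.jsonl"):
    """Generate chain-of-thought QA pairs for each scene and keep only
    well-formed outputs (a crude stand-in for real quality filtering)."""
    with open(out_path, "w") as f:
        for scene in scene_descriptions:
            response = model.generate_content(PROMPT_TEMPLATE.format(scene=scene))
            text = response.text
            # Rule-based filter: keep samples that follow the expected format.
            if "<think>" in text and "<answer>" in text:
                f.write(json.dumps({"scene": scene, "sample": text}) + "\n")
```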

During reinforcement learning training, 3D-R1 introduces several reward functions, including a perception reward, a semantic similarity reward, and a format reward, designed to strengthen the model's reasoning while preserving detection accuracy and the semantic precision of its answers. In addition, 3D-R1 adopts a dynamic view selection strategy that adaptively chooses the viewpoints most informative for 3D scene understanding.
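To make the three-part reward concrete, here is a minimal sketch of how such signals might be combined into a single scalar for RL. The weights, the axis-aligned box representation, and the string-ratio stand-in for semantic similarity (the paper presumably uses embedding-based similarity) are assumptions for illustration, not the published implementation.

```python
import difflib
import re

def format_reward(output: str) -> float:
    """1.0 if the output follows the expected <think>/<answer> template."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, output, flags=re.DOTALL) else 0.0

def iou_3d(box_a, box_b) -> float:
    """3D IoU of axis-aligned boxes given as (x1, y1, z1, x2, y2, z2)."""
    inter = 1.0
    for i in range(3):
        lo = max(box_a[i], box_b[i])
        hi = min(box_a[i + 3], box_b[i + 3])
        if hi <= lo:
            return 0.0
        inter *= hi - lo
    vol = lambda b: (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol(box_a) + vol(box_b) - inter)

def perception_reward(pred_box, gt_box) -> float:
    """Reward grounded detections by 3D IoU with the ground-truth box."""
    return iou_3d(pred_box, gt_box)

def semantic_reward(pred_answer: str, gt_answer: str) -> float:
    """Crude proxy for answer similarity via character-level matching."""
    return difflib.SequenceMatcher(None, pred_answer, gt_answer).ratio()

def total_reward(output, pred_box, gt_box, pred_answer, gt_answer,
                 w_fmt=0.2, w_per=0.4, w_sem=0.4) -> float:
    # Weights are illustrative, not the paper's values.
    return (w_fmt * format_reward(output)
            + w_per * perception_reward(pred_box, gt_box)
            + w_sem * semantic_reward(pred_answer, gt_answer))
```

The format term gates reward on well-structured reasoning traces, while the perception and semantic terms anchor the policy to spatially and semantically correct answers, which matches the division of labor the article describes.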

Across a series of experiments, 3D-R1 delivered an average improvement of 10% on multiple 3D scene benchmarks, demonstrating its effectiveness in strengthening the reasoning and generalization capabilities of 3D scene understanding. The research team stated that the release of 3D-R1 marks an important milestone for 3D vision-language models, laying a solid foundation for future research and applications.