From early ImageNet classification to today's diffusion models, computer vision has spent the past decade teaching machines to "see the world." But as perceptual capabilities approach human limits, the marginal gains from pursuing accuracy alone are shrinking. At CVPR 2026, the focus of visual-intelligence research has shifted profoundly: vision is no longer an end in itself but an intermediary for reasoning, decision-making, and interaction.

Leaving Behind "Blind Reasoning": Toward Adaptive and Implicit Paths

For a long time, multimodal models have assumed that logical reasoning must proceed through an explicit "chain of thought" (CoT). Recent work suggests, however, that reasoning on every query is often wasteful. The VideoAuto-R1 framework, for example, introduces "on-demand reasoning": it answers simple perceptual questions directly and triggers explicit reasoning only in complex logical scenarios. Experiments show this maintains peak performance while cutting average output length by a factor of 3.3.
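
A minimal sketch of what such a gate could look like, assuming a generic `model.generate` interface; the difficulty heuristic, prompt templates, and threshold below are illustrative placeholders, not VideoAuto-R1's actual mechanism:

```python
# Sketch of an "on-demand reasoning" gate.
# NOTE: estimate_difficulty(), the prompts, and the 0.5 threshold are
# illustrative assumptions; a real system would learn the gate (e.g., via
# RL on answer accuracy), not hand-code it.

def estimate_difficulty(question: str) -> float:
    """Toy proxy: treat questions with logical connectives as 'hard'."""
    hard_cues = ("why", "how many steps", "if", "compare", "order of events")
    return 1.0 if any(cue in question.lower() for cue in hard_cues) else 0.0

def answer(model, frames, question: str, threshold: float = 0.5) -> str:
    if estimate_difficulty(question) < threshold:
        # Simple perceptual query: answer directly, no chain of thought.
        prompt = f"Answer concisely: {question}"
    else:
        # Complex logical query: trigger explicit step-by-step reasoning.
        prompt = f"Think step by step, then answer: {question}"
    return model.generate(frames, prompt)
```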


The medium of reasoning is changing as well. Models previously leaned heavily on language descriptions to handle spatial relationships, which breaks down on puzzles and geometric structures. The emerging trend is to let models reason implicitly, directly in latent space, without first serializing their intermediate thoughts into linear text, which captures complex visual structure more naturally.
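
To make the idea concrete, here is a toy PyTorch sketch of latent-space refinement: intermediate "thoughts" stay as continuous token states rather than being decoded into words. The module shapes and step count are illustrative assumptions, not any published architecture:

```python
import torch
import torch.nn as nn

# Implicit latent-space reasoning sketch: the model iteratively refines a
# set of latent visual tokens instead of emitting text at each step.

class LatentReasoner(nn.Module):
    def __init__(self, dim: int = 768, steps: int = 4):
        super().__init__()
        self.steps = steps
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, num_tokens, dim) latent visual tokens from the encoder.
        for _ in range(self.steps):  # weights shared across reasoning steps
            h = self.norm1(z)
            z = z + self.attn(h, h, h, need_weights=False)[0]
            z = z + self.mlp(self.norm2(z))
        return z  # refined latents, fed straight to the answer head

z = torch.randn(2, 196, 768)      # e.g., 14x14 patch tokens
refined = LatentReasoner()(z)     # reasoning happens without any text
```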

Re-evaluating Evaluation Systems: Breaking the Illusion of Multiple-Choice Success

Current evaluations of vision-language models mostly rely on multiple-choice question answering (MCQA), which may systematically overestimate model capability. Studies have found that models often "cheat" via elimination or option bias, inflating scores by roughly 20 points. In response, the field is moving toward a "verifiable open QA" paradigm that forces models to genuinely understand the visual content rather than exploit option cues.
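
The contrast between the two scoring regimes fits in a few lines. The `normalize` helper and exact-match check below are simplified stand-ins; real verifiable-QA benchmarks use stronger verifiers such as rule-based matchers or an LLM judge:

```python
# MCQA scoring vs. "verifiable open QA" scoring (simplified).

def score_mcqa(model_output: str, gold_letter: str) -> bool:
    # A single letter suffices to be "correct" -- the model can get here
    # via elimination or option bias without understanding the image.
    return model_output.strip().upper().startswith(gold_letter)

def normalize(ans: str) -> str:
    return " ".join(ans.lower().strip().rstrip(".").split())

def score_open_qa(model_output: str, gold_answers: list[str]) -> bool:
    # No options are shown; the free-form answer must match a verified
    # reference set, so option cues cannot inflate the score.
    return normalize(model_output) in {normalize(g) for g in gold_answers}

print(score_mcqa("B", "B"))  # True -- even if "B" was a blind guess
print(score_open_qa("A red fire hydrant.",
                    ["red fire hydrant", "a red fire hydrant"]))  # True
```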

At the same time, evaluation scenarios are shifting from static, single-agent images to interactive multi-agent environments. Benchmarks such as VS-Bench require models not only to understand the environment but also to reason strategically and make decisions in interactions involving cooperation and competition. This marks visual intelligence's evolution from a mere "understander" into a "decision-maker."
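
A generic evaluation loop of the kind such benchmarks imply might look as follows; the `env` and `Agent` interfaces are hypothetical, not VS-Bench's actual API:

```python
from typing import Protocol

# Multi-agent episode evaluation sketch. Interfaces are assumptions.

class Agent(Protocol):
    def act(self, observation: dict) -> str: ...

def evaluate_episode(env, agents: dict[str, Agent],
                     max_steps: int = 50) -> dict[str, float]:
    observations = env.reset()  # per-agent visual observations
    rewards = {name: 0.0 for name in agents}
    for _ in range(max_steps):
        # Each agent decides from its own partial view: scoring well
        # requires modeling the other agents' strategies, not just pixels.
        actions = {name: agent.act(observations[name])
                   for name, agent in agents.items()}
        observations, step_rewards, done = env.step(actions)
        for name, r in step_rewards.items():
            rewards[name] += r
        if done:
            break
    return rewards
```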


Infrastructure Upgrade: Open-Source Models and Real-World Data Completion

In terms of model form, the open-source community is moving toward full transparency. Models like Molmo2 release not only weights but also their data and complete training recipes. They extend capabilities from single images to video and add precise pointing, a leap from "understanding" a scene to "pointing out" locations within it.
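
For illustration, pointing outputs can be consumed by parsing coordinates out of the model's response. The `<point>` markup below follows the original Molmo's pointing format; whether Molmo2 emits the same markup is an assumption here:

```python
import re

# Extract (x, y) point annotations from a grounded model response.
# Coordinates in Molmo-style markup are percentages of image size.
POINT_RE = re.compile(r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"[^>]*>')

def extract_points(response: str) -> list[tuple[float, float]]:
    return [(float(x), float(y)) for x, y in POINT_RE.findall(response)]

reply = 'The mug is here: <point x="61.5" y="40.2" alt="mug">mug</point>.'
print(extract_points(reply))  # [(61.5, 40.2)]
```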

This progress rests on increasingly complete data infrastructure. For text-driven image editing, large-scale real-world datasets such as Pico-Banana-400K fill the gap left by over-reliance on synthetic data. The dataset supports multi-turn editing and preference alignment, a solid foundation for training editing models with more common sense and logical consistency.
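
As a sketch, a multi-turn editing record with preference pairs might be organized like this; the field names are illustrative and not Pico-Banana-400K's actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical record layout for a multi-turn editing dataset with
# preference pairs, in the spirit of what the text describes.

@dataclass
class EditTurn:
    instruction: str     # e.g., "make the sky overcast"
    source_image: str    # image before this edit
    edited_image: str    # result, used as input to the next turn

@dataclass
class PreferencePair:
    preferred: str       # edit judged faithful to the instruction
    rejected: str        # edit judged unfaithful or low quality

@dataclass
class EditSession:
    original_image: str
    turns: list[EditTurn] = field(default_factory=list)
    preferences: list[PreferencePair] = field(default_factory=list)
```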

In summary, visual intelligence is evolving from standalone perception into an integrated intelligence combining perception, cognition, and action. This is not a matter of incremental performance gains but a systematic reconstruction of reasoning mechanisms, evaluation paradigms, and data supply chains.