Recently, ByteDance and the University of Hong Kong jointly released Mini-o3, a new open-source visual reasoning model that marks another step forward in multi-turn visual reasoning. Unlike earlier vision-language models (VLMs) that could handle only one or two rounds of interaction, Mini-o3 caps training at six interaction rounds yet scales to dozens of reasoning rounds at test time, greatly improving its ability to handle hard visual questions.
Mini-o3's strength lies in deep reasoning on difficult visual search tasks, where it reaches the top level of current technology. This rests on three core design elements. First, the research team built the VisualProbe dataset, containing thousands of visual search challenges designed for exploratory reasoning. Second, they developed an iterative data-collection pipeline that lets the model learn diverse reasoning strategies such as depth-first search, trial-and-error exploration, and goal maintenance. Finally, they proposed an over-turn masking strategy that, during reinforcement learning, avoids penalizing trajectories that hit the maximum number of interaction rounds without producing an answer, which improves training efficiency and test-time scalability.
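The masking idea can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: the `Trajectory` record, the `loss_weights` helper, and the binary reward convention are all assumptions made here. The point it shows is that a rollout which exhausts the turn cap without answering contributes zero loss weight, rather than being scored as a wrong answer.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Trajectory:
    reward: float   # assumed convention: 1.0 if the final answer is correct, else 0.0
    turns: int      # number of interaction rounds used
    answered: bool  # whether a final answer was emitted before the cap

MAX_TURNS = 6  # the training-time cap mentioned in the article

def loss_weights(trajs: List[Trajectory]) -> List[float]:
    """Per-trajectory loss weights under over-turn masking.

    A trajectory that hits the turn cap without answering is masked out
    (weight 0.0) instead of being treated as a failure, so long reasoning
    chains are not punished during reinforcement learning.
    """
    weights = []
    for t in trajs:
        if not t.answered and t.turns >= MAX_TURNS:
            weights.append(0.0)  # masked: contributes no gradient signal
        else:
            weights.append(1.0)  # normal RL loss applies
    return weights
```

Without this masking, a capped-but-unanswered rollout would look identical to an incorrect answer, biasing the model toward short, premature responses.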
The training process of Mini-o3 has two stages. The first is cold-start supervised fine-tuning (SFT), which activates multi-turn tool-use capabilities; the team collected a large number of high-quality reasoning trajectories via in-context learning. The second is reinforcement learning (RL), which lowers the image pixel budget and introduces the over-turn masking mechanism, markedly increasing the model's interaction rounds and reasoning capability.
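Both stages revolve around the same multi-turn interaction pattern: each round the model either calls a vision tool or commits to a final answer, with the round budget capped at six during training but expandable at test time. The loop below is a minimal sketch of one such episode; the `policy` callable, the action-dictionary format, and the tool protocol are assumptions made here, not the authors' API.

```python
def run_episode(policy, question, image, max_turns):
    """Minimal multi-turn visual-reasoning episode (a sketch, not the
    paper's implementation). Each turn the policy either calls a vision
    tool (e.g. a crop/zoom on the current view) or emits a final answer.
    """
    observation = image
    for turn in range(max_turns):
        action = policy(question, observation, turn)
        if action["type"] == "answer":
            # Model committed to an answer; report rounds actually used.
            return action["content"], turn + 1
        # Tool call: replace the observation with the tool's output,
        # e.g. a zoomed-in image region.
        observation = action["tool"](observation)
    # Turn budget exhausted with no answer: this is exactly the case
    # that over-turn masking excludes from the RL loss during training.
    return None, max_turns
```

At training time this loop would run with `max_turns=6`; at test time the same loop can be given a budget of dozens of rounds, which is how the article describes the train/test asymmetry.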
Mini-o3 performs strongly on multiple visual search benchmarks, surpassing existing open-source models. In comparison experiments, the researchers found that cold-start SFT and over-turn masking are the key contributors to the reasoning gains, and that a well-chosen maximum pixel budget is also crucial for model performance.
The release of Mini-o3 not only sets a new technical bar but also points to new directions for multi-turn visual reasoning. Its success shows that deep thinking and complex reasoning can be achieved without consuming large amounts of training resources.
Paper URL: https://arxiv.org/pdf/2509.07969