Recently, ByteDance and the University of Hong Kong jointly released Mini-o3, a new open-source visual reasoning model that marks another step forward in multi-turn visual reasoning. Unlike earlier vision-language models (VLMs) that could handle only one or two rounds of interaction, Mini-o3 caps training at six interaction rounds yet scales to dozens of reasoning rounds at test time, greatly improving its ability to handle hard visual questions.
Mini-o3's strength lies in deep reasoning on difficult visual search tasks, where it reaches the top level of current technology. This rests on three core design elements. First, the research team built the VisualProbe dataset, containing thousands of visual search challenges designed for exploratory reasoning. Second, they developed an iterative data-collection pipeline that lets the model learn diverse reasoning strategies such as depth-first search, trial-and-error exploration, and goal maintenance. Finally, they proposed an over-turn masking strategy that, during reinforcement learning, avoids penalizing trajectories that hit the maximum number of interaction rounds without producing an answer, which improves training efficiency and test-time scalability.
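The masking idea can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: the `Trajectory` record, the `loss_weights` helper, and the binary reward convention are all assumptions made here. The point it shows is that a rollout which exhausts the turn cap without answering contributes zero loss weight, rather than being scored as a wrong answer.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Trajectory:
    reward: float   # assumed convention: 1.0 if the final answer is correct, else 0.0
    turns: int      # number of interaction rounds used
    answered: bool  # whether a final answer was emitted before the cap

MAX_TURNS = 6  # the training-time cap mentioned in the article

def loss_weights(trajs: List[Trajectory]) -> List[float]:
    """Per-trajectory loss weights under over-turn masking.

    A trajectory that hits the turn cap without answering is masked out
    (weight 0.0) instead of being treated as a failure, so long reasoning
    chains are not punished during reinforcement learning.
    """
    weights = []
    for t in trajs:
        if not t.answered and t.turns >= MAX_TURNS:
            weights.append(0.0)  # masked: contributes no gradient signal
        else:
            weights.append(1.0)  # normal RL loss applies
    return weights
```

Without this masking, a capped-but-unanswered rollout would look identical to an incorrect answer, biasing the model toward short, premature responses.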
The training process of Mini-o3 has two stages. The first is cold-start supervised fine-tuning (SFT), which activates multi-turn tool-use capabilities; the team collected a large number of high-quality reasoning trajectories via in-context learning. The second is reinforcement learning (RL), which lowers the image pixel budget and introduces the over-turn masking mechanism, markedly increasing the model's interaction rounds and reasoning capability.
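Both stages revolve around the same multi-turn interaction pattern: each round the model either calls a vision tool or commits to a final answer, with the round budget capped at six during training but expandable at test time. The loop below is a minimal sketch of one such episode; the `policy` callable, the action-dictionary format, and the tool protocol are assumptions made here, not the authors' API.

```python
def run_episode(policy, question, image, max_turns):
    """Minimal multi-turn visual-reasoning episode (a sketch, not the
    paper's implementation). Each turn the policy either calls a vision
    tool (e.g. a crop/zoom on the current view) or emits a final answer.
    """
    observation = image
    for turn in range(max_turns):
        action = policy(question, observation, turn)
        if action["type"] == "answer":
            # Model committed to an answer; report rounds actually used.
            return action["content"], turn + 1
        # Tool call: replace the observation with the tool's output,
        # e.g. a zoomed-in image region.
        observation = action["tool"](observation)
    # Turn budget exhausted with no answer: this is exactly the case
    # that over-turn masking excludes from the RL loss during training.
    return None, max_turns
```

At training time this loop would run with `max_turns=6`; at test time the same loop can be given a budget of dozens of rounds, which is how the article describes the train/test asymmetry.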
Mini-o3 performs strongly on multiple visual search benchmarks, surpassing existing open-source models. In comparison experiments, the researchers found that cold-start SFT and over-turn masking are the key contributors to the reasoning gains, and that a well-chosen maximum pixel budget is also crucial for model performance.
The release of Mini-o3 not only sets a new technical bar but also points to new directions for multi-turn visual reasoning. Its success shows that deep thinking and complex reasoning can be achieved without consuming large amounts of training resources.
Paper URL: https://arxiv.org/pdf/2509.07969