With the rapid development of artificial intelligence, particularly of large reasoning models such as OpenAI's o3, researchers are working to equip these models with stronger agent capabilities that extend beyond text processing to image understanding and manipulation. Recently, a research team from Shanghai Jiao Tong University, Shanghai AI Lab, The Chinese University of Hong Kong, and Wuhan University introduced a new method called Visual-ARFT (Visual Agentic Reinforcement Fine-Tuning), aimed at improving the multimodal agent capabilities of vision-language models so that they can execute complex tasks more flexibly.
The core of Visual-ARFT lies in endowing models with "tool agent" capabilities: the model can not only analyze and understand images but also actively call external tools to search for information or write code. Faced with a complex multimodal problem, the model can autonomously decompose the task, plan its steps, and carry them out. For example, after analyzing an image it can query a search engine for the information it needs, or generate Python code that processes the image for visual question answering.
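To make this concrete, the Python sketch below illustrates the kind of tool-calling loop described above. It is a minimal illustration, not the paper's implementation: the tag-based output format and the tool names (`web_search`, `run_python`) are assumptions for the example.

```python
import re

# Hypothetical tools the model may call; names and signatures are illustrative only.
def web_search(query: str) -> str:
    """Placeholder: return text snippets from a search engine."""
    raise NotImplementedError("wire up a real search API here")

def run_python(code: str, image_bytes: bytes) -> str:
    """Placeholder: run model-written image-processing code in a sandbox."""
    raise NotImplementedError("execute in a restricted sandbox, never eval() directly")

def agent_step(model_output: str, image_bytes: bytes) -> str | None:
    """Parse one model turn: either execute a tool call or detect a final answer.

    Assumes the model wraps tool calls in <search>/<code> tags and its final
    answer in <answer> tags -- a common convention, not necessarily the one
    Visual-ARFT uses.
    """
    if m := re.search(r"<search>(.*?)</search>", model_output, re.S):
        return web_search(m.group(1).strip())        # observation fed back to the model
    if m := re.search(r"<code>(.*?)</code>", model_output, re.S):
        return run_python(m.group(1).strip(), image_bytes)
    if re.search(r"<answer>(.*?)</answer>", model_output, re.S):
        return None                                  # final answer reached; stop the loop
    return "Invalid output format, please retry."    # nudge the model back on format
```

The loop simply repeats: the model emits a turn, `agent_step` executes any tool call, and the observation is appended to the conversation until an answer tag appears.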
To evaluate the effectiveness of Visual-ARFT, the research team built a new evaluation benchmark called MAT-Bench (Multimodal Agent Tool Benchmark). It contains a set of complex multi-hop visual question answering tasks designed to assess a model's tool-calling and multimodal reasoning abilities. Test results show that models trained with Visual-ARFT perform strongly on several subtasks, surpassing advanced models such as GPT-4o and demonstrating significant potential.
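For intuition, a single multi-hop item in a benchmark of this kind might look like the record below. The layout is purely hypothetical, not MAT-Bench's actual schema; it only shows why answering requires chaining image understanding with tool calls.

```python
# Purely illustrative record layout; not the actual MAT-Bench schema.
example_item = {
    "image": "landmark.jpg",  # hypothetical file name
    "question": "In what year was the architect of this building born?",
    # Multi-hop: (1) identify the building from the image, (2) search for
    # its architect, (3) search for the architect's birth year.
    "answer": "1887",
    "required_tools": ["search"],
}
```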
Notably, Visual-ARFT adopts a training strategy based on reinforcement fine-tuning: a simple, efficient reward mechanism drives the model to explore tool use and to form complete reasoning chains. Using only a small amount of training data, the research team was able to substantially enhance the model's multimodal agent capabilities.
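The paper's exact reward design is not reproduced here, but rule-based rewards of this kind typically combine a format check with a measure of answer correctness. The sketch below assumes such a scheme; the tag convention, the token-level F1 matching, and the equal weighting are all assumptions, not confirmed details of Visual-ARFT.

```python
import re
from collections import Counter

def format_reward(output: str) -> float:
    """1.0 if the response follows the expected tagged structure, else 0.0.
    The <think>/<answer> convention is an assumption borrowed from common
    reinforcement fine-tuning setups."""
    pattern = r"<think>.*?</think>.*?<answer>.*?</answer>"
    return 1.0 if re.search(pattern, output, re.S) else 0.0

def f1_reward(prediction: str, reference: str) -> float:
    """Token-level F1 between the predicted and reference answers."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def total_reward(output: str, reference: str) -> float:
    """Hypothetical combination: equal-weighted format and accuracy terms."""
    m = re.search(r"<answer>(.*?)</answer>", output, re.S)
    answer = m.group(1).strip() if m else ""
    return 0.5 * format_reward(output) + 0.5 * f1_reward(answer, reference)
```

Because both terms are verifiable by simple rules rather than a learned reward model, a scheme like this stays cheap to compute at scale, which is consistent with the article's point about training on a small amount of data.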
Looking ahead, Visual-ARFT not only opens a new path for developing agent capabilities but may also have far-reaching impact in fields such as image processing and intelligent search. As the technology matures, we look forward to seeing such agents perform in ever more complex scenarios, pushing the boundaries of artificial intelligence further.
Project address: https://github.com/Liuziyu77/Visual-RFT/tree/main/Visual-ARFT