Recently, the Natural Language Intelligence Team of Tongyi Lab officially released and open-sourced VRAG-RL, a multimodal RAG reasoning framework driven by visual perception. It aims to address the challenge of retrieving key information from visually rich content such as images, tables, and design drafts in real-world business scenarios and reasoning over it at a fine-grained level.
Retrieving and reasoning over key information in complex visual document knowledge bases has long been a major challenge in AI. Traditional Retrieval-Augmented Generation (RAG) methods struggle with visual content such as images and charts, while existing visual RAG methods are constrained by fixed retrieve-then-generate pipelines, making it hard to fully mine the critical knowledge embedded in visual information.
To tackle these challenges, the VRAG-RL framework systematically innovates along three dimensions: reinforcement learning-based multimodal agent training, visual perception mechanism design, and joint optimization of retrieval and reasoning. It introduces a set of visual perception actions, such as region selection, cropping, and scaling, that let the model progressively focus on information-dense regions, moving from coarse to fine granularity and accurately extracting key visual information. This coarse-to-fine perception approach not only deepens the model's understanding of visual content but also significantly improves retrieval efficiency.
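To make this concrete, here is a minimal Python sketch of what such a coarse-to-fine perception action could look like. The class and function names are illustrative assumptions, not VRAG-RL's actual API.

```python
# Illustrative sketch of a visual perception action: the agent selects a
# region of a retrieved page image, crops it, and re-scales it so that
# fine-grained details (small text, chart labels) become legible to the VLM.

from dataclasses import dataclass
from PIL import Image


@dataclass
class RegionSelect:
    """Normalized bounding box (0-1 coordinates) chosen by the policy model."""
    left: float
    top: float
    right: float
    bottom: float


def apply_perception_action(page: Image.Image,
                            action: RegionSelect,
                            target_width: int = 1024) -> Image.Image:
    """Crop the selected region and upscale it (coarse -> fine zoom-in)."""
    w, h = page.size
    box = (int(action.left * w), int(action.top * h),
           int(action.right * w), int(action.bottom * h))
    crop = page.crop(box)
    # Upscale so the region occupies more visual tokens in the next turn.
    scale = target_width / max(crop.width, 1)
    return crop.resize((target_width, int(crop.height * scale)))


# Example: zoom into the top-right quadrant of a retrieved document page.
# page = Image.open("retrieved_page.png")
# zoomed = apply_perception_action(page, RegionSelect(0.5, 0.0, 1.0, 0.5))
```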
During training, VRAG-RL adopts a multi-expert sampling strategy that combines the reasoning capability of large-scale models with the precise annotation ability of expert models, enabling the policy to learn more effective visual perception strategies. Its fine-grained reward mechanism integrates retrieval efficiency, pattern consistency, and generation quality, guiding the model to continually refine its retrieval and reasoning paths through interaction with the search engine. This multidimensional reward jointly drives retrieval and reasoning, forming a closed optimization loop.
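As a rough illustration of how such a multidimensional reward could be combined, the sketch below blends the three signals into one scalar. The weights and component definitions are assumptions for illustration, not VRAG-RL's exact formulation.

```python
# Hypothetical composite reward in the spirit described above:
# retrieval efficiency + pattern/format consistency + answer quality.

def composite_reward(num_search_calls: int,
                     max_calls: int,
                     output_follows_format: bool,
                     answer_score: float) -> float:
    """Blend the three signals into a single scalar reward in [0, 1]."""
    # Fewer retrieval rounds to reach the answer -> higher efficiency term.
    retrieval_efficiency = 1.0 - min(num_search_calls, max_calls) / max_calls
    # Pattern consistency: does the trajectory follow the expected output format?
    pattern_consistency = 1.0 if output_follows_format else 0.0
    # answer_score could be an exact-match or model-judged score in [0, 1].
    return (0.2 * retrieval_efficiency
            + 0.2 * pattern_consistency
            + 0.6 * answer_score)


# Example: 2 of at most 5 search rounds, well-formatted output, correct answer.
print(composite_reward(2, 5, True, 1.0))  # approximately 0.92
```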
VRAG-RL also adopts the GRPO algorithm and deploys a local search engine to simulate real-world application scenarios, so search engine calls incur no cost during training and the training process becomes more efficient. This training setup not only strengthens the model's generalization ability but also allows it to perform well across different domains and types of visual tasks.
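For readers unfamiliar with GRPO, the short sketch below shows its core idea of group-relative advantages: several rollouts are sampled per query and each rollout's reward is normalized against its own group. This illustrates the general algorithm, not VRAG-RL's training code.

```python
# Group-relative advantage at the heart of GRPO: no learned value model is
# needed; each rollout is scored against the mean and spread of its group.

from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Advantage of each rollout relative to its own sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four rollouts for the same query against the local search engine.
# Rollouts with above-average reward get positive advantages and are reinforced.
print(group_relative_advantages([0.9, 0.4, 0.7, 0.2]))
```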
Experimental results show that VRAG-RL significantly outperforms existing methods on multiple visual language benchmark datasets, covering tasks from single-hop to multi-hop reasoning and from pure text understanding to chart recognition, complex layout parsing, and other visually rich scenarios. Compared with both traditional prompt-based methods and reinforcement learning-based approaches, VRAG-RL delivers superior overall performance.
In addition, VRAG-RL supports multi-round interaction with the search engine, gradually focusing on information-dense areas during reasoning to acquire information from coarse to fine granularity. At the same time, it optimizes retrieval efficiency and reasoning paths, improving performance on visual tasks while maintaining high efficiency.
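The overall interaction pattern can be pictured as a simple loop, sketched below with illustrative names (the model, search_engine, and action labels are assumptions): the model alternates between reasoning, searching, and zooming in until it emits an answer or exhausts its turn budget.

```python
# Sketch of a multi-round interaction episode; object interfaces are
# hypothetical and stand in for whatever policy model and retriever are used.

def run_episode(model, search_engine, question, max_turns=5):
    """Alternate between search and perception actions until an answer is produced."""
    context = [question]
    for _ in range(max_turns):
        step = model.generate(context)                        # think + choose an action
        if step.action == "search":
            context.append(search_engine.query(step.query))   # retrieve new page images
        elif step.action == "crop":
            context.append(step.cropped_region)               # zoomed-in view of a region
        elif step.action == "answer":
            return step.answer
    return None  # no answer within the turn budget
```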
GitHub: github.com/Alibaba-NLP/VRAG