Recently, the Natural Language Intelligence Team of Tongyi Lab officially released and open-sourced VRAG-RL, a multimodal RAG reasoning framework driven by visual perception. It aims to address the challenge of retrieving key information from visually rich content such as images, tables, and design drafts in real-world business scenarios and reasoning over it at a fine-grained level.
Retrieving and reasoning over key information in complex visual document knowledge bases has long been a major challenge in AI. Traditional Retrieval-Augmented Generation (RAG) methods struggle with visual content such as images and charts, while existing visual RAG methods are constrained by fixed retrieve-then-generate pipelines, making it hard to fully mine the critical knowledge embedded in visual information.
To tackle these challenges, the VRAG-RL framework systematically innovates along three dimensions: reinforcement learning-based multimodal agent training, visual perception mechanism design, and joint optimization of retrieval and reasoning. It introduces a set of visual perception actions, such as region selection, cropping, and scaling, that let the model progressively focus on information-dense regions, moving from coarse to fine granularity and accurately extracting key visual information. This coarse-to-fine perception approach not only deepens the model's understanding of visual content but also significantly improves retrieval efficiency.
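To make this concrete, here is a minimal Python sketch of what such a coarse-to-fine perception action could look like. The class and function names are illustrative assumptions, not VRAG-RL's actual API.

```python
# Illustrative sketch of a visual perception action: the agent selects a
# region of a retrieved page image, crops it, and re-scales it so that
# fine-grained details (small text, chart labels) become legible to the VLM.

from dataclasses import dataclass
from PIL import Image


@dataclass
class RegionSelect:
    """Normalized bounding box (0-1 coordinates) chosen by the policy model."""
    left: float
    top: float
    right: float
    bottom: float


def apply_perception_action(page: Image.Image,
                            action: RegionSelect,
                            target_width: int = 1024) -> Image.Image:
    """Crop the selected region and upscale it (coarse -> fine zoom-in)."""
    w, h = page.size
    box = (int(action.left * w), int(action.top * h),
           int(action.right * w), int(action.bottom * h))
    crop = page.crop(box)
    # Upscale so the region occupies more visual tokens in the next turn.
    scale = target_width / max(crop.width, 1)
    return crop.resize((target_width, int(crop.height * scale)))


# Example: zoom into the top-right quadrant of a retrieved document page.
# page = Image.open("retrieved_page.png")
# zoomed = apply_perception_action(page, RegionSelect(0.5, 0.0, 1.0, 0.5))
```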
During training, VRAG-RL adopts a multi-expert sampling strategy that combines the reasoning capability of large-scale models with the precise annotation ability of expert models, enabling the policy to learn more effective visual perception strategies. Its fine-grained reward mechanism integrates retrieval efficiency, pattern consistency, and generation quality, guiding the model to continually refine its retrieval and reasoning paths through interaction with the search engine. This multidimensional reward jointly drives retrieval and reasoning, forming a closed optimization loop.
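As a rough illustration of how such a multidimensional reward could be combined, the sketch below blends the three signals into one scalar. The weights and component definitions are assumptions for illustration, not VRAG-RL's exact formulation.

```python
# Hypothetical composite reward in the spirit described above:
# retrieval efficiency + pattern/format consistency + answer quality.

def composite_reward(num_search_calls: int,
                     max_calls: int,
                     output_follows_format: bool,
                     answer_score: float) -> float:
    """Blend the three signals into a single scalar reward in [0, 1]."""
    # Fewer retrieval rounds to reach the answer -> higher efficiency term.
    retrieval_efficiency = 1.0 - min(num_search_calls, max_calls) / max_calls
    # Pattern consistency: does the trajectory follow the expected output format?
    pattern_consistency = 1.0 if output_follows_format else 0.0
    # answer_score could be an exact-match or model-judged score in [0, 1].
    return (0.2 * retrieval_efficiency
            + 0.2 * pattern_consistency
            + 0.6 * answer_score)


# Example: 2 of at most 5 search rounds, well-formatted output, correct answer.
print(composite_reward(2, 5, True, 1.0))  # approximately 0.92
```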
VRAG-RL also adopts the GRPO algorithm and deploys a local search engine to simulate real-world application scenarios, so search engine calls incur no cost during training and the training process becomes more efficient. This training setup not only strengthens the model's generalization ability but also allows it to perform well across different domains and types of visual tasks.
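For readers unfamiliar with GRPO, the short sketch below shows its core idea of group-relative advantages: several rollouts are sampled per query and each rollout's reward is normalized against its own group. This illustrates the general algorithm, not VRAG-RL's training code.

```python
# Group-relative advantage at the heart of GRPO: no learned value model is
# needed; each rollout is scored against the mean and spread of its group.

from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Advantage of each rollout relative to its own sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four rollouts for the same query against the local search engine.
# Rollouts with above-average reward get positive advantages and are reinforced.
print(group_relative_advantages([0.9, 0.4, 0.7, 0.2]))
```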
Experimental results show that VRAG-RL significantly outperforms existing methods on multiple visual language benchmark datasets, covering tasks from single-hop to multi-hop reasoning and from pure text understanding to chart recognition, complex layout parsing, and other visually rich scenarios. Compared with both traditional prompt-based methods and reinforcement learning-based approaches, VRAG-RL delivers superior overall performance.
In addition, VRAG-RL supports multi-round interaction with the search engine, gradually focusing on information-dense areas during reasoning to acquire information from coarse to fine granularity. At the same time, it optimizes retrieval efficiency and reasoning paths, improving performance on visual tasks while maintaining high efficiency.
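The overall interaction pattern can be pictured as a simple loop, sketched below with illustrative names (the model, search_engine, and action labels are assumptions): the model alternates between reasoning, searching, and zooming in until it emits an answer or exhausts its turn budget.

```python
# Sketch of a multi-round interaction episode; object interfaces are
# hypothetical and stand in for whatever policy model and retriever are used.

def run_episode(model, search_engine, question, max_turns=5):
    """Alternate between search and perception actions until an answer is produced."""
    context = [question]
    for _ in range(max_turns):
        step = model.generate(context)                        # think + choose an action
        if step.action == "search":
            context.append(search_engine.query(step.query))   # retrieve new page images
        elif step.action == "crop":
            context.append(step.cropped_region)               # zoomed-in view of a region
        elif step.action == "answer":
            return step.answer
    return None  # no answer within the turn budget
```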
GitHub: github.com/Alibaba-NLP/VRAG