When text, images, videos, charts, and even user interfaces can be uniformly "understood" and precisely matched, the boundaries of multimodal information retrieval are being completely redefined. Today, Alibaba Tongyi Lab officially open-sources two models: Qwen3-VL-Embedding and Qwen3-VL-Reranker. Built on the powerful Qwen3-VL multimodal foundation, these models are designed for cross-modal understanding and efficient retrieval, marking a significant leap from the era of "keyword matching" to one of "semantic alignment" in multimodal search.
These two models do not exist in isolation but form a collaborative intelligent retrieval engine. Qwen3-VL-Embedding uses an efficient dual-tower architecture to independently encode diverse content such as text, images, visual documents (e.g., code screenshots, data charts, app interfaces), and even videos into vector representations within a unified high-dimensional semantic space. This means that whether the user input is a textual description, a product image, or a short video, the system can map it into the same semantic coordinate system, enabling millisecond-level cross-modal similarity calculations and recall over massive datasets.
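As a rough illustration of how a shared vector space is used for recall, the sketch below wires a placeholder encoder into a cosine-similarity search step. The `encode` function, its dict-based input format, and the 1024-dimension output are assumptions for illustration only, not the official Qwen3-VL-Embedding interface.

```python
import hashlib

import numpy as np


def encode(item: dict) -> np.ndarray:
    """Stand-in encoder producing a unit-length pseudo-embedding.

    In a real pipeline this would call Qwen3-VL-Embedding; the dict input
    format and 1024-dim output here are illustrative assumptions.
    """
    seed = int.from_bytes(
        hashlib.sha256(repr(sorted(item.items())).encode()).digest()[:4], "big"
    )
    vec = np.random.default_rng(seed).standard_normal(1024)
    return vec / np.linalg.norm(vec)  # L2-normalize so dot product == cosine


# Text, images, and video all land in the same vector space, so one
# similarity function covers every cross-modal pairing.
query = encode({"text": "red sneakers on a wooden floor"})
corpus = [
    {"image": "product_001.jpg"},
    {"image": "product_002.jpg"},
    {"video": "unboxing_clip.mp4"},
]
doc_vecs = np.stack([encode(doc) for doc in corpus])
scores = doc_vecs @ query          # cosine similarity on unit vectors
top_k = np.argsort(-scores)[:2]    # fast recall of candidate documents
print([corpus[i] for i in top_k])
```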
Meanwhile, Qwen3-VL-Reranker acts as a "refiner." It employs a single-tower cross-attention architecture to perform deep re-ranking on the initial results returned by the Embedding model. For complex tasks such as matching an image-plus-text query against image-and-text documents, or retrieving related articles from a video segment, the Reranker jointly encodes the query and each candidate document, using its internal cross-attention mechanism to analyze deeper associations in semantics, fine-grained details, and contextual logic, and ultimately outputs a precise relevance score. This two-stage process of "fast embedding retrieval + precise reranking" significantly improves the accuracy and relevance of the final retrieval results.
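The two-stage flow itself is simple to express in code. The sketch below shows only the orchestration: stage-one scores are assumed to come from the embedding model, and the reranker is represented by a toy word-overlap scorer so the example runs end to end; the real Qwen3-VL-Reranker call and its input format would replace that placeholder.

```python
from typing import Callable, Sequence


def two_stage_search(
    query: dict,
    corpus: Sequence[dict],
    embed_scores: Sequence[float],           # stage 1: dual-tower similarity scores
    rerank: Callable[[dict, dict], float],   # stage 2: joint (query, doc) scorer
    recall_k: int = 100,
    final_k: int = 10,
) -> list[dict]:
    """Fast vector recall over the whole corpus, then precise reranking
    applied only to the small recalled subset."""
    recalled = sorted(range(len(corpus)), key=lambda i: -embed_scores[i])[:recall_k]
    reranked = sorted(recalled, key=lambda i: -rerank(query, corpus[i]))
    return [corpus[i] for i in reranked[:final_k]]


# Toy scorer standing in for Qwen3-VL-Reranker; the real model jointly encodes
# the query/candidate pair with cross-attention. This placeholder just counts
# overlapping words so the example is runnable.
def toy_rerank(q: dict, d: dict) -> float:
    return float(len(set(q.get("text", "").split()) & set(d.get("caption", "").split())))


corpus = [
    {"caption": "red sneakers on a wooden floor", "image": "p1.jpg"},
    {"caption": "blue running shoes outdoors", "image": "p2.jpg"},
]
results = two_stage_search(
    {"text": "red sneakers"}, corpus, embed_scores=[0.71, 0.64], rerank=toy_rerank
)
print(results)
```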
Technical strength is ultimately proven by data. On authoritative multimodal benchmarks such as MMEB-v2 and MMTEB, the Qwen3-VL series performs strongly. The 8B version of the Embedding model surpassed all known open-source models and mainstream closed-source commercial services on MMEB-v2, while the Reranker model leads on visual document retrieval tasks including JinaVDR and ViDoRe v3, with the 8B version taking first place in most subtasks. Notably, the series inherits the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and offers flexible vector dimensions, instruction fine-tuning, and high-performance quantized versions, greatly lowering the integration barrier for developers.
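One practical consequence of flexible vector dimensions is that an index can trade a little accuracy for a much smaller footprint. The sketch below shows the common truncate-and-renormalize (Matryoshka-style) pattern; whether Qwen3-VL-Embedding exposes its flexible dimensions exactly this way is an assumption, and the supported sizes should be taken from the model card.

```python
import numpy as np


def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize.

    Matryoshka-style truncation is a common way embedding models expose
    flexible vector dimensions; whether Qwen3-VL-Embedding implements it
    exactly this way is an assumption made for illustration.
    """
    v = vec[:dim]
    return v / np.linalg.norm(v)


full = np.random.default_rng(0).standard_normal(1024)
full /= np.linalg.norm(full)
compact = truncate_embedding(full, 256)  # 4x smaller index at a small accuracy cost
print(compact.shape)                     # (256,)
```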
This open-source release is not only a technical achievement but also marks the maturing of multimodal AI infrastructure. In the past, image-text retrieval, video understanding, and document analysis often required separate models and pipelines. Now, the Qwen3-VL twin models provide a unified, efficient, and open solution that lets developers handle almost all mixed-modal content within a single framework. As real-world data increasingly arrives in multimodal form, this toolset may accelerate the next generation of search engines, content platforms, enterprise knowledge bases, and intelligent assistants: systems in which machines truly "see" and "understand" everything we see, write, and photograph.