Recently, the Tongyi Foundation Model team released CoGenAV, which brings the idea of audio-visual synchronization to speech recognition and effectively addresses the long-standing problem of noise interference.

Traditional speech recognition systems perform poorly in noisy environments. CoGenAV takes a different approach: it learns the temporal alignment among audio, visual, and text information to build a more robust and generalizable speech representation framework, which systematically improves performance on multiple speech-centric tasks, including Visual and Audio-Visual Speech Recognition (VSR/AVSR), Audio-Visual Speech Separation and Enhancement (AVSS/AVSE), and Active Speaker Detection (ASD).


In its technical implementation, CoGenAV adopts a "contrastive-generative synchronization" strategy. For feature extraction, the model uses a ResNet3D CNN to analyze speakers' lip movements in video, capturing the dynamic correlation between sound and mouth shape, while a Transformer encoder extracts speech information from the audio; the two streams of features are then precisely aligned. Contrastive-generative synchronization training strengthens the model's understanding through two complementary objectives. Contrastive synchronization uses sequence-to-sequence contrastive learning to reinforce the correspondence between audio and visual features, applying a ReLU activation to filter out interfering frames. Generative synchronization aligns the audio-visual features with the acoustic-text representations of a pre-trained ASR model, using a lightweight adapter module to improve cross-modal fusion efficiency.
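
To make the contrastive-synchronization idea more concrete, below is a minimal sketch of a sequence-to-sequence contrastive objective with ReLU-filtered frame similarities, written from the description above under assumed shapes and names; it is not the released CoGenAV code. Here `visual_feats` and `audio_feats` stand in for the outputs of the ResNet3D and Transformer encoders.

```python
# Minimal sketch (assumptions, not the authors' implementation): frame-level
# cosine similarities between visual and audio features are clamped with ReLU
# so poorly aligned frames contribute nothing, then averaged over time into a
# per-pair synchronization score used in a symmetric contrastive loss.
import torch
import torch.nn.functional as F

def seq2seq_contrastive_loss(visual_feats, audio_feats, temperature=0.07):
    """visual_feats, audio_feats: (batch, time, dim) frame-level features."""
    v = F.normalize(visual_feats, dim=-1)
    a = F.normalize(audio_feats, dim=-1)

    # Pairwise per-frame similarity between every video and every audio clip
    # in the batch: (batch_v, batch_a, time).
    sim = torch.einsum('btd,ctd->bct', v, a)

    # ReLU filters interfering (negatively correlated) frames before pooling.
    scores = F.relu(sim).mean(dim=-1) / temperature    # (batch_v, batch_a)

    # Symmetric InfoNCE: matching audio/video pairs lie on the diagonal.
    labels = torch.arange(scores.size(0), device=scores.device)
    loss_v2a = F.cross_entropy(scores, labels)
    loss_a2v = F.cross_entropy(scores.t(), labels)
    return 0.5 * (loss_v2a + loss_a2v)

if __name__ == "__main__":
    B, T, D = 8, 50, 256   # illustrative batch, frame count, feature width
    loss = seq2seq_contrastive_loss(torch.randn(B, T, D), torch.randn(B, T, D))
    print(loss.item())
```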

Thanks to these techniques, CoGenAV achieves breakthrough results on multiple benchmark datasets. In Visual Speech Recognition (VSR), trained on only 223 hours of lip-motion video, it reaches a Word Error Rate (WER) of 20.5% on the LRS2 dataset, comparable to traditional models trained on thousands of hours of data. In Audio-Visual Speech Recognition (AVSR), combined with the Whisper Medium model, it reaches a WER of 1.27% on the same dataset, setting a new state-of-the-art, and in a 0 dB noise environment its performance improves by over 80%, significantly outperforming audio-only models. In speech enhancement and separation (AVSE/AVSS), used as a visual feature extractor, it reaches an SDRi of 16.0 dB on LRS2 speech separation, surpassing AV-HuBERT by 1.6 dB and AV-SepFormer by 0.3 dB; on speech enhancement its SDRi is 9.0 dB, 1.6 dB better than AV-HuBERT. In Active Speaker Detection (ASD), it reaches a mean average precision (mAP) of 96.3% on the Talkies dataset, ahead of existing methods.

CoGenAV can be plugged directly into mainstream speech recognition models such as Whisper, without modifying or fine-tuning them, to enable visual speech recognition, which lowers the barrier to deployment. It also shows strong noise robustness and data efficiency, substantially reducing training costs and improving the model's practicality and scalability. The CoGenAV code and models are open-sourced on GitHub, HuggingFace, and ModelScope, and the paper is available on arXiv, for researchers and developers to use.
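
As an illustration of this plug-in style of integration, here is a hypothetical sketch in which a small trainable adapter projects CoGenAV-style audio-visual features into the input space of a frozen recognizer. The class name and dimensions are placeholders chosen for this example, not the released CoGenAV or Whisper APIs.

```python
# Hypothetical sketch: a lightweight adapter maps audio-visual features into
# the feature space a frozen ASR head expects. Only the adapter would be
# trained; the recognizer itself stays untouched. Names/dims are assumptions.
import torch
import torch.nn as nn

class LightweightAdapter(nn.Module):
    """Projects fused audio-visual features to the width expected by a frozen
    speech-recognition model (e.g. the encoder width of a Whisper-sized model)."""
    def __init__(self, av_dim: int = 256, asr_dim: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(av_dim, asr_dim),
            nn.GELU(),
            nn.Linear(asr_dim, asr_dim),
        )

    def forward(self, av_feats: torch.Tensor) -> torch.Tensor:
        # av_feats: (batch, time, av_dim) -> (batch, time, asr_dim)
        return self.proj(av_feats)

# Usage: the adapter output is what would be handed to the frozen recognizer.
adapter = LightweightAdapter()
av_feats = torch.randn(2, 100, 256)   # stand-in for CoGenAV encoder output
asr_inputs = adapter(av_feats)
print(asr_inputs.shape)               # torch.Size([2, 100, 1024])
```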

GitHub: https://github.com/HumanMLLM/CoGenAV

arXiv: https://arxiv.org/pdf/2505.03186

HuggingFace: https://huggingface.co/detao/CoGenAV

ModelScope: https://modelscope.cn/models/iic/cogenav