Recently, the Tongyi Foundation Model team released CoGenAV, which brings the idea of audio-visual synchronization to speech recognition and effectively addresses the long-standing problem of noise interference.

Traditional speech recognition systems perform poorly in noisy environments. CoGenAV takes a different approach: it learns the temporal alignment among audio, visual, and text information to build a more robust and generalizable speech representation framework, which systematically improves performance on multiple speech-centric tasks, including Visual and Audio-Visual Speech Recognition (VSR/AVSR), Audio-Visual Speech Separation and Enhancement (AVSS/AVSE), and Active Speaker Detection (ASD).


In its technical implementation, CoGenAV adopts a "contrastive-generative synchronization" strategy. For feature extraction, the model uses a ResNet3D CNN to analyze speakers' lip movements in video, capturing the dynamic correlation between sound and mouth shape, while a Transformer encoder extracts speech information from the audio; the two streams of features are then precisely aligned. Contrastive-generative synchronization training strengthens the model's understanding through two complementary objectives. Contrastive synchronization uses sequence-to-sequence contrastive learning to reinforce the correspondence between audio and visual features, applying a ReLU activation to filter out interfering frames. Generative synchronization aligns the audio-visual features with the acoustic-text representations of a pre-trained ASR model, using a lightweight adapter module to improve cross-modal fusion efficiency.
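
To make the contrastive-synchronization idea more concrete, below is a minimal sketch of a sequence-to-sequence contrastive objective with ReLU-filtered frame similarities, written from the description above under assumed shapes and names; it is not the released CoGenAV code. Here `visual_feats` and `audio_feats` stand in for the outputs of the ResNet3D and Transformer encoders.

```python
# Minimal sketch (assumptions, not the authors' implementation): frame-level
# cosine similarities between visual and audio features are clamped with ReLU
# so poorly aligned frames contribute nothing, then averaged over time into a
# per-pair synchronization score used in a symmetric contrastive loss.
import torch
import torch.nn.functional as F

def seq2seq_contrastive_loss(visual_feats, audio_feats, temperature=0.07):
    """visual_feats, audio_feats: (batch, time, dim) frame-level features."""
    v = F.normalize(visual_feats, dim=-1)
    a = F.normalize(audio_feats, dim=-1)

    # Pairwise per-frame similarity between every video and every audio clip
    # in the batch: (batch_v, batch_a, time).
    sim = torch.einsum('btd,ctd->bct', v, a)

    # ReLU filters interfering (negatively correlated) frames before pooling.
    scores = F.relu(sim).mean(dim=-1) / temperature    # (batch_v, batch_a)

    # Symmetric InfoNCE: matching audio/video pairs lie on the diagonal.
    labels = torch.arange(scores.size(0), device=scores.device)
    loss_v2a = F.cross_entropy(scores, labels)
    loss_a2v = F.cross_entropy(scores.t(), labels)
    return 0.5 * (loss_v2a + loss_a2v)

if __name__ == "__main__":
    B, T, D = 8, 50, 256   # illustrative batch, frame count, feature width
    loss = seq2seq_contrastive_loss(torch.randn(B, T, D), torch.randn(B, T, D))
    print(loss.item())
```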

Thanks to these techniques, CoGenAV achieves breakthrough results on multiple benchmark datasets. In Visual Speech Recognition (VSR), trained on only 223 hours of lip-motion video, it reaches a Word Error Rate (WER) of 20.5% on the LRS2 dataset, comparable to traditional models trained on thousands of hours of data. In Audio-Visual Speech Recognition (AVSR), combined with the Whisper Medium model, it reaches a WER of 1.27% on the same dataset, setting a new state-of-the-art, and in a 0 dB noise environment its performance improves by over 80%, significantly outperforming audio-only models. In speech enhancement and separation (AVSE/AVSS), used as a visual feature extractor, it reaches an SDRi of 16.0 dB on LRS2 speech separation, surpassing AV-HuBERT by 1.6 dB and AV-SepFormer by 0.3 dB; on speech enhancement its SDRi is 9.0 dB, 1.6 dB better than AV-HuBERT. In Active Speaker Detection (ASD), it reaches a mean average precision (mAP) of 96.3% on the Talkies dataset, ahead of existing methods.

CoGenAV can be plugged directly into mainstream speech recognition models such as Whisper, without modifying or fine-tuning them, to enable visual speech recognition, which lowers the barrier to deployment. It also shows strong noise robustness and data efficiency, substantially reducing training costs and improving the model's practicality and scalability. The CoGenAV code and models are open-sourced on GitHub, HuggingFace, and ModelScope, and the paper is available on arXiv, for researchers and developers to use.
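
As an illustration of this plug-in style of integration, here is a hypothetical sketch in which a small trainable adapter projects CoGenAV-style audio-visual features into the input space of a frozen recognizer. The class name and dimensions are placeholders chosen for this example, not the released CoGenAV or Whisper APIs.

```python
# Hypothetical sketch: a lightweight adapter maps audio-visual features into
# the feature space a frozen ASR head expects. Only the adapter would be
# trained; the recognizer itself stays untouched. Names/dims are assumptions.
import torch
import torch.nn as nn

class LightweightAdapter(nn.Module):
    """Projects fused audio-visual features to the width expected by a frozen
    speech-recognition model (e.g. the encoder width of a Whisper-sized model)."""
    def __init__(self, av_dim: int = 256, asr_dim: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(av_dim, asr_dim),
            nn.GELU(),
            nn.Linear(asr_dim, asr_dim),
        )

    def forward(self, av_feats: torch.Tensor) -> torch.Tensor:
        # av_feats: (batch, time, av_dim) -> (batch, time, asr_dim)
        return self.proj(av_feats)

# Usage: the adapter output is what would be handed to the frozen recognizer.
adapter = LightweightAdapter()
av_feats = torch.randn(2, 100, 256)   # stand-in for CoGenAV encoder output
asr_inputs = adapter(av_feats)
print(asr_inputs.shape)               # torch.Size([2, 100, 1024])
```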

GitHub: https://github.com/HumanMLLM/CoGenAV

arXiv: https://arxiv.org/pdf/2505.03186

HuggingFace: https://huggingface.co/detao/CoGenAV

ModelScope: https://modelscope.cn/models/iic/cogenav