Qwen recently released CoGenAV, which brings the concept of audio-visual synchronization to speech recognition and addresses the challenge of noise interference. Traditional speech recognition performs poorly in noisy environments. CoGenAV takes a different approach: it learns the temporal alignment among audio, visual, and text signals to build a more robust and generalizable speech representation framework, which improves tasks such as speech recognition (VSR/AVSR), speech reconstruction (AVSS/AVSE), and audio-visual synchronization (ASD).