Alibaba Cloud has released Qwen3-Omni, billed as the world's first natively end-to-end multi-modal AI model, and has open-sourced it. Qwen3-Omni handles text, image, audio, and video input and provides real-time streaming output, responding quickly in either text or natural speech.
Qwen3-Omni demonstrates strong cross-modal performance across domains. Its text-centric early pre-training followed by mixed multi-modal training gives it robust capabilities in every modality: it is especially strong on audio and video tasks while maintaining high quality on text and images. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves state-of-the-art results on 22, and in areas such as automatic speech recognition and audio understanding its performance is comparable to industry peers like Gemini 2.5 Pro.
Qwen3-Omni supports 119 text languages, 19 speech input languages, and 10 speech output languages, including English, Chinese, French, and German, making it well suited to a global user base. Its architecture combines a Mixture-of-Experts (MoE) design with AuT pre-training for strong general representations, while a multi-codebook design keeps real-time audio and video interaction low-latency, supporting smooth, natural dialogue.
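For a sense of how mixed-modality input looks in practice, here is a minimal inference sketch in Python. It assumes the Hugging Face Transformers integration follows the pattern of earlier Qwen-Omni releases: the class names, the `return_audio` flag, and the chat-template message format are assumptions drawn from that pattern, and the checkpoint id comes from the Hugging Face collection linked below. The GitHub README has the authoritative snippet.

```python
import torch
from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor

# Assumed class names and checkpoint id, patterned after earlier
# Qwen-Omni releases -- verify against the official README.
MODEL_ID = "Qwen/Qwen3-Omni-30B-A3B-Instruct"

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_ID)

# One user turn mixing audio, image, and text, in the chat-template
# format the Qwen-Omni family uses for multimodal messages.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "question.wav"},  # local file or URL
            {"type": "image", "image": "diagram.png"},
            {"type": "text", "text": "Answer the spoken question using the image."},
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# The Talker head can also stream speech; this sketch skips synthesis
# (return_audio is the flag earlier Qwen-Omni models used for that)
# and decodes only the newly generated text tokens.
output_ids = model.generate(**inputs, max_new_tokens=256, return_audio=False)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```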
Alongside Qwen3-Omni, Alibaba Cloud has also released Qwen3-TTS, a text-to-speech model offering 17 voice options. It performs strongly on multiple evaluation benchmarks, surpassing several competitors, particularly in voice stability and speaker similarity.
Qwen-Image-Edit-2509, another new release, adds multi-image support to image editing and significantly improves editing consistency and quality. Beyond single-image edits, it can combine several input images under one instruction, covering more complex editing needs; a usage sketch follows.
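The sketch below shows what a multi-image edit could look like via diffusers. The pipeline class and argument names are assumptions patterned after the diffusers integration published for the original Qwen-Image-Edit; check the model card on Hugging Face for the exact, current snippet.

```python
import torch
from PIL import Image
from diffusers import QwenImageEditPlusPipeline  # assumed class name

# Assumed pipeline for the 2509 revision -- the original Qwen-Image-Edit
# shipped a QwenImageEditPipeline in diffusers, and this follows that pattern.
pipe = QwenImageEditPlusPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit-2509", torch_dtype=torch.bfloat16
).to("cuda")

# Multi-image editing: several inputs are combined under one instruction,
# with the model keeping subjects consistent across the composite.
person = Image.open("person.png").convert("RGB")
product = Image.open("product.png").convert("RGB")

result = pipe(
    image=[person, product],
    prompt="The person holds the product in a bright studio photo.",
    num_inference_steps=40,
).images[0]
result.save("edited.png")
```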
GitHub: https://github.com/QwenLM/Qwen3-Omni
Huggingface: https://huggingface.co/collections/Qwen/qwen3-omni-68d100a86cd0906843ceccbe
Key Points:
🌟 Qwen3-Omni is the world's first native end-to-end multi-modal AI model, supporting unified processing of text, images, audio, and video.
🌐 The model supports 119 text languages, 19 speech input languages, and 10 speech output languages, meeting the multilingual needs of global users.
🖼️ The newly released Qwen-Image-Edit-2509 supports multi-image editing, significantly improving editing consistency and quality.