AIbase December 9 report: Alibaba's Qwen team today released Qwen3-Omni-Flash-2025-12-01, its new-generation omni-modal large model. The model accepts seamless input of text, images, audio, and video, and generates high-quality text and natural speech as real-time streaming responses. The team claims its voice output approaches human-level naturalness.

Technical Breakthrough: Real-time Streaming Multi-modal Interaction
Qwen3-Omni-Flash adopts a real-time streaming architecture, enabling seamless input and synchronized output of text, images, audio, and video. The model supports interaction in 119 text languages, 19 speech recognition languages, and 10 speech synthesis languages, ensuring accurate responses across multilingual scenarios.
Personalized Experience: System Prompt Customization Opened
The new version fully opens system prompt customization, letting users finely control the model's behavior, including setting character styles such as "sweet girl" or "domineering woman" and adjusting preferences for colloquial phrasing and response length. The model adaptively adjusts speaking speed, pauses, and rhythm to match the text content.
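To make the customization concrete, here is a minimal sketch of how a system prompt could pin down persona, verbosity, and speech style in an OpenAI-compatible chat request. The model identifier, field names, and persona wording are assumptions for illustration; only the feature itself (system-prompt control of character style, colloquialism, and response length) comes from the announcement.

```python
def build_request(user_text: str) -> dict:
    """Build a chat request whose system prompt fixes persona and style.

    The "qwen3-omni-flash" model name and the payload shape are assumed
    (OpenAI-compatible), not taken from official API docs.
    """
    system_prompt = (
        "You are a warm, upbeat assistant. "            # character style
        "Answer colloquially and keep replies short. "  # colloquialism / length
        "When speaking, use a brisk pace with natural pauses."  # speech rhythm
    )
    return {
        "model": "qwen3-omni-flash",  # assumed model identifier
        "stream": True,               # real-time streaming responses
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
    }

request = build_request("Introduce yourself in one sentence.")
```

The key point is that persona and pacing live entirely in the system message, so switching "characters" is a one-string change rather than a model swap.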

Performance Improvement: Comprehensive Benchmark Advancement
Official data shows the new model gaining 5.6 points on logical reasoning (ZebraLogic), 9.3 points on code generation (LiveCodeBench-v6), and 4.7 points on multi-disciplinary visual question answering (MMMU), demonstrating strong multi-modal understanding and analytical capability.
Market Deployment: API Now Available, Affordable Pricing
Qwen3-Omni-Flash is now available via API, with input priced at 1 yuan per million tokens and output at 3 yuan per million tokens. The model is also integrated into Qwen Chat, where a demo supports uploading a 30-second video and generating live on-screen narration in real time.
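The announced pricing makes per-request costs easy to estimate. The sketch below uses only the published rates (1 yuan per million input tokens, 3 yuan per million output tokens); the sample token counts are illustrative, not measured.

```python
# Announced API rates (yuan per million tokens).
INPUT_YUAN_PER_M = 1.0
OUTPUT_YUAN_PER_M = 3.0

def cost_yuan(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one session at the announced rates."""
    return (input_tokens * INPUT_YUAN_PER_M
            + output_tokens * OUTPUT_YUAN_PER_M) / 1_000_000

# Example: a session with 120k input tokens and 40k output tokens.
print(round(cost_yuan(120_000, 40_000), 3))  # → 0.24
```

At these rates even a token-heavy video-narration session costs a fraction of a yuan, which is the basis for the cost argument in the next section.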
Industry Significance: Multi-modal Enters the "Personality" Stage
While most multi-modal models are still competing on how many images they can understand, Alibaba has turned "real-time streaming + personality" directly into an API. For voice- and style-heavy scenarios such as live streaming, short video, and virtual meetings, this pushes the cost of "voice actors + post-production narration" toward zero.