On September 19, Xiaomi announced the open-source release of Xiaomi-MiMo-Audio, its first native end-to-end speech large model, marking a major breakthrough in speech technology. Five years ago, the emergence of GPT-3 ushered in a new era for artificial general intelligence (AGI) in language, but the speech field has long been constrained by its reliance on large-scale annotated data, making the few-shot generalization seen in language models difficult to achieve. Now, Xiaomi-MiMo-Audio, built on an innovative pre-training architecture and roughly one hundred million hours of training data, has for the first time achieved few-shot generalization via In-Context Learning (ICL) in the speech field, with clear "emergence" behavior observed during pre-training.

The Xiaomi-MiMo-Audio model performs strongly on multiple standard evaluation benchmarks. It not only surpasses open-source models of the same parameter count, but also exceeds Google's closed-source speech model Gemini-2.5-Flash on the standard test set of the audio understanding benchmark MMAU, and outperforms OpenAI's closed-source speech model GPT-4o-Audio-Preview on the S2T task of the complex audio reasoning benchmark Big Bench Audio. This result demonstrates Xiaomi's depth in speech technology and points to a new direction for the development of speech AI.


The open-sourced Xiaomi-MiMo-Audio model features multiple innovations and first-time breakthroughs. First, it demonstrates for the first time that scaling lossless speech compression pre-training to one hundred million hours can "emerge" cross-task generalization, manifested as few-shot learning ability, which is regarded as the "GPT-3 moment" for the speech field. Second, Xiaomi is the first to clearly define the objectives of generative speech pre-training, and has open-sourced a complete speech pre-training recipe, including a lossless compression Tokenizer, a new model architecture, training methods, and an evaluation system, ushering in the "LLaMA moment" for the speech field. In addition, Xiaomi-MiMo-Audio is the first open-source model to introduce a thinking process into both speech understanding and speech generation, supporting mixed thinking.

Xiaomi adopted a simple, thorough, and direct open-source approach to accelerate research in the speech field. The release includes the pre-trained model MiMo-Audio-7B-Base and the instruction-tuned model MiMo-Audio-7B-Instruct, along with the Tokenizer model, a technical report, and an evaluation framework. MiMo-Audio-7B-Instruct can switch between non-thinking and thinking modes via the prompt; it offers a high starting point and great potential for reinforcement learning, serving as a new base model for research on speech RL and agentic training. The Tokenizer model has 1.2B parameters, uses a Transformer architecture, balances efficiency and performance, was trained from scratch on over ten million hours of speech data, and supports both audio reconstruction and audio-to-text tasks. The technical report presents the model and training details in full, while the evaluation framework, open-sourced on GitHub, supports more than 10 evaluation tasks.
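To make the mode switch concrete, here is a minimal sketch of how a prompt-controlled thinking toggle could be assembled on the client side. The `build_messages` helper and the system-instruction wording below are hypothetical illustrations, not the official MiMo-Audio API; consult the model card on Hugging Face for the actual prompt format.

```python
# Hypothetical sketch: toggling thinking vs. non-thinking mode via the prompt.
# The helper name and system-instruction strings are illustrative placeholders,
# not the documented MiMo-Audio-7B-Instruct interface.

def build_messages(user_text: str, thinking: bool = True) -> list[dict]:
    """Assemble a chat-style message list; the system instruction that
    toggles the thinking mode here is a placeholder assumption."""
    system = (
        "Think step by step before answering."
        if thinking
        else "Answer directly without a thinking trace."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_text},
    ]

# Non-thinking mode for a latency-sensitive request
msgs = build_messages("Transcribe and summarize this clip.", thinking=False)
```

In practice, the message list would then be passed to the model's chat template and generation call; the point is only that a single prompt-level switch selects the mode, as the release describes.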

Xiaomi stated that open-sourcing Xiaomi-MiMo-Audio will significantly accelerate the alignment of speech large model research with language large model research, laying an important foundation for speech AGI. Xiaomi will continue to open-source its work, and looks forward to moving, together with fellow travelers, toward the "singularity" of speech AI and into a new era of human-computer interaction.

https://huggingface.co/XiaomiMiMo/MiMo-Audio-7B-Instruct