On September 1st, StepFun officially released Step-Audio2mini, its most capable open-source end-to-end large speech model to date. The model achieves SOTA (state-of-the-art) results on multiple international benchmark datasets and unifies speech understanding, audio reasoning, and generation in a single model. It performs strongly on tasks such as audio understanding, speech recognition, cross-lingual translation, emotion and paralinguistic analysis, and speech dialogue, and is the first to support native speech Tool Calling, enabling operations such as online search. StepFun describes Step-Audio2mini as a model that "hears clearly, thinks clearly, and speaks naturally." It is now available on GitHub, Hugging Face, and other platforms for users to download, try out, and give feedback on.
Step-Audio2mini achieves SOTA results on multiple key benchmarks, performing strongly in audio understanding, speech recognition, translation, and dialogue scenarios. It surpasses open-source end-to-end speech models such as Qwen-Omni and Kimi-Audio, and exceeds GPT-4o Audio on most tasks. On MMAU, a general multimodal audio understanding test set, Step-Audio2mini scores 73.2, ranking first among open-source end-to-end speech models. On URO-Bench, which measures conversational ability, it posts the highest scores among open-source end-to-end speech models on both the basic and professional tracks. In Chinese-English speech translation, it scores 39.3 on CoVoST2 and 29.1 on CVSS, well ahead of GPT-4o Audio and other open-source speech models. In speech recognition, it ranks first in multilingual and dialect recognition, with an average character error rate (CER) of 3.19 on open-source Chinese test sets and an average word error rate (WER) of 3.50 on open-source English test sets, leading other open-source models by more than 15%.
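For reference, CER and WER are standard edit-distance metrics over characters and words respectively. The snippet below is a minimal sketch of how they are typically computed with the open-source jiwer library; the reference/hypothesis sentences are illustrative placeholders, not items from the benchmark test sets above.

# pip install jiwer
import jiwer

# Hypothetical reference/hypothesis pairs, for illustration only.
en_ref = "the weather is nice today"
en_hyp = "the weather is night today"

zh_ref = "今天天气很好"
zh_hyp = "今天天汽很好"

# WER: edit distance over words (standard for space-delimited languages like English).
wer = jiwer.wer(en_ref, en_hyp)

# CER: edit distance over characters (standard for Chinese ASR evaluation).
cer = jiwer.cer(zh_ref, zh_hyp)

print(f"WER = {wer:.2%}")  # 1 substitution over 5 words  -> 20.00%
print(f"CER = {cer:.2%}")  # 1 substitution over 6 chars  -> 16.67%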
Step-Audio2mini addresses the limitations of earlier speech models through its architectural design, combining "deep thinking" with "emotional engagement." It adopts a genuinely end-to-end multimodal architecture that replaces the traditional three-stage ASR+LLM+TTS pipeline with direct conversion from raw audio input to speech output, yielding a simpler architecture, lower latency, and the ability to understand paralinguistic information and non-speech signals. In addition, Step-Audio2mini is the first end-to-end speech model to combine chain-of-thought (CoT) reasoning with reinforcement learning optimization, allowing it to understand, reason about, and respond naturally to paralinguistic and non-speech signals such as emotion, tone, and music. The model also supports external tools such as web search, which helps mitigate hallucination and lets it extend to a wide range of scenarios.
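The structural contrast with a cascaded pipeline can be sketched in code. Everything below is a conceptual illustration under stated assumptions: the component names (run_asr, run_llm, run_tts, HypotheticalE2EModel, web_search) are hypothetical stubs, not Step-Audio2mini's actual API, and only serve to show why the end-to-end path keeps paralinguistic information and how a native speech tool call might be routed.

"""Conceptual sketch (not Step-Audio2mini's real API): cascaded ASR+LLM+TTS
pipeline vs. an end-to-end speech turn that can emit a native tool call."""
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpeechTurn:
    audio_out: bytes                  # generated speech
    tool_call: Optional[dict] = None  # e.g. {"name": "web_search", "arguments": {...}}

# --- Hypothetical stubs standing in for real components ---------------------
def run_asr(audio: bytes) -> str: return "what's the latest audio-model news?"
def run_llm(text: str) -> str: return "Here is a summary ..."
def run_tts(text: str) -> bytes: return b"<synthesized speech>"
def web_search(query: str) -> list: return ["<search result snippets>"]

class HypotheticalE2EModel:
    def generate(self, audio: bytes, tool_results: Optional[list] = None) -> SpeechTurn:
        if tool_results is None:
            # The model decides it needs fresh information and emits a tool call.
            return SpeechTurn(audio_out=b"", tool_call={
                "name": "web_search",
                "arguments": {"query": "latest audio-model news"},
            })
        # With tool results in context, produce the grounded spoken reply.
        return SpeechTurn(audio_out=b"<speech grounded in search results>")

e2e_model = HypotheticalE2EModel()

def cascaded_pipeline(audio_in: bytes) -> bytes:
    """Three-stage pipeline: paralinguistic cues (emotion, tone, background
    sounds) are discarded at the ASR step, since only text reaches the LLM."""
    return run_tts(run_llm(run_asr(audio_in)))

def end_to_end_turn(audio_in: bytes) -> bytes:
    """A single model maps raw audio to speech output and can natively request
    a tool (here, web search) before speaking its final, grounded answer."""
    turn = e2e_model.generate(audio_in)
    if turn.tool_call and turn.tool_call["name"] == "web_search":
        results = web_search(**turn.tool_call["arguments"])
        turn = e2e_model.generate(audio_in, tool_results=results)
    return turn.audio_out

print(cascaded_pipeline(b"<user speech>"))
print(end_to_end_turn(b"<user speech>"))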
Step-Audio2mini's capabilities come through clearly in practical demonstrations. It accurately recognizes natural sounds and skilled voice acting, and can run real-time searches to pull in the latest industry news. It can also control its speaking rate, adapting readily to the needs of different dialogue scenarios. When asked about philosophical dilemmas, it turns abstract questions into simple, concrete methodology, showing strong logical reasoning ability.
GitHub: https://github.com/stepfun-ai/Step-Audio2
Hugging Face: https://huggingface.co/stepfun-ai/Step-Audio-2-mini
ModelScope: https://www.modelscope.cn/models/stepfun-ai/Step-Audio-2-mini
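For those who want to try the model locally, the following is a minimal download sketch assuming only the standard huggingface_hub client; the actual inference entry point, dependencies, and usage examples are documented in the Step-Audio2 GitHub README, which should be treated as authoritative.

# pip install -U huggingface_hub
from huggingface_hub import snapshot_download

# Pull the Step-Audio-2-mini checkpoint to a local directory. How to run
# inference against it is described in the repository's README.
local_dir = snapshot_download(repo_id="stepfun-ai/Step-Audio-2-mini")
print("Checkpoint downloaded to:", local_dir)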