StepFun has officially released its latest open-source end-to-end speech large model, Step-Audio 2 mini. The model has achieved state-of-the-art (SOTA) results on multiple international benchmarks. Step-Audio 2 mini not only offers strong speech understanding and audio generation capabilities, but also, for the first time, unifies audio reasoning and generation in a single model, making it well suited to application scenarios such as speech recognition, cross-lingual translation, and emotion analysis.

One of Step-Audio 2 mini's standout features is its multimodal audio understanding capability. On MMAU, a multimodal audio understanding benchmark, the model ranks first among open-source speech models with a score of 73.2. On URO-Bench, which evaluates conversational ability, Step-Audio 2 mini achieved the highest scores among open-source models on both the basic and the professional track, demonstrating strong conversational understanding and expression.


Step-Audio 2 mini also performs well on Chinese-English translation, scoring 39.3 on the CoVoST 2 evaluation set and 29.1 on CVSS, clearly surpassing GPT-4o Audio as well as other open-source speech models. The model also excels at speech recognition, with a character error rate (CER) of 3.19 on open-source Chinese test sets and a word error rate (WER) of 3.50 on open-source English test sets, leading other open-source models by more than 15%.
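For readers unfamiliar with these metrics, CER and WER are typically computed as the edit (Levenshtein) distance between the model's transcript and the reference, normalized by the reference length, at the character and word level respectively. The following is a minimal, generic sketch of that calculation; it is an illustration of the metric, not Step-Audio 2 mini's evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (single-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (r != h))    # substitution (free if tokens match)
            prev = cur
    return dp[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

if __name__ == "__main__":
    # One substituted word out of six reference words -> WER ≈ 0.167
    print(wer("the cat sat on the mat", "the cat sat on a mat"))
```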


Step-Audio 2 mini's results stem from its architecture design. The model breaks away from the traditional three-stage cascade of ASR (automatic speech recognition), LLM (large language model), and TTS (text-to-speech), converting raw audio input directly into a spoken response, which simplifies the architecture and reduces latency. It also introduces joint optimization that combines Chain-of-Thought (CoT) reasoning with reinforcement learning, allowing it to better understand paralinguistic information such as emotion and intonation and to respond more naturally.
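To make the contrast concrete, the sketch below compares the two designs at a conceptual level. All class and function names are hypothetical placeholders, not Step-Audio 2 mini's actual API; the point is only that the end-to-end path keeps a single model between audio in and audio out.

```python
from dataclasses import dataclass

@dataclass
class Audio:
    samples: bytes  # raw waveform, placeholder

# --- Traditional cascade: ASR -> LLM -> TTS ---------------------------------
def cascaded_pipeline(user_audio: Audio) -> Audio:
    text_in = asr_transcribe(user_audio)   # speech -> text (drops intonation, emotion)
    text_out = llm_respond(text_in)        # text -> text
    return tts_synthesize(text_out)        # text -> speech; each stage adds latency

# --- End-to-end model: raw audio in, spoken response out --------------------
def end_to_end_pipeline(user_audio: Audio) -> Audio:
    # One model consumes audio directly and emits audio, so paralinguistic
    # cues can influence the response all the way through.
    return speech_lm_generate(user_audio)

# Placeholder stubs so the sketch runs; a real system would call actual models.
def asr_transcribe(a: Audio) -> str: return "<transcript>"
def llm_respond(t: str) -> str: return "<reply text>"
def tts_synthesize(t: str) -> Audio: return Audio(b"<synthesized speech>")
def speech_lm_generate(a: Audio) -> Audio: return Audio(b"<spoken reply>")

if __name__ == "__main__":
    print(end_to_end_pipeline(Audio(b"<user speech>")))
```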

Notably, Step-Audio 2 mini also supports audio knowledge enhancement: it can call external tools to perform online searches, which helps address the hallucination problem of traditional models. This not only makes the model more practical but also broadens its application potential across scenarios.
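The following is a minimal, hypothetical sketch of what such tool-augmented generation looks like in general: the model emits a search request, the application executes it, and the retrieved snippets are fed back so the final answer can be grounded in fresh information. The function names (model_step, web_search) are illustrative and do not reflect Step-Audio 2 mini's actual interface.

```python
def model_step(history):
    """Stand-in for the model: asks to search once, then answers using the results."""
    if not any(turn["role"] == "tool" for turn in history):
        return {"type": "tool_call", "tool": "web_search", "query": "<query derived from the question>"}
    return {"type": "final", "answer": "<spoken answer grounded in search results>"}

def web_search(query):
    """Stub for an external search API."""
    return ["<snippet 1>", "<snippet 2>"]

def answer_with_tools(user_audio):
    history = [{"role": "user", "content": user_audio}]
    while True:
        step = model_step(history)
        if step["type"] == "tool_call":
            # Execute the requested tool and feed the results back to the model.
            history.append({"role": "tool", "content": web_search(step["query"])})
        else:
            return step["answer"]

print(answer_with_tools("<audio question about a recent event>"))
```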

Step-Audio 2 mini is now available on GitHub and Hugging Face. Developers are welcome to try it out and contribute code!
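For those who want to get started, the snippet below fetches a model snapshot with the huggingface_hub library. The repo id used here is an assumption about where the checkpoint is published; check the official model card for the exact name.

```python
from huggingface_hub import snapshot_download

# Repo id is assumed, not confirmed here; verify it on the official Hugging Face page.
local_dir = snapshot_download(repo_id="stepfun-ai/Step-Audio-2-mini")
print("Model files downloaded to:", local_dir)
```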