StepFun has officially released its latest open-source end-to-end speech large model, Step-Audio 2 mini. The model has achieved state-of-the-art (SOTA) results on multiple international benchmarks. Step-Audio 2 mini not only offers strong speech understanding and audio generation capabilities, but also, for the first time, unifies audio reasoning and generation in a single model, making it well suited to application scenarios such as speech recognition, cross-lingual translation, and emotion analysis.

One of Step-Audio 2 mini's standout features is its multimodal audio understanding capability. On MMAU, a multimodal audio understanding benchmark, the model ranks first among open-source speech models with a score of 73.2. On URO-Bench, which evaluates conversational ability, Step-Audio 2 mini achieved the highest scores among open-source models on both the basic and the professional track, demonstrating strong conversational understanding and expression.


Step-Audio 2 mini also performs well on Chinese-English translation, scoring 39.3 on the CoVoST 2 evaluation set and 29.1 on CVSS, clearly surpassing GPT-4o Audio as well as other open-source speech models. The model also excels at speech recognition, with a character error rate (CER) of 3.19 on open-source Chinese test sets and a word error rate (WER) of 3.50 on open-source English test sets, leading other open-source models by more than 15%.
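For readers unfamiliar with these metrics, CER and WER are typically computed as the edit (Levenshtein) distance between the model's transcript and the reference, normalized by the reference length, at the character and word level respectively. The following is a minimal, generic sketch of that calculation; it is an illustration of the metric, not Step-Audio 2 mini's evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (single-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (r != h))    # substitution (free if tokens match)
            prev = cur
    return dp[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

if __name__ == "__main__":
    # One substituted word out of six reference words -> WER ≈ 0.167
    print(wer("the cat sat on the mat", "the cat sat on a mat"))
```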


Step-Audio 2 mini's results stem from its architecture design. The model breaks away from the traditional three-stage cascade of ASR (automatic speech recognition), LLM (large language model), and TTS (text-to-speech), converting raw audio input directly into a spoken response, which simplifies the architecture and reduces latency. It also introduces joint optimization that combines Chain-of-Thought (CoT) reasoning with reinforcement learning, allowing it to better understand paralinguistic information such as emotion and intonation and to respond more naturally.
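To make the contrast concrete, the sketch below compares the two designs at a conceptual level. All class and function names are hypothetical placeholders, not Step-Audio 2 mini's actual API; the point is only that the end-to-end path keeps a single model between audio in and audio out.

```python
from dataclasses import dataclass

@dataclass
class Audio:
    samples: bytes  # raw waveform, placeholder

# --- Traditional cascade: ASR -> LLM -> TTS ---------------------------------
def cascaded_pipeline(user_audio: Audio) -> Audio:
    text_in = asr_transcribe(user_audio)   # speech -> text (drops intonation, emotion)
    text_out = llm_respond(text_in)        # text -> text
    return tts_synthesize(text_out)        # text -> speech; each stage adds latency

# --- End-to-end model: raw audio in, spoken response out --------------------
def end_to_end_pipeline(user_audio: Audio) -> Audio:
    # One model consumes audio directly and emits audio, so paralinguistic
    # cues can influence the response all the way through.
    return speech_lm_generate(user_audio)

# Placeholder stubs so the sketch runs; a real system would call actual models.
def asr_transcribe(a: Audio) -> str: return "<transcript>"
def llm_respond(t: str) -> str: return "<reply text>"
def tts_synthesize(t: str) -> Audio: return Audio(b"<synthesized speech>")
def speech_lm_generate(a: Audio) -> Audio: return Audio(b"<spoken reply>")

if __name__ == "__main__":
    print(end_to_end_pipeline(Audio(b"<user speech>")))
```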

Notably, Step-Audio 2 mini also supports audio knowledge enhancement: it can call external tools to perform online searches, which helps address the hallucination problem of traditional models. This not only makes the model more practical but also broadens its application potential across scenarios.
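The following is a minimal, hypothetical sketch of what such tool-augmented generation looks like in general: the model emits a search request, the application executes it, and the retrieved snippets are fed back so the final answer can be grounded in fresh information. The function names (model_step, web_search) are illustrative and do not reflect Step-Audio 2 mini's actual interface.

```python
def model_step(history):
    """Stand-in for the model: asks to search once, then answers using the results."""
    if not any(turn["role"] == "tool" for turn in history):
        return {"type": "tool_call", "tool": "web_search", "query": "<query derived from the question>"}
    return {"type": "final", "answer": "<spoken answer grounded in search results>"}

def web_search(query):
    """Stub for an external search API."""
    return ["<snippet 1>", "<snippet 2>"]

def answer_with_tools(user_audio):
    history = [{"role": "user", "content": user_audio}]
    while True:
        step = model_step(history)
        if step["type"] == "tool_call":
            # Execute the requested tool and feed the results back to the model.
            history.append({"role": "tool", "content": web_search(step["query"])})
        else:
            return step["answer"]

print(answer_with_tools("<audio question about a recent event>"))
```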

Step-Audio 2 mini is now available on GitHub and Hugging Face. Developers are welcome to try it out and contribute code!
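For those who want to get started, the snippet below fetches a model snapshot with the huggingface_hub library. The repo id used here is an assumption about where the checkpoint is published; check the official model card for the exact name.

```python
from huggingface_hub import snapshot_download

# Repo id is assumed, not confirmed here; verify it on the official Hugging Face page.
local_dir = snapshot_download(repo_id="stepfun-ai/Step-Audio-2-mini")
print("Model files downloaded to:", local_dir)
```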