On September 1st, StepFun officially released Step-Audio2mini, its most capable open-source end-to-end large speech model to date. The model achieves SOTA (state-of-the-art) results on multiple international benchmark datasets and unifies speech understanding, audio reasoning, and generation in a single model. It performs strongly on tasks such as audio understanding, speech recognition, cross-lingual translation, emotion and paralinguistic analysis, and speech dialogue, and is the first to support native speech Tool Calling, enabling operations such as online search. StepFun describes Step-Audio2mini as a model that "hears clearly, thinks clearly, and speaks naturally." It is now available on GitHub, Hugging Face, and other platforms for users to download, try out, and give feedback on.
Step-Audio2mini achieves SOTA results on multiple key benchmarks, performing strongly in audio understanding, speech recognition, translation, and dialogue scenarios. It surpasses open-source end-to-end speech models such as Qwen-Omni and Kimi-Audio, and exceeds GPT-4o Audio on most tasks. On MMAU, a general multimodal audio understanding test set, Step-Audio2mini scores 73.2, ranking first among open-source end-to-end speech models. On URO-Bench, which measures conversational ability, it posts the highest scores among open-source end-to-end speech models on both the basic and professional tracks. In Chinese-English speech translation, it scores 39.3 on CoVoST2 and 29.1 on CVSS, well ahead of GPT-4o Audio and other open-source speech models. In speech recognition, it ranks first in multilingual and dialect recognition, with an average character error rate (CER) of 3.19 on open-source Chinese test sets and an average word error rate (WER) of 3.50 on open-source English test sets, leading other open-source models by more than 15%.
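For reference, CER and WER are standard edit-distance metrics over characters and words respectively. The snippet below is a minimal sketch of how they are typically computed with the open-source jiwer library; the reference/hypothesis sentences are illustrative placeholders, not items from the benchmark test sets above.

# pip install jiwer
import jiwer

# Hypothetical reference/hypothesis pairs, for illustration only.
en_ref = "the weather is nice today"
en_hyp = "the weather is night today"

zh_ref = "今天天气很好"
zh_hyp = "今天天汽很好"

# WER: edit distance over words (standard for space-delimited languages like English).
wer = jiwer.wer(en_ref, en_hyp)

# CER: edit distance over characters (standard for Chinese ASR evaluation).
cer = jiwer.cer(zh_ref, zh_hyp)

print(f"WER = {wer:.2%}")  # 1 substitution over 5 words  -> 20.00%
print(f"CER = {cer:.2%}")  # 1 substitution over 6 chars  -> 16.67%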
Step-Audio2mini addresses the limitations of earlier speech models through its architectural design, combining "deep thinking" with "emotional engagement." It adopts a genuinely end-to-end multimodal architecture that replaces the traditional three-stage ASR+LLM+TTS pipeline with direct conversion from raw audio input to speech output, yielding a simpler architecture, lower latency, and the ability to understand paralinguistic information and non-speech signals. In addition, Step-Audio2mini is the first end-to-end speech model to combine chain-of-thought (CoT) reasoning with reinforcement learning optimization, allowing it to understand, reason about, and respond naturally to paralinguistic and non-speech signals such as emotion, tone, and music. The model also supports external tools such as web search, which helps mitigate hallucination and lets it extend to a wide range of scenarios.
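The structural contrast with a cascaded pipeline can be sketched in code. Everything below is a conceptual illustration under stated assumptions: the component names (run_asr, run_llm, run_tts, HypotheticalE2EModel, web_search) are hypothetical stubs, not Step-Audio2mini's actual API, and only serve to show why the end-to-end path keeps paralinguistic information and how a native speech tool call might be routed.

"""Conceptual sketch (not Step-Audio2mini's real API): cascaded ASR+LLM+TTS
pipeline vs. an end-to-end speech turn that can emit a native tool call."""
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpeechTurn:
    audio_out: bytes                  # generated speech
    tool_call: Optional[dict] = None  # e.g. {"name": "web_search", "arguments": {...}}

# --- Hypothetical stubs standing in for real components ---------------------
def run_asr(audio: bytes) -> str: return "what's the latest audio-model news?"
def run_llm(text: str) -> str: return "Here is a summary ..."
def run_tts(text: str) -> bytes: return b"<synthesized speech>"
def web_search(query: str) -> list: return ["<search result snippets>"]

class HypotheticalE2EModel:
    def generate(self, audio: bytes, tool_results: Optional[list] = None) -> SpeechTurn:
        if tool_results is None:
            # The model decides it needs fresh information and emits a tool call.
            return SpeechTurn(audio_out=b"", tool_call={
                "name": "web_search",
                "arguments": {"query": "latest audio-model news"},
            })
        # With tool results in context, produce the grounded spoken reply.
        return SpeechTurn(audio_out=b"<speech grounded in search results>")

e2e_model = HypotheticalE2EModel()

def cascaded_pipeline(audio_in: bytes) -> bytes:
    """Three-stage pipeline: paralinguistic cues (emotion, tone, background
    sounds) are discarded at the ASR step, since only text reaches the LLM."""
    return run_tts(run_llm(run_asr(audio_in)))

def end_to_end_turn(audio_in: bytes) -> bytes:
    """A single model maps raw audio to speech output and can natively request
    a tool (here, web search) before speaking its final, grounded answer."""
    turn = e2e_model.generate(audio_in)
    if turn.tool_call and turn.tool_call["name"] == "web_search":
        results = web_search(**turn.tool_call["arguments"])
        turn = e2e_model.generate(audio_in, tool_results=results)
    return turn.audio_out

print(cascaded_pipeline(b"<user speech>"))
print(end_to_end_turn(b"<user speech>"))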
Step-Audio2mini's capabilities come through clearly in practical demonstrations. It accurately recognizes natural sounds and skilled voice acting, and can run real-time searches to pull in the latest industry news. It can also control its speaking rate, adapting readily to the needs of different dialogue scenarios. When asked about philosophical dilemmas, it turns abstract questions into simple, concrete methodology, showing strong logical reasoning ability.
GitHub: https://github.com/stepfun-ai/Step-Audio2
Hugging Face: https://huggingface.co/stepfun-ai/Step-Audio-2-mini
ModelScope: https://www.modelscope.cn/models/stepfun-ai/Step-Audio-2-mini
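For those who want to try the model locally, the following is a minimal download sketch assuming only the standard huggingface_hub client; the actual inference entry point, dependencies, and usage examples are documented in the Step-Audio2 GitHub README, which should be treated as authoritative.

# pip install -U huggingface_hub
from huggingface_hub import snapshot_download

# Pull the Step-Audio-2-mini checkpoint to a local directory. How to run
# inference against it is described in the repository's README.
local_dir = snapshot_download(repo_id="stepfun-ai/Step-Audio-2-mini")
print("Checkpoint downloaded to:", local_dir)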