AliTongyi has officially launched Fun-ASR, its new-generation end-to-end large model for speech recognition. By strengthening context awareness and high-precision transcription, the model improves recognition accuracy by more than 15% in vertical industry scenarios such as home decoration and insurance. Test data show that accuracy in the insurance industry rose 18% over the previous generation, while the home decoration and livestock sectors saw gains of 15% to 20%.
Fun-ASR is a speech recognition model driven by large language models. It uses self-developed speech algorithms and Qwen3 supervised fine-tuning, combined with cutting-edge model architectures and text-modality alignment. While retaining its strengths in language processing, it integrates a retrieval-augmented generation (RAG) scheme that supports importing more than 1,000 custom hot words. This feature automatically matches domain-specific hot words, historical documents, and context records to the audio, significantly improving keyword recognition in specific scenarios.
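The announcement does not describe how Fun-ASR performs this retrieval internally. As a rough illustration of the general idea only, the following minimal Python sketch picks, from a large custom hot word list, the few entries most relevant to a piece of context text, using character trigram overlap. The function names (`char_ngrams`, `containment`, `retrieve_hot_words`), the scoring method, and the `min_score` threshold are all assumptions made for this example and are not part of any Fun-ASR API.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def containment(hot_word: str, context_grams: Counter) -> float:
    """Fraction of the hot word's n-grams that also occur in the context."""
    grams = char_ngrams(hot_word)
    total = sum(grams.values())
    if total == 0:
        return 0.0
    shared = sum((grams & context_grams).values())  # multiset intersection
    return shared / total


def retrieve_hot_words(context: str, hot_words: list[str],
                       top_k: int = 3, min_score: float = 0.4) -> list[str]:
    """Return up to top_k hot words whose score reaches min_score (hypothetical helper)."""
    context_grams = char_ngrams(context)
    scored = [(containment(word, context_grams), word) for word in hot_words]
    kept = [(score, word) for score, word in scored if score >= min_score]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties by name
    return [word for _, word in kept[:top_k]]


if __name__ == "__main__":
    # Illustrative hot word list; a real deployment might hold 1,000+ entries.
    hot_words = [
        "term life insurance",
        "policy surrender",
        "ceramic tile grout",
        "swine fever vaccine",
    ]
    context = "customer asked about surrendering the policy and term life options"
    print(retrieve_hot_words(context, hot_words, top_k=2))
```

In a setup like this, only the retrieved subset would be passed to the recognizer as biasing context, which keeps the prompt short even when the full hot word list is large. A production system would more plausibly use learned embeddings than trigram overlap; the simple score here is chosen only to keep the sketch dependency-free.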
To address pain points such as noise interference, language confusion, and hallucinated output, the development team introduced reinforcement learning (RL), which reduces recognition errors through dynamic optimization strategies and substantially improves system stability and reliability. Notably, the model outperforms comparable products on dialects such as Sichuanese, Cantonese, and Hokkien. It also adapts to complex acoustic conditions, including far-field pickup and near-field noise reduction, across settings such as meeting rooms, workstations, supermarkets, and outdoor areas.
Fun-ASR is trained on hundreds of millions of hours of audio data and deeply integrates professional terminology libraries from more than ten fields, including the internet, technology, livestock, and automobiles. This breadth of data gives it a clear edge in vertical industry recognition. In the livestock industry, for example, it can accurately identify key commands amid animal sounds and environmental noise.
The AliTongyi technology team stated that the evolution of Fun-ASR marks a shift in speech recognition technology from general-purpose scenarios toward specialized, scenario-specific applications. As the model is deployed in more industries, its dynamic hot word updates and multimodal interaction capabilities are expected to further improve the efficiency of speech interaction.