Microsoft has launched MAI-Transcribe-1, a new speech-to-text model that the company says achieves an average word error rate (WER) of just 3.9% across 25 languages and is being billed as the most accurate transcription model available today. It is the third model in Microsoft's self-developed MAI series, following the speech synthesis model MAI-Voice-1 and the image generation model MAI-Image-2.
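
For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a model's transcript and a reference transcript, divided by the number of words in the reference. The sketch below illustrates that standard definition only; it is not Microsoft's evaluation code. The function name, the whitespace tokenization, and the sample sentences are assumptions made for illustration, and real evaluations usually normalize casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref_words)


# Illustrative sentences (not benchmark data): one substitution and one
# deletion against a six-word reference.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.1%}")
```

Under this definition, a 3.9% WER corresponds to roughly four word-level errors per 100 reference words. How Microsoft averages the figure across the 25 languages is not stated in the announcement as reported here.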

According to Microsoft, MAI-Transcribe-1 performed exceptionally well on the industry-standard FLEURS benchmark, achieving the highest transcription accuracy on 11 of the 25 languages, a group of "core languages" that includes English, French, and German. Microsoft also reports a clear advantage over competing models such as OpenAI's Whisper-large-v3 and Google's Gemini 3.1 Flash.

MAI-Transcribe-1 is aimed at multilingual speech transcription scenarios such as meeting notes and media content transcription. The current version does not support advanced features such as real-time transcription or speaker separation (diarization), though Microsoft plans to add these capabilities in future updates. On performance, Microsoft says the model leads in batch transcription, with batch processing 2.5 times faster than the existing Microsoft Azure Fast product.
MAI-Transcribe-1 is now available to enterprises and developers on the Microsoft Foundry platform at $0.36 per hour, and Microsoft states that it is one of the most cost-effective speech-to-text models among cloud service providers today. Microsoft is also bringing MAI-Image-2 and MAI-Voice-1 to Foundry, expanding its self-developed multimodal AI lineup across speech recognition, speech synthesis, and image generation, with the stated aim of giving developers better-performing and more cost-effective options.
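
To put the quoted price in context, the following is a rough cost estimate based on the $0.36 per hour figure. It is a back-of-the-envelope sketch under stated assumptions, not Foundry's billing logic: it assumes the price applies per hour of audio and scales linearly with duration, with no minimum charge, rounding increment, or volume discount. The function and constant names are hypothetical.

```python
PRICE_PER_AUDIO_HOUR_USD = 0.36  # figure quoted in the announcement


def estimate_transcription_cost_usd(
    audio_seconds: float,
    price_per_hour: float = PRICE_PER_AUDIO_HOUR_USD,
) -> float:
    """Estimate cost assuming purely linear, per-audio-hour billing."""
    if audio_seconds < 0:
        raise ValueError("audio duration cannot be negative")
    return audio_seconds / 3600 * price_per_hour


# Example: a 45-minute meeting recording and a 100-hour media archive.
meeting_cost = estimate_transcription_cost_usd(45 * 60)
archive_cost = estimate_transcription_cost_usd(100 * 3600)
print(f"45-minute meeting: ${meeting_cost:.2f}")
print(f"100-hour archive:  ${archive_cost:.2f}")
```

Actual charges depend on Microsoft's published billing terms, which the announcement as reported here does not detail.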

Key Points:
📊 MAI-Transcribe-1 has an average word error rate of just 3.9% across 25 languages, which Microsoft says makes it the most accurate transcription model in the world.
🌍 On the FLEURS benchmark, Microsoft reports the highest accuracy on 11 core languages, including English, French, and German, and an advantage over competitors such as Whisper-large-v3 and Gemini 3.1 Flash.
💰 It costs $0.36 per hour on Microsoft Foundry, which Microsoft positions as one of the most cost-effective speech-to-text options among cloud service providers.


