Volcano Engine has officially launched Doubao Speech Recognition Model 2.0 (Doubao-Seed-ASR-2.0). The upgraded model offers significantly stronger inference capabilities, recognizes multiple languages accurately, and can use visual information during recognition, marking another major step forward for speech recognition technology.
According to reports, Doubao Speech Recognition Model 2.0 builds on the previous version's high-performance audio encoder with 2 billion parameters and focuses on optimization for complex scenarios. The model is trained specifically on elements that are hard to recognize, such as proper nouns, personal names, place names, and homophones, with the aim of improving accuracy across application scenarios. Its inference capabilities are based on a PPO scheme, which enables precise recognition through a deep understanding of context, without relying on historical records of the target words.

Notably, the upgrade gives Doubao Speech Recognition Model 2.0 multimodal understanding capabilities, allowing it to analyze text and visual information at the same time. After a user sends an image, the model can take the image content into account during speech recognition and so understand the user's intent more accurately. For example, when a user describes an image containing a skateboard, traditional models might mistakenly recognize "slid chicken" as "funny" (the two terms are near-homophones in the original Chinese), while the Doubao model can determine from its analysis of the image that the correct term is indeed "slid chicken," avoiding the recognition error.
In addition, Doubao Speech Recognition Model 2.0 supports accurate recognition of 13 foreign languages, including Japanese, Korean, German, and French. This multilingual support extends its use to cross-language application scenarios and improves the interaction experience for users worldwide.
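The idea of letting image content settle a choice between sound-alike transcripts can be illustrated with a toy rescoring step. The sketch below is not the Doubao model's method, which has not been described at this level of detail; the function name `rescore`, the scores, and the image labels are all hypothetical, and the example only shows how context terms could tip the balance between two candidates with similar acoustic scores.

```python
def rescore(candidates, context_terms, boost=1.0):
    """Return the candidate text with the best score after a context bonus.

    candidates    -- list of (text, acoustic_score) pairs; higher is better
    context_terms -- set of lowercase words taken from side context,
                     e.g. labels detected in an accompanying image
    boost         -- bonus added for each word that also appears in context
    """
    def adjusted(item):
        text, acoustic_score = item
        overlap = len(set(text.lower().split()) & context_terms)
        return acoustic_score + boost * overlap

    return max(candidates, key=adjusted)[0]


# Hypothetical data: two sound-alike hypotheses with close acoustic scores.
candidates = [("funny", -1.0), ("slid chicken", -1.2)]
image_labels = {"chicken", "dish"}  # assumed output of an image tagger

best_with_image = rescore(candidates, image_labels)
best_without_image = rescore(candidates, set())
```

With the image labels present, the second candidate gains a bonus for the shared word and is selected; with no context, the choice falls back to the acoustic score alone.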

Volcano Engine stated that Doubao Speech Recognition Model 2.0 is now available in the Volcano Ark (Fangzhou) Experience Center and is offered through API services for external access, so that enterprises and developers can integrate the technology easily. Going forward, Volcano Engine will continue to evolve the model, aiming for more accurate speech-to-text services in multimodal and multi-scenario environments.
The release of Doubao Speech Recognition Model 2.0 demonstrates Volcano Engine's continued innovation and technical strength in artificial intelligence, and it is expected to have a positive impact on industry standards and user experience.



