Today, Tongyi Lab officially announced the release of two large speech models that support "FreeStyle" instruction-based generation: Fun-CosyVoice3.5 and Fun-AudioGen-VD. The release marks a shift in speech-generation technology from the traditional paradigm of preset tags to a new one driven by natural-language instructions, delivering the deeply interactive experience of "generating any speech from a single sentence."
On the technical side, Fun-CosyVoice3.5 focuses on multilingual voice cloning and fine-grained expression, adding support for four new languages including Thai and Indonesian. By introducing the DiffRO and GRPO reinforcement-learning techniques, the model markedly improves prosody and voice similarity; its error rate on rare characters drops from 15.2% to 5.3%, and first-packet latency is cut by 35%. Its counterpart, Fun-AudioGen-VD, targets sound design and scene modeling, supporting precise instruction-based control over gender, emotion, and spatial acoustics, so it can simulate complex scenes that blend characters with background sound, from a "crazy villain" to a "noisy café."
From an industry perspective, Tongyi Lab's move elevates speech generation from a simple conversion tool into a creative tool. This descriptive, programmable mode of digital expression directly empowers fields such as film, gaming, and AI avatars, lowering content-creation costs while greatly expanding the semantic richness of human-computer interaction.
API reference: https://help.aliyun.com/zh/model-studio/text-to-speech
Documentation: https://help.aliyun.com/zh/model-studio/cosyvoice-clone-api