Inworld AI has launched its latest voice model, Real-time TTS-2. Available in research preview through the Inworld API and the Inworld Realtime API, the model aims to change how voice AI conversations work. Earlier speech synthesis models were simply text-to-audio converters. TTS-2 instead listens to the audio of the interaction in real time, perceives the user's tone, rhythm, and emotional state, and uses them to deliver a more natural conversational experience.

The key feature of TTS-2 lies in its closed-loop system architecture. Unlike traditional models, it does not rely solely on text transcriptions but directly receives actual audio from the conversation. This difference allows the model to understand the meaning of the same sentence in different contexts. For example, "Okay, never mind" conveys very different emotions when spoken with a frustrated tone versus a relaxed one. TTS-2 can capture these emotional nuances, enhancing the coherence and authenticity of the conversation.
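The difference between a text-only pipeline and a closed-loop one can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `UserTurn` fields, the thresholds, and the style labels are hypothetical, and a simple hand-written heuristic stands in for whatever TTS-2 actually learns from audio.

```python
from dataclasses import dataclass


@dataclass
class UserTurn:
    transcript: str
    energy: float       # hypothetical loudness cue in [0, 1]
    speech_rate: float  # hypothetical words per second


def style_from_text_only(turn: UserTurn) -> str:
    # A text-only pipeline sees just the transcript, so identical
    # sentences always map to the same delivery style.
    return "neutral"


def style_from_audio(turn: UserTurn) -> str:
    # Toy heuristic (not the real model): loud, fast speech is read
    # as frustration and answered in an apologetic style.
    if turn.energy > 0.7 and turn.speech_rate > 3.5:
        return "apologetic"
    return "relaxed"


# Same words, different delivery.
frustrated = UserTurn("Okay, never mind", energy=0.9, speech_rate=4.2)
calm = UserTurn("Okay, never mind", energy=0.3, speech_rate=2.5)

for turn in (frustrated, calm):
    print(style_from_text_only(turn), style_from_audio(turn))
```

The text-only function cannot tell the two turns apart, while the audio-aware one can, which is the point the "Okay, never mind" example makes.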

The model ships with four features that set it apart. First, "Voice Instructions" lets developers steer how speech is delivered using plain-language prompts at inference time, rather than just selecting fixed emotion tags. Second, "Dialogue Awareness" lets the model understand conversational context, which the closed-loop architecture makes possible. Third, TTS-2 supports cross-language speech recognition and output, so users can switch languages within the same conversation while the voice identity stays consistent. Finally, "Advanced Voice Design" lets developers generate reusable voices from descriptive text, with no audio reference required.
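To make the contrast between fixed emotion tags and free-form voice instructions concrete, here is a minimal sketch of how a request body might be assembled in each style. This is not the Inworld API: the function `build_request`, the field names `voice_id`, `emotion_tag`, and `instruction`, and the tag set are all hypothetical and chosen only for illustration.

```python
from typing import Optional

# Hypothetical closed tag set, standing in for a tag-based TTS interface.
FIXED_EMOTION_TAGS = {"neutral", "happy", "sad", "angry"}


def build_request(text: str, voice_id: str,
                  emotion_tag: Optional[str] = None,
                  instruction: Optional[str] = None) -> dict:
    """Assemble a hypothetical TTS request body (field names are illustrative)."""
    if emotion_tag is not None and instruction is not None:
        raise ValueError("use either emotion_tag or instruction, not both")
    if emotion_tag is not None and emotion_tag not in FIXED_EMOTION_TAGS:
        raise ValueError(f"unknown emotion tag: {emotion_tag!r}")
    payload = {"text": text, "voice_id": voice_id}
    if emotion_tag is not None:
        payload["emotion_tag"] = emotion_tag
    if instruction is not None:
        payload["instruction"] = instruction
    return payload


# Tag-based style: limited to whatever the closed set offers.
tagged = build_request("Your order has shipped.", "demo-voice",
                       emotion_tag="happy")

# Instruction-based style: delivery is described in plain language.
instructed = build_request(
    "Your order has shipped.", "demo-voice",
    instruction="Sound warm and reassuring, and slow down on the last word.",
)
```

A tag such as "wistful" is rejected by the closed set, whereas the same intent can simply be written out as an instruction.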

The release of TTS-2 marks another breakthrough for Inworld AI in voice technology. Beyond high-quality audio output, the model emphasizes contextual awareness and voice consistency to improve the user experience. With these innovations, Inworld AI hopes to stand out in the competitive voice AI market.

Key Points:   

🎤 **Real-time Conversation**: TTS-2 captures users' audio through a closed-loop system, understanding emotions and tone.

🌍 **Multi-language Support**: A single voice identity stays consistent across more than 100 languages, with seamless switching between them.

🛠️ **Flexible Voice Design**: Developers can generate reusable voices through descriptive text without needing additional audio references.