KaniTTS is a high-speed, high-fidelity text-to-speech model optimized for real-time conversational AI applications. It uses a two-stage pipeline that combines a large language model with an efficient audio codec. On an Nvidia RTX 5080, generating 15 seconds of audio takes only about 1 second, and the model reaches a MOS naturalness score of 4.3/5. It supports multiple languages, including English, Chinese, and Japanese.
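The latency figure quoted above can be restated as a real-time factor (generation time divided by audio duration). The snippet below is only a minimal arithmetic sketch using the two numbers from the description; the function name `real_time_factor` is a hypothetical helper for illustration and is not part of KaniTTS.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Return generation time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


# Figures from the description: about 1 s to generate 15 s of audio
# on an Nvidia RTX 5080. Both values are approximate.
rtf = real_time_factor(1.0, 15.0)
speedup = 1.0 / rtf

print(f"RTF is roughly {rtf:.3f}, about {speedup:.0f}x faster than real time")
```

A real-time factor well below 1 is what makes a model usable in a live conversation, since audio is produced faster than it is played back.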
Audio Processing
Transformers
Multiple Languages