【AIbase Report】Voice generation company Fish Audio has officially released its upgraded S1 voice cloning model, which the company says brings significant gains in emotional expression and realism. The new model generates human-like speech with rich emotion, rhythm, and tonal variation, closely reproducing the subtle nuances of human speech.

According to the company, users need to provide only about 10 seconds of sample audio for S1 to clone a voice. The clone preserves the original accent, tone, and rhythm, reproduces the speaker's habits and emotional characteristics, and produces output that is almost indistinguishable from a real person. Compared with the internationally known ElevenLabs, Fish Audio's voice cloning service costs about one-sixth as much, giving it a clear advantage in balancing cost and performance.

The Fish Audio S1 API launched at the same time, significantly improving the real-time voice generation experience. Its first-frame latency (reported as TTFT) is under 500 milliseconds, so playback of a sentence can begin within half a second. The API also supports streaming for both input and output, enabling natural interaction in which text is read aloud as it arrives, and it allows users to clone an unlimited number of voices and switch freely between them.
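To make the latency figure concrete, the following is a minimal sketch of how first-chunk latency could be measured on any streaming audio source. It does not call the Fish Audio API: `fake_tts_stream` is a hypothetical stand-in that simulates a streaming service, and the function names, delay, and chunk size are all assumptions made for illustration.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_first_chunk_latency(chunks: Iterable[bytes]) -> Tuple[float, int]:
    """Consume a stream of audio chunks.

    Returns (seconds until the first chunk arrived, total bytes received).
    """
    start = time.perf_counter()
    first_latency = None
    total_bytes = 0
    for chunk in chunks:
        if first_latency is None:
            # The first chunk is the earliest moment playback could begin.
            first_latency = time.perf_counter() - start
        total_bytes += len(chunk)
    if first_latency is None:
        raise ValueError("stream produced no audio chunks")
    return first_latency, total_bytes


def fake_tts_stream(
    text_pieces: Iterable[str],
    startup_delay_s: float = 0.05,
    chunk_size: int = 320,
) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS service (not a real API).

    Yields one silent audio chunk per incoming text piece, after a single
    simulated startup delay before the first chunk.
    """
    first = True
    for _piece in text_pieces:
        if first:
            time.sleep(startup_delay_s)  # simulated model warm-up
            first = False
        yield bytes(chunk_size)  # placeholder audio: zero-filled bytes


if __name__ == "__main__":
    latency, size = measure_first_chunk_latency(
        fake_tts_stream(["Hello, ", "world."])
    )
    print(f"first chunk after {latency * 1000:.0f} ms, {size} bytes total")
```

Against a real streaming endpoint, the same measurement function could wrap the response's chunk iterator, and the reported sub-500-millisecond figure would correspond to the first value it returns.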

Industry experts believe the Fish Audio S1 upgrade marks a shift in voice cloning technology from "usable" to "perceptible." Its high fidelity and low latency are expected to accelerate the adoption of AI voice in virtual humans, smart assistants, content creation, and dubbing.