At the TechCrunch Disrupt 2025 conference, Mati Staniszewski, co-founder and CEO of AI voice company ElevenLabs, made a striking prediction: AI voice models will become "commoditized" within the next two to three years. While models remain a core competitive advantage in the short term, he argued, the performance gaps between them will narrow over time, especially for mainstream languages and general-purpose voice styles.

Image source note: The image is AI-generated, and the licensing service provider is Midjourney
Short-term: Models, Long-term: Products
Asked why the company invests so heavily in R&D if models will eventually become homogeneous, Staniszewski was candid: "Today, models are still the biggest technical barrier. If AI voice sounds unnatural or stilted, user experience is out of the question." He pointed out that ElevenLabs' breakthroughs in model architecture (such as emotional expression and multilingual prosody modeling) are key to its current lead.
But the company is already laying the groundwork for a post-model era. Staniszewski emphasized that ElevenLabs' long-term strategy is not merely to be a "model supplier" but to build a complete "AI + product" experience. Just as Apple defined the smartphone through tight hardware-software integration, ElevenLabs aims to use its in-house models as an engine for high-value application scenarios, building a real moat.
Multi-modal Integration Becomes the Next Battlefield
Looking ahead one to two years, Staniszewski predicts that single-modal voice models will accelerate toward multimodal integration. "You will generate audio and video at the same time, or dynamically link large language models and voice engines during conversations." He cited Google's recently released Veo 3 video generation model as evidence that cross-modal collaboration is becoming a new technological frontier.
To this end, ElevenLabs is actively pursuing partnerships with third-party model providers and open-source communities, exploring how to embed its audio capabilities into the broader AI ecosystem: for instance, combining ElevenLabs' speech synthesis with visual generation and LLM reasoning to create immersive virtual humans, intelligent customer service, or interactive entertainment experiences.
Commoditization ≠ No Value, But a Shift in Value Focus
Staniszewski does not see model commoditization as a sign of industry decline, but as a shift in value from underlying technology to application innovation. He explained: "In the future, companies will choose different models based on specific scenarios: one for customer service, another for game voice acting, and yet another for educational explanations. Reliability, scalability, and scenario fit will be more important than just 'best sound quality.'"
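The scenario-based model selection Staniszewski describes can be sketched as a simple routing table. This is an illustrative sketch only; the model names below are invented placeholders, not real product identifiers.

```python
# Hypothetical scenario-to-model router illustrating per-use-case selection.
# All model names are invented placeholders, not real ElevenLabs identifiers.

SCENARIO_MODELS = {
    "customer_service": "voice-model-stable-v1",    # favors reliability and latency
    "game_voice_acting": "voice-model-expressive",  # favors emotional range
    "education": "voice-model-clear-narration",     # favors clarity and pacing
}

def pick_model(scenario: str) -> str:
    """Return the model suited to a scenario, falling back to a general default."""
    return SCENARIO_MODELS.get(scenario, "voice-model-general")

print(pick_model("game_voice_acting"))  # voice-model-expressive
print(pick_model("podcast_intro"))      # voice-model-general (fallback)
```

In practice such a router would also weigh cost, latency budgets, and language support, which is exactly the "scenario fit over raw sound quality" trade-off Staniszewski describes.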
Accordingly, ElevenLabs is strengthening its API platform, developer toolchain, and industry solutions in parallel, so that customers can not only obtain high-quality voice but also integrate it quickly into real business workflows.
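As a concrete illustration of that API-first integration path, the sketch below assembles a request for ElevenLabs' public text-to-speech endpoint (`POST /v1/text-to-speech/{voice_id}`). It only constructs the request rather than sending it; the voice ID, model ID, and voice settings are placeholder assumptions that should be checked against the current API reference.

```python
# Build (but do not send) a request for ElevenLabs' text-to-speech API.
# The endpoint shape follows ElevenLabs' public docs; the voice_id, model_id,
# and settings below are placeholders, verify them against the current API
# reference before use.

def build_tts_request(text: str, voice_id: str, api_key: str) -> dict:
    """Assemble the URL, headers, and JSON body for a TTS call."""
    return {
        "url": f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        "headers": {
            "xi-api-key": api_key,             # API-key auth header
            "Content-Type": "application/json",
        },
        "json": {
            "text": text,
            "model_id": "eleven_multilingual_v2",  # assumed model choice
            "voice_settings": {"stability": 0.5, "similarity_boost": 0.75},
        },
    }

req = build_tts_request("Hello, world.", "VOICE_ID_HERE", "API_KEY_HERE")
print(req["url"])
# To actually send it, something like:
#   requests.post(req["url"], headers=req["headers"], json=req["json"])
```

Separating request construction from transport like this also makes the integration easy to test without network access.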
Conclusion: Be the "Voice Infrastructure" of the AI Era
As voice AI moves from showy demos to practical deployment, ElevenLabs' choice is clear and pragmatic: focus on models in the short term, and build out products for the long run. As industry consensus converges on "Model-as-a-Commodity," the real winners may not be the companies with the most parameters, but those that best understand users and can embed AI seamlessly into human interaction scenarios.
