A two-person startup called Nari Labs has released Dia, a 1.6-billion-parameter text-to-speech (TTS) model designed to generate natural-sounding conversations directly from text prompts. Co-founder Toby Kim claims Dia outperforms proprietary offerings such as ElevenLabs, the AI podcast-generation feature of Google's NotebookLM, and potentially even OpenAI's recently released gpt-4o-mini-tts.
Kim stated on X (formerly Twitter) that Dia's quality rivals NotebookLM's podcast functionality and surpasses ElevenLabs Studio and Sesame's open models. He revealed the model was built with "zero funding" and emphasized that the team were not AI experts at the outset, launching the project out of a love for NotebookLM's podcast feature. According to Kim, they tried every TTS API on the market and found none sufficiently natural. He also expressed gratitude to Google for allowing them to use its Tensor Processing Unit (TPU) chips to train Dia.
Currently, Dia's code and weights are open-sourced on Hugging Face and GitHub for users to download and deploy locally. Individual users can also try it online via a Hugging Face Space.
Advanced Controls and Enhanced Customization
Dia supports nuanced features including emotional intonation, speaker labels, and non-verbal audio cues such as (laugh), (cough), and (clear throat), all controlled purely through text. Nari Labs' examples demonstrate Dia's ability to correctly interpret these labels, a feature often unreliable in other models. The model currently supports English only, and the voice varies on each run unless the user fixes the generation seed or provides an audio prompt for voice cloning.
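Since speakers and non-verbal cues are expressed directly in the input text, a dialogue script can be assembled with ordinary string handling. The sketch below assumes the [S1]/[S2] speaker-tag and parenthesized-cue syntax shown in Nari Labs' published examples; verify the exact syntax against the official README before use.

```python
# Sketch: building a Dia-style dialogue prompt as plain text.
# The [S1]/[S2] tags and cues like (laughs) are assumptions taken
# from Nari Labs' examples, not a documented formal grammar.

def build_prompt(turns):
    """Join (speaker, line) pairs into a single text prompt."""
    return " ".join(f"[{speaker}] {line}" for speaker, line in turns)

prompt = build_prompt([
    ("S1", "Did you hear about the new open TTS model?"),
    ("S2", "I did! (laughs) Only two people built it."),
    ("S1", "(clears throat) Let's try a demo."),
])
print(prompt)
```

The resulting string would be passed to the model as-is; because the cues are ordinary text, no separate markup or SSML layer is needed.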
Nari Labs provides comparison examples on its website showcasing Dia's advantages over ElevenLabs Studio and Sesame CSM-1B in handling natural rhythm, non-verbal expressions, multi-emotional dialogues, complex rhythmic content, and maintaining voice style through audio prompts. Nari Labs notes that Sesame's demo may have used a larger internal version of the model than the released CSM-1B.
Model Access and Technical Specifications
Developers can obtain Dia from Nari Labs' GitHub repository and Hugging Face model page. The model runs on PyTorch 2.0+ and CUDA 12.6, requiring approximately 10GB of VRAM. Nari Labs plans to offer CPU support and quantized versions in the future.
Dia is distributed under the permissive Apache 2.0 open-source license, which permits commercial use. Nari Labs emphasizes a prohibition against unethical use and encourages responsible experimentation. The project's development was supported by the Google TPU Research Cloud, Hugging Face's ZeroGPU grant program, and related prior research. Despite being a team of only two engineers, Nari Labs actively invites community contributions.