Stability AI has officially open-sourced Stable Audio Open Small, a 341-million-parameter text-to-audio generation model optimized for mobile devices. The lightweight model runs locally on Arm CPUs and generates high-quality stereo audio, marking a significant step for AI audio generation toward edge computing and mobile devices.
Technical Highlights: Lightweight and Efficient, Local Generation on Mobile Devices
Stable Audio Open Small is based on Stability AI's previously released Stable Audio Open model. Through extensive optimization, the parameter count was reduced from 1.1 billion to 341 million, significantly lowering computational requirements. With support from Arm's KleidiAI library, the model can generate up to 11 seconds of 44.1kHz stereo audio in under 8 seconds on a smartphone, with no cloud processing required, making it suitable for offline scenarios.
The model is a latent diffusion model (LDM) that pairs T5 text embeddings with a transformer-based diffusion (DiT) backbone. It can generate sound effects, drum beats, instrumental phrases, or ambient sounds from simple English text prompts such as "128 BPM electronic drum loop" or "the sound of waves hitting the shore." According to AIbase tests, the model produces detailed short audio clips and is especially well suited to sound design and music production.
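For developers who want to experiment, the weights are distributed for Stability AI's open-source stable-audio-tools library. The following is a minimal sketch based on that library's generation API as documented for the earlier Stable Audio Open release; the prompt, step count, and guidance scale are illustrative assumptions rather than official defaults, so consult the Hugging Face model card for the recommended settings.

```python
import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the checkpoint and its config from Hugging Face
model, model_config = get_pretrained_model("stabilityai/stable-audio-open-small")
sample_rate = model_config["sample_rate"]   # 44.1 kHz
sample_size = model_config["sample_size"]
model = model.to(device)

# Text prompt plus timing conditioning (the model generates clips of up to ~11 s)
conditioning = [{
    "prompt": "128 BPM electronic drum loop",
    "seconds_total": 11,
}]

# Few-step generation; steps and cfg_scale here are assumptions, not official defaults
output = generate_diffusion_cond(
    model,
    steps=8,
    cfg_scale=1.0,
    conditioning=conditioning,
    sample_size=sample_size,
    device=device,
)

# Collapse the batch dimension, peak-normalize, and save a stereo WAV file
output = rearrange(output, "b d n -> d (b n)")
output = output.to(torch.float32)
output = (output / output.abs().max()).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
torchaudio.save("drum_loop.wav", output, sample_rate)
```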
Open Source and Licensing: Empowering Developers and Creators
Stable Audio Open Small is released under the Stability AI Community License, which is free for researchers, individual users, and companies with annual revenue below $1 million; model weights and code are available on Hugging Face and GitHub. Companies above that revenue threshold must purchase an enterprise license, which keeps the technology commercially sustainable. This tiered licensing strategy lowers the barrier to entry and encourages developers worldwide to explore audio generation applications.
In addition, all of the model's training data comes from royalty-free audio on Freesound and the Free Music Archive, ensuring copyright compliance and avoiding the licensing risks associated with the copyrighted content used by competitors such as Suno and Udio.
Performance and Innovation: ARC Post-Training Enhances Efficiency
Stable Audio Open Small introduces an adversarial relativistic-contrastive (ARC) post-training method, which improves generation speed and prompt adherence without traditional distillation or classifier-free guidance. It combines a relativistic adversarial loss with a contrastive discriminator loss. Research shows that the model generates 12 seconds of audio in just 75 milliseconds on an H100 GPU, and in about 7 seconds on a mobile device, achieving a CLAP conditional diversity score of 0.41, the best among comparable models.
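To give a rough intuition for how the two terms fit together: the relativistic term scores a generated clip against a paired real clip under a prompt-conditioned discriminator, while the contrastive term trains that discriminator to rank matched (audio, prompt) pairs above mismatched ones. The sketch below is an illustrative assumption only; the discriminator interface and the softplus form are not Stability AI's published implementation.

```python
import torch
import torch.nn.functional as F

def relativistic_generator_loss(disc, real_audio, fake_audio, prompt_emb):
    """Illustrative relativistic adversarial term (assumed form): the generator
    is rewarded when its output out-scores the paired real clip under a
    prompt-conditioned discriminator `disc(audio, prompt_emb) -> score`."""
    return F.softplus(disc(real_audio, prompt_emb) - disc(fake_audio, prompt_emb)).mean()

def contrastive_discriminator_loss(disc, audio, prompt_emb):
    """Illustrative contrastive term (assumed form): the discriminator should
    score each clip higher with its own prompt than with prompts shuffled
    within the batch, tying generation quality to the text condition."""
    mismatched = prompt_emb[torch.randperm(prompt_emb.shape[0])]
    return F.softplus(disc(audio, mismatched) - disc(audio, prompt_emb)).mean()
```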
In subjective testing, the model received high scores for diversity (4.4), audio quality (4.2), and prompt adherence (4.2), demonstrating strong performance when generating sound effects and rhythmic material. Its ping-pong sampling technique further optimizes few-step inference, balancing speed and quality.
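Ping-pong sampling is generally described as alternating between denoising and re-noising: at every step the network jumps all the way to a clean estimate, which is then pushed back to the next, lower noise level. Below is a minimal sketch of that loop under stated assumptions; `denoise(x, sigma)` is a hypothetical function returning the model's clean-signal estimate, and the noise schedule values are illustrative, not Stability AI's exact sampler.

```python
import torch

def pingpong_sample(denoise, sigmas, shape, device="cpu"):
    """Minimal sketch of few-step ping-pong sampling (general idea only)."""
    x = sigmas[0] * torch.randn(shape, device=device)             # start from pure noise
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x0 = denoise(x, sigma)                                    # "ping": full denoise
        x = x0 + sigma_next * torch.randn(shape, device=device)   # "pong": re-noise to next level
    return x

# Illustrative decreasing noise schedule ending at zero noise
sigmas = [80.0, 40.0, 20.0, 10.0, 5.0, 2.0, 0.5, 0.0]
```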
Industry Impact: Driving Mobile AI and Creative Democratization
The release of Stable Audio Open Small marks a shift in AI audio generation toward mobile devices and edge computing. Unlike competitors that rely on cloud processing, the model's ability to run offline makes it practical for mobile scenarios such as real-time audio generation, and because it targets the Arm CPUs found in the vast majority of smartphones, it can in principle reach around 99% of smartphone users globally. AIbase analysis suggests that this kind of accessibility will reshape the audio creation ecosystem, enabling ordinary users to participate in professional-level sound design.
However, the model also has limitations: it only supports English prompts, performs weakly on non-Western music styles, and cannot generate realistic vocals or full songs. Stability AI stated that future improvements will focus on multilingual support and musical style diversity to enhance global applicability.
Project: https://huggingface.co/stabilityai/stable-audio-open-small