Stability AI and Arm have jointly released a compact text-to-audio model named "Stable Audio Open Small." The model is optimized to run on mobile devices such as smartphones, where it can generate high-quality stereo audio clips of up to 11 seconds in roughly 7 seconds.
This breakthrough is based on the "Adversarial Relativistic-Contrastive" (ARC) technique developed with researchers at UC Berkeley. On high-end hardware such as Nvidia H100 GPUs, the model is even faster, generating 44 kHz stereo audio in just 75 milliseconds and approaching real-time audio synthesis.
Compared with the original Stable Audio Open released last year, which has 1.1 billion parameters, this streamlined version uses only 341 million parameters, significantly reducing computational requirements so it can run smoothly on consumer-grade hardware. It marks the first major result since Stability AI and Arm announced their collaboration in March of this year.
To achieve smartphone-level performance, the development team thoroughly revamped the model architecture, restructuring it into three core components: an autoencoder for compressing audio data, an embedding module for interpreting text prompts, and a diffusion model for generating the final audio output.
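The three-stage flow described above, from text prompt to audio waveform, can be sketched as a minimal toy pipeline. All component names, dimensions, and the denoising loop below are illustrative assumptions for clarity, not Stability AI's actual implementation:

```python
import numpy as np

LATENT_DIM = 64       # assumed latent channel count
EMBED_DIM = 128       # assumed text-embedding size
SAMPLE_RATE = 44_100  # output sample rate reported for the model
HOP = 1024            # assumed samples per latent frame

def embed_text(prompt: str) -> np.ndarray:
    """Stand-in for the embedding module: hashes tokens into a fixed vector."""
    vec = np.zeros(EMBED_DIM)
    for tok in prompt.lower().split():
        vec[hash(tok) % EMBED_DIM] += 1.0
    return vec

def diffusion_sample(cond: np.ndarray, n_frames: int, steps: int = 8) -> np.ndarray:
    """Toy denoising loop: starts from noise and iteratively refines the
    latent toward the text conditioning signal."""
    rng = np.random.default_rng(0)
    latent = rng.standard_normal((LATENT_DIM, n_frames))
    for _ in range(steps):
        latent = 0.9 * latent + 0.1 * cond[:LATENT_DIM, None]
    return latent

def decode_audio(latent: np.ndarray) -> np.ndarray:
    """Stand-in autoencoder decoder: upsamples latent frames to stereo audio."""
    mono = np.repeat(latent.mean(axis=0), HOP)
    return np.stack([mono, mono])  # shape (2, samples): stereo

def generate(prompt: str, seconds: float) -> np.ndarray:
    cond = embed_text(prompt)                        # 1. text -> embedding
    n_frames = int(seconds * SAMPLE_RATE / HOP)
    latent = diffusion_sample(cond, n_frames)        # 2. diffusion in latent space
    return decode_audio(latent)                      # 3. latent -> waveform

audio = generate("rain on a tin roof", seconds=2.0)
print(audio.shape)  # stereo buffer: (2, n_samples)
```

The key design point is that the diffusion model never touches raw samples: it works in the autoencoder's compressed latent space, which is what makes generation cheap enough for a phone.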
According to Stability AI, the model excels particularly in generating sound effects and field recordings but still has limitations in music generation, especially when handling vocals. At present, it primarily supports English prompt inputs.
The model was trained using approximately 472,000 audio clips from the Freesound database that comply with CC0, CC-BY, or CC-Sampling+ licensing terms. The development team conducted a series of automated checks to screen the training data, aiming to avoid potential copyright issues.
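A license pre-filter of the kind described could look like the following sketch. The metadata field names and the allow-list here are assumptions for illustration, not Freesound's actual schema or Stability AI's pipeline:

```python
# Licenses the article says were permitted in the training set.
ALLOWED_LICENSES = {"CC0", "CC-BY", "CC-Sampling+"}

def is_usable(clip: dict) -> bool:
    """Keep only clips whose declared license is on the allow-list."""
    return clip.get("license") in ALLOWED_LICENSES

# Hypothetical clip metadata records.
clips = [
    {"id": 1, "license": "CC0"},
    {"id": 2, "license": "CC-BY-NC"},    # non-commercial: excluded
    {"id": 3, "license": "CC-Sampling+"},
]

dataset = [c for c in clips if is_usable(c)]
print([c["id"] for c in dataset])  # -> [1, 3]
```

In practice such a check would be one of several automated passes (license fields, duplicate detection, and so on) rather than the whole screening process.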