Audio generation technology is undergoing a paradigm shift from cascade architectures to end-to-end generation. To address the information loss and error accumulation caused by the Mel-spectrogram intermediate representation in traditional TTS systems, the Meituan LongCat team today officially released and open-sourced LongCat-AudioDiT (available in 1B and 3.5B versions). By modeling directly in the waveform latent space, the model pushes past the performance limits of zero-shot voice cloning.

Core Architecture: Saying Goodbye to Mel Spectrograms
LongCat-AudioDiT abandons the traditional multi-stage process of "predicting acoustic features + neural vocoder," and builds a minimal architecture composed of Wav-VAE (Waveform Variational Autoencoder) and DiT (Diffusion Transformer).
Efficient Wav-VAE: Using a fully convolutional design, it compresses 24 kHz waveforms roughly 2000-fold, down to an 11.7 Hz latent frame rate. Non-parametric shortcut branches and multi-objective adversarial training ensure the reconstructed waveform preserves a precise time-frequency structure while retaining excellent, natural listening quality.
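The compression arithmetic above is easy to sanity-check. The sketch below (not the released model code) assumes the ~2000x figure is simply the ratio of the waveform sample rate to the latent frame rate, i.e. the product of the encoder's convolutional strides:

```python
# Toy arithmetic for the Wav-VAE time compression described in the article.
SAMPLE_RATE = 24_000   # input waveform rate, Hz
LATENT_RATE = 11.7     # latent frame rate, Hz

# Audio samples represented by each latent frame (~2051, i.e. ~2000x).
hop = SAMPLE_RATE / LATENT_RATE

def latent_frames(num_samples: int) -> int:
    """Approximate number of latent frames for a waveform of given length."""
    return max(1, round(num_samples / hop))

# One second of 24 kHz audio collapses to roughly a dozen latent frames.
one_second = latent_frames(SAMPLE_RATE)
```

This low frame rate is what makes diffusion over the latent sequence tractable: the DiT attends over tens of frames per second rather than tens of thousands of samples.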
Semantic-enhanced DiT: The model innovatively fuses the raw word embeddings from the UMT5 text encoder with its top-layer hidden states, compensating for the phonetic detail lost in high-level semantic features and significantly improving the intelligibility of generated speech.
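A minimal numpy sketch of this "semantic-enhanced" conditioning: low-level token embeddings (which carry phonetic detail) are fused per token with the encoder's top-layer hidden states (which carry semantics). The concatenate-then-project fusion and all dimensions here are assumptions for illustration; the paper may use a different operator:

```python
import numpy as np

rng = np.random.default_rng(0)

T, d_emb, d_hid, d_cond = 8, 64, 96, 128   # illustrative sizes only

word_emb = rng.standard_normal((T, d_emb))    # raw UMT5 word embeddings
top_hidden = rng.standard_normal((T, d_hid))  # UMT5 top-layer hidden states

# Learned fusion projection (random weights stand in for trained ones).
W = rng.standard_normal((d_emb + d_hid, d_cond)) * 0.02

def fuse(emb: np.ndarray, hid: np.ndarray) -> np.ndarray:
    """Concatenate the two streams per token, then project to the
    conditioning width consumed by the DiT."""
    return np.concatenate([emb, hid], axis=-1) @ W

cond = fuse(word_emb, top_hidden)  # (T, d_cond) conditioning sequence
```

The point of the fusion is that neither stream alone suffices: top-layer states disambiguate meaning, while raw embeddings keep the surface form the model must actually pronounce.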
Inference Optimization: Precisely Solving Voice Drift
To further optimize generation quality, the team introduced two key technical improvements:
Dual Constraint Mechanism: Identifies and corrects the long-standing "training-inference mismatch" in flow-matching TTS. By forcibly resetting the latent variables of the prompt region at each inference step, it eliminates speaker voice drift and instability.
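A toy numpy sketch of the prompt-reset idea: during flow-matching sampling, the latent frames covering the reference prompt are overwritten at every step with the analytically known path toward the real prompt latent, so the model only ever generates the target region. The linear (rectified-flow) path x_t = (1 - t) * noise + t * data, the Euler sampler, and the placeholder velocity field are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 16, 8   # latent frames x channels (toy sizes)
P = 6          # the first P frames belong to the reference prompt

prompt_latent = rng.standard_normal((P, D))  # Wav-VAE latent of the prompt

def velocity(x, t):
    """Placeholder for the DiT velocity field; the real model predicts dx/dt."""
    return -x

def sample(steps=10):
    x = rng.standard_normal((T, D))  # x at t = 0 is pure noise
    noise_prompt = x[:P].copy()      # remember the prompt region's noise
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity(x, i * dt)  # Euler integration step
        t_next = (i + 1) * dt
        # Dual-constraint reset: snap the prompt region back onto the known
        # interpolation path instead of trusting the model's drift there.
        x[:P] = (1.0 - t_next) * noise_prompt + t_next * prompt_latent
    return x

out = sample()
```

Because the prompt region is pinned to its true trajectory, errors accumulated by the sampler cannot leak into the speaker identity carried by the prompt, which is the drift the mechanism targets.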
Adaptive Projection Guidance (APG): Replaces the traditional classifier-free guidance (CFG). APG can accurately filter beneficial components in the guidance signal and suppress signals causing audio degradation, significantly improving the naturalness of speech without causing spectral "over-saturation."
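The sketch below shows projection-based guidance in the spirit described above: the standard CFG update (cond minus uncond) is split into the component parallel to the conditional prediction, which is associated with over-saturation, and the orthogonal remainder, and the two are reweighted separately. This is a simplified illustration; any normalization or momentum terms in the actual APG formulation are omitted, and `eta` is an assumed knob:

```python
import numpy as np

def apg_guidance(cond, uncond, scale=4.0, eta=0.0):
    """Guided prediction with the parallel component damped by `eta`.

    eta = 1.0 recovers plain classifier-free guidance;
    eta = 0.0 keeps only the orthogonal part of the update.
    """
    c = cond.reshape(-1)
    d = (cond - uncond).reshape(-1)
    # Project the guidance update onto the conditional prediction's direction.
    parallel = (d @ c) / (c @ c) * c
    orthogonal = d - parallel
    guided = c + scale * (orthogonal + eta * parallel)
    return guided.reshape(cond.shape)

rng = np.random.default_rng(0)
cond = rng.standard_normal((4, 8))
uncond = rng.standard_normal((4, 8))
out = apg_guidance(cond, uncond)
```

Suppressing the parallel component lets the guidance scale stay high (for intelligibility and prompt adherence) without the spectral over-saturation that plain CFG produces at the same scale.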
Performance: SOTA-Level Cloning Accuracy
In the Seed benchmark test, LongCat-AudioDiT demonstrated dominant performance:
Similarity (SIM): The 3.5B model achieved 0.818 on the Seed-ZH test set and 0.797 on the Seed-Hard challenging sentence test set, surpassing well-known models such as Seed-TTS, CosyVoice3.5, and MiniMax-Speech.
Accuracy: With an English WER of 1.50% and a Chinese hard-sentence CER of 6.04%, it sits in the industry's top tier.
Notably, LongCat-AudioDiT outperformed multi-stage-trained models while using only ASR-transcribed pre-training data in a single training stage. The paper, code, and model weights are fully open at the following addresses:
GitHub: https://github.com/meituan-longcat/LongCat-AudioDiT
HuggingFace: https://huggingface.co/meituan-longcat/LongCat-AudioDiT
