Audio generation technology is undergoing a paradigm shift from cascade architectures to end-to-end generation. To address the information loss and error accumulation caused by the Mel-spectrogram intermediate representation in traditional TTS systems, the Meituan LongCat team today officially released and open-sourced LongCat-AudioDiT (available in 1B and 3.5B versions). By modeling directly in the waveform latent space, the model pushes past the performance limits of zero-shot voice cloning.

Core Architecture: Saying Goodbye to Mel Spectrograms
LongCat-AudioDiT abandons the traditional multi-stage process of "predicting acoustic features + neural vocoder," and builds a minimal architecture composed of Wav-VAE (Waveform Variational Autoencoder) and DiT (Diffusion Transformer).
Efficient Wav-VAE: Using a fully convolutional design, it compresses 24 kHz waveforms roughly 2000-fold, down to an 11.7 Hz latent frame rate. Non-parametric shortcut branches and multi-objective adversarial training ensure the reconstructed waveform preserves a precise time-frequency structure while retaining excellent, natural listening quality.
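The compression arithmetic above is easy to sanity-check. The sketch below (not the released model code) assumes the ~2000x figure is simply the ratio of the waveform sample rate to the latent frame rate, i.e. the product of the encoder's convolutional strides:

```python
# Toy arithmetic for the Wav-VAE time compression described in the article.
SAMPLE_RATE = 24_000   # input waveform rate, Hz
LATENT_RATE = 11.7     # latent frame rate, Hz

# Audio samples represented by each latent frame (~2051, i.e. ~2000x).
hop = SAMPLE_RATE / LATENT_RATE

def latent_frames(num_samples: int) -> int:
    """Approximate number of latent frames for a waveform of given length."""
    return max(1, round(num_samples / hop))

# One second of 24 kHz audio collapses to roughly a dozen latent frames.
one_second = latent_frames(SAMPLE_RATE)
```

This low frame rate is what makes diffusion over the latent sequence tractable: the DiT attends over tens of frames per second rather than tens of thousands of samples.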
Semantic-enhanced DiT: The model innovatively fuses the raw word embeddings from the UMT5 text encoder with its top-layer hidden states, compensating for the phonetic detail lost in high-level semantic features and significantly improving the intelligibility of generated speech.
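A minimal numpy sketch of this "semantic-enhanced" conditioning: low-level token embeddings (which carry phonetic detail) are fused per token with the encoder's top-layer hidden states (which carry semantics). The concatenate-then-project fusion and all dimensions here are assumptions for illustration; the paper may use a different operator:

```python
import numpy as np

rng = np.random.default_rng(0)

T, d_emb, d_hid, d_cond = 8, 64, 96, 128   # illustrative sizes only

word_emb = rng.standard_normal((T, d_emb))    # raw UMT5 word embeddings
top_hidden = rng.standard_normal((T, d_hid))  # UMT5 top-layer hidden states

# Learned fusion projection (random weights stand in for trained ones).
W = rng.standard_normal((d_emb + d_hid, d_cond)) * 0.02

def fuse(emb: np.ndarray, hid: np.ndarray) -> np.ndarray:
    """Concatenate the two streams per token, then project to the
    conditioning width consumed by the DiT."""
    return np.concatenate([emb, hid], axis=-1) @ W

cond = fuse(word_emb, top_hidden)  # (T, d_cond) conditioning sequence
```

The point of the fusion is that neither stream alone suffices: top-layer states disambiguate meaning, while raw embeddings keep the surface form the model must actually pronounce.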
Inference Optimization: Precisely Solving Voice Drift
To further optimize generation quality, the team introduced two key technical improvements:
Dual Constraint Mechanism: Identifies and corrects the long-standing "training-inference mismatch" in flow-matching TTS. By forcibly resetting the latent variables of the prompt region at each inference step, it eliminates speaker voice drift and instability.
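A toy numpy sketch of the prompt-reset idea: during flow-matching sampling, the latent frames covering the reference prompt are overwritten at every step with the analytically known path toward the real prompt latent, so the model only ever generates the target region. The linear (rectified-flow) path x_t = (1 - t) * noise + t * data, the Euler sampler, and the placeholder velocity field are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 16, 8   # latent frames x channels (toy sizes)
P = 6          # the first P frames belong to the reference prompt

prompt_latent = rng.standard_normal((P, D))  # Wav-VAE latent of the prompt

def velocity(x, t):
    """Placeholder for the DiT velocity field; the real model predicts dx/dt."""
    return -x

def sample(steps=10):
    x = rng.standard_normal((T, D))  # x at t = 0 is pure noise
    noise_prompt = x[:P].copy()      # remember the prompt region's noise
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity(x, i * dt)  # Euler integration step
        t_next = (i + 1) * dt
        # Dual-constraint reset: snap the prompt region back onto the known
        # interpolation path instead of trusting the model's drift there.
        x[:P] = (1.0 - t_next) * noise_prompt + t_next * prompt_latent
    return x

out = sample()
```

Because the prompt region is pinned to its true trajectory, errors accumulated by the sampler cannot leak into the speaker identity carried by the prompt, which is the drift the mechanism targets.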
Adaptive Projection Guidance (APG): Replaces the traditional classifier-free guidance (CFG). APG can accurately filter beneficial components in the guidance signal and suppress signals causing audio degradation, significantly improving the naturalness of speech without causing spectral "over-saturation."
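The sketch below shows projection-based guidance in the spirit described above: the standard CFG update (cond minus uncond) is split into the component parallel to the conditional prediction, which is associated with over-saturation, and the orthogonal remainder, and the two are reweighted separately. This is a simplified illustration; any normalization or momentum terms in the actual APG formulation are omitted, and `eta` is an assumed knob:

```python
import numpy as np

def apg_guidance(cond, uncond, scale=4.0, eta=0.0):
    """Guided prediction with the parallel component damped by `eta`.

    eta = 1.0 recovers plain classifier-free guidance;
    eta = 0.0 keeps only the orthogonal part of the update.
    """
    c = cond.reshape(-1)
    d = (cond - uncond).reshape(-1)
    # Project the guidance update onto the conditional prediction's direction.
    parallel = (d @ c) / (c @ c) * c
    orthogonal = d - parallel
    guided = c + scale * (orthogonal + eta * parallel)
    return guided.reshape(cond.shape)

rng = np.random.default_rng(0)
cond = rng.standard_normal((4, 8))
uncond = rng.standard_normal((4, 8))
out = apg_guidance(cond, uncond)
```

Suppressing the parallel component lets the guidance scale stay high (for intelligibility and prompt adherence) without the spectral over-saturation that plain CFG produces at the same scale.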
Performance: SOTA-Level Cloning Accuracy
In the Seed benchmark test, LongCat-AudioDiT demonstrated dominant performance:
Similarity (SIM): The 3.5B model achieved 0.818 on the Seed-ZH test set and 0.797 on the Seed-Hard challenging sentence test set, surpassing well-known models such as Seed-TTS, CosyVoice3.5, and MiniMax-Speech.
Accuracy: With an English WER of 1.50% and a Chinese hard-sentence CER of 6.04%, it sits in the industry's top tier.
Notably, LongCat-AudioDiT outperformed multi-stage-trained models while using only ASR-transcribed pre-training data in a single training stage. The paper, code, and model weights are fully open at the following addresses:
GitHub: https://github.com/meituan-longcat/LongCat-AudioDiT
HuggingFace: https://huggingface.co/meituan-longcat/LongCat-AudioDiT
