MOSS-TTSD (Text to Spoken Dialogue), developed by the Tsinghua University Speech and Language Laboratory (Tencent AI Lab) in collaboration with Shanghai Chuangzhi College, Fudan University, and Musi Intelligent, has been officially open-sourced, marking a major step forward for AI speech synthesis in dialogue scenarios.
The spoken dialogue generation model is built on Qwen3-1.7B-base and further trained on roughly one million hours of single-speaker speech data and 400,000 hours of dialogue speech data. Using discrete speech sequence modeling, it generates highly expressive spoken dialogue in both Chinese and English, making it especially well suited to long-form content creation such as AI podcasts, audiobooks, and film and television dubbing.
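To make the discrete-sequence pipeline concrete, the sketch below shows the typical flow for this class of speech language model: a dialogue script goes in as text, a causal language model autoregressively emits speech tokens, and a tokenizer decoder (here, the XY-Tokenizer shipped with the repo) turns those tokens back into audio. This is a minimal sketch under assumptions: whether the checkpoint loads directly through the standard transformers API, and the exact speaker-tag prompt format, are illustrative guesses; the officially supported inference path is in the GitHub repository.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "fnlp/MOSS-TTSD-v0.5"  # Hugging Face repo cited in this article

# Assumption: the released checkpoint is loadable via the generic
# transformers entry points; the repo's own inference script is the
# authoritative way to run the model.
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)

# Hypothetical two-speaker script; the real input format is defined by
# the repo's inference code and may differ.
script = "[S1] Welcome back to the show. [S2] Thanks, glad to be here."
inputs = tok(script, return_tensors="pt")

# The model autoregressively predicts discrete speech tokens.
speech_tokens = model.generate(**inputs, max_new_tokens=512)
# Turning speech tokens into a waveform requires the XY-Tokenizer
# decoder from the GitHub repo (not shown here).
```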
The core innovation of MOSS-TTSD is its XY-Tokenizer, which adopts a two-stage multi-task learning approach. Using eight RVQ codebooks, it compresses the speech signal to a bitrate of 1 kbps while preserving both semantic and acoustic information, keeping the generated speech natural and fluent. The model supports ultra-long speech generation of up to 960 seconds, avoiding the unnatural transitions that segment stitching causes in traditional TTS models. MOSS-TTSD also offers zero-shot voice cloning, reproducing both speakers in a dialogue from either a full two-speaker recording or separate single-speaker clips, and supports vocal event control, such as laughter, for added expressiveness.
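The 1 kbps figure follows directly from the tokenizer's design parameters. A back-of-the-envelope check, assuming a 12.5 Hz token frame rate and 1024-entry codebooks (both are assumptions added here to make the arithmetic concrete; only the eight-codebook RVQ depth is stated in this article):

```python
import math

num_codebooks = 8      # RVQ depth, stated in the article
codebook_size = 1024   # entries per codebook -> 10 bits per code (assumed)
frame_rate_hz = 12.5   # token frames per second of audio (assumed)

bits_per_code = math.log2(codebook_size)               # 10 bits
bitrate_bps = num_codebooks * bits_per_code * frame_rate_hz
print(f"{bitrate_bps:.0f} bps")  # 1000 bps, i.e. the 1 kbps in the article

# Under the same assumptions, 960 s of audio is a tractable sequence:
max_seconds = 960
codes = int(max_seconds * frame_rate_hz * num_codebooks)
print(f"{codes} codes")  # 12,000 frames x 8 codebooks = 96,000 codes
```

The low bitrate is what makes ultra-long generation feasible: the flattened token sequence for a 16-minute dialogue stays within a language model's practical context budget.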
Compared with other voice models on the market, MOSS-TTSD clearly outperforms the open-source MoonCast on objective Chinese-language metrics, with strong prosody and naturalness, though it still trails ByteDance's Doubao voice model slightly in tone and rhythm. Even so, being open-source and free for commercial use gives MOSS-TTSD strong application potential. Model weights, inference code, and API interfaces are fully open-sourced via GitHub (https://github.com/OpenMOSS/MOSS-TTSD) and HuggingFace (https://huggingface.co/fnlp/MOSS-TTSD-v0.5), and official documentation and an online demo give developers convenient access.
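Since the weights are public, fetching the checkpoint takes a few lines with the huggingface_hub library (a real API; the repo ID is the one cited above):

```python
from huggingface_hub import snapshot_download

# Download the released MOSS-TTSD checkpoint to the local HF cache.
local_dir = snapshot_download(repo_id="fnlp/MOSS-TTSD-v0.5")
print(local_dir)  # path to the cached model weights
```

From there, the inference scripts in the GitHub repository are the supported way to run generation.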
The release of MOSS-TTSD brings new momentum to AI speech interaction, especially in scenarios such as long-form interviews, podcast production, and film and television dubbing, where its stability and expressiveness can accelerate intelligent content creation. Going forward, the team plans to further optimize the model, improving speaker-switching accuracy and emotional expression in multi-speaker scenarios.
Project address: https://github.com/OpenMOSS/MOSS-TTSD