VibeVoice is an open-source family of speech AI models that includes long-form automatic speech recognition (ASR) and text-to-speech (TTS) models. It uses a continuous speech tokenizer that operates at a very low frame rate, which lets it process long sequences: the ASR model can transcribe a 60-minute recording in a single pass and produce structured output. VibeVoice supports multiple languages and focuses on making generated speech more natural and expressive. It is free to use and aimed at researchers and developers working on speech recognition and synthesis, who are expected to use it responsibly.
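
To see why a low frame rate matters for single-pass processing of an hour of audio, the rough arithmetic can be sketched as follows. The description above does not give a frame rate, so the 7.5 Hz value is an assumption used for illustration, and the 50 Hz comparison point and the helper name `frames_for_audio` are likewise hypothetical.

```python
import math


def frames_for_audio(duration_minutes: float, frame_rate_hz: float) -> int:
    """Return how many tokenizer frames are needed to cover the audio."""
    return math.ceil(duration_minutes * 60 * frame_rate_hz)


# Assumed values, not taken from the description above.
LOW_RATE_HZ = 7.5    # illustrative "very low" tokenizer frame rate
HIGH_RATE_HZ = 50.0  # illustrative higher-rate tokenizer for comparison

low = frames_for_audio(60, LOW_RATE_HZ)
high = frames_for_audio(60, HIGH_RATE_HZ)

print(f"60 min at {LOW_RATE_HZ} Hz:  {low} frames")    # 27000 frames
print(f"60 min at {HIGH_RATE_HZ} Hz: {high} frames")  # 180000 frames
```

Under these assumed numbers, the lower frame rate yields a sequence several times shorter for the same hour of audio, which is what would make a single pass over the whole recording more practical.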