Spirit LM

Multimodal language model that integrates text and speech

Common Product, Productivity, Multimodal, Language Model
Spirit LM is a foundation multimodal language model that freely mixes text and speech. It is built on a 7B pretrained text language model and extended to the speech modality by continuing to train it on both text and speech units. Speech and text sequences are concatenated into a single token stream, and the model is trained with a word-level interleaving method on a small, automatically curated speech-text parallel corpus. Spirit LM comes in two versions: the Base version uses speech phonetic units (HuBERT), while the Expressive version adds pitch and style units to the phonetic units in order to model expressivity. In both versions, text is encoded with subword BPE tokens. The resulting model shows both the semantic abilities of text models and the expressive abilities of speech models. Spirit LM can also learn new tasks across modalities in a few-shot fashion (e.g., ASR, TTS, speech classification).
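The word-level interleaving described above can be pictured with a small sketch. This is only an illustration under stated assumptions, not Spirit LM's actual data pipeline: the marker strings, the function name interleave, the switch_prob parameter, and the unit names such as "u12" are all hypothetical, and the sketch assumes each word already comes with aligned text tokens and speech units.

```python
import random
from typing import List, Optional, Tuple

# Hypothetical modality markers, chosen here for illustration only.
TEXT_MARK = "[TEXT]"
SPEECH_MARK = "[SPEECH]"

# One aligned word: (its text tokens, its speech unit tokens).
AlignedWord = Tuple[List[str], List[str]]


def interleave(
    words: List[AlignedWord],
    switch_prob: float = 0.3,
    start: str = "text",
    seed: Optional[int] = None,
) -> List[str]:
    """Build one token stream that switches modality only at word boundaries.

    At each boundary after the first word, the modality flips with
    probability switch_prob (an assumed sampling rule), and a marker
    token is emitted to signal the change.
    """
    if start not in ("text", "speech"):
        raise ValueError("start must be 'text' or 'speech'")
    rng = random.Random(seed)
    modality = start
    stream = [TEXT_MARK if modality == "text" else SPEECH_MARK]
    for i, (text_tokens, speech_units) in enumerate(words):
        # Possibly switch modality, but never in the middle of a word.
        if i > 0 and rng.random() < switch_prob:
            modality = "speech" if modality == "text" else "text"
            stream.append(TEXT_MARK if modality == "text" else SPEECH_MARK)
        stream.extend(text_tokens if modality == "text" else speech_units)
    return stream


# Toy aligned sentence; the unit names are made up.
example_words: List[AlignedWord] = [
    (["the"], ["u12", "u7"]),
    (["cat"], ["u3", "u3", "u41"]),
    (["sat"], ["u9"]),
]

print(interleave(example_words, switch_prob=0.3, seed=0))
```

With switch_prob set to 0.0 the stream stays in the starting modality, and with 1.0 it alternates at every word boundary, which makes the two extremes easy to inspect by hand.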

Spirit LM Visit Over Time

Monthly Visits: 13210
Bounce Rate: 45.30%
Pages per Visit: 1.5
Visit Duration: 00:00:05
