Spirit LM
Multimodal language model that integrates text and speech
Spirit LM is a foundation multimodal language model that freely mixes text and speech. It is built on a 7B pretrained text language model and extended to the speech modality by continually training on both text and speech units. Speech and text sequences are concatenated into a single token stream and trained with a word-level interleaving method, using a small automatically curated speech-text parallel corpus. Spirit LM comes in two versions: the Base version uses speech semantic units (HuBERT), while the Expressive version adds pitch and style units to model expressivity. In both versions, text is encoded with subword BPE tokens. The resulting model combines the semantic abilities of text models with the expressive abilities of speech models. We further show that Spirit LM can learn new tasks across modalities from few examples (e.g., ASR, TTS, speech classification).
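The word-level interleaving described above can be sketched as follows. This is a minimal illustration, not the actual training pipeline: the `[TEXT]`/`[SPEECH]` markers, the `aligned_words` record layout (each word carrying its BPE tokens and HuBERT units), and the switching probability are all assumptions made for the example.

```python
import random

# Hypothetical modality markers delimiting text vs. speech spans.
TEXT, SPEECH = "[TEXT]", "[SPEECH]"

def interleave(aligned_words, p_switch=0.3, seed=0):
    """Build a single token stream from word-aligned speech-text data.

    aligned_words: list of dicts, each with "bpe" (text subword tokens)
    and "hubert" (speech units) for the same word. The modality may
    switch only at word boundaries; a marker token is emitted whenever
    the modality changes.
    """
    rng = random.Random(seed)
    stream, modality, last = [], "text", None
    for word in aligned_words:
        if rng.random() < p_switch:  # flip modality at a word boundary
            modality = "speech" if modality == "text" else "text"
        if modality != last:  # emit marker only when modality changes
            stream.append(TEXT if modality == "text" else SPEECH)
            last = modality
        stream.extend(word["bpe"] if modality == "text" else word["hubert"])
    return stream

# Toy example: two words, each aligned to BPE tokens and speech units.
words = [
    {"bpe": ["he", "llo"], "hubert": ["h12", "h7", "h12"]},
    {"bpe": ["world"], "hubert": ["h3", "h9"]},
]
stream = interleave(words)
```

Because both modalities share one vocabulary and one stream, the same autoregressive objective trains the model to continue a sequence in either modality, which is what enables few-shot cross-modal tasks such as ASR and TTS.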
Spirit LM Visits Over Time
- Monthly Visits: 13,210
- Bounce Rate: 45.30%
- Pages per Visit: 1.5
- Visit Duration: 00:00:05