Recently, a team of researchers from Johns Hopkins University released mmBERT, a new multilingual encoder designed to fill a long-standing gap in multilingual natural language processing. The model outperforms XLM-R on multiple tasks and runs 2 to 4 times faster than previous multilingual encoders, giving researchers and developers a stronger foundation for multilingual applications.

mmBERT comes in two configurations: a base model and a small model. The base model has 22 transformer layers, a hidden dimension of 1152, and roughly 307 million total parameters, while the small model has about 140 million parameters. mmBERT adopts the Gemma 2 tokenizer with a 256k vocabulary and combines rotary position embeddings (RoPE) with FlashAttention2, which significantly improves processing efficiency. In addition, the maximum sequence length is extended from 1024 to 8192 tokens, allowing the model to process much longer contexts.
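
As a concrete illustration, the snippet below loads the encoder with Hugging Face Transformers and runs it on a long input. The model id `jhu-clsp/mmBERT-base` is an assumption based on the project's GitHub organization; check the repository linked below for the official checkpoints.

```python
from transformers import AutoTokenizer, AutoModel

model_id = "jhu-clsp/mmBERT-base"  # assumed checkpoint name; see the GitHub repo
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# With rotary position embeddings, a single forward pass can cover
# sequences of up to 8192 tokens.
long_text = "A long multilingual document ... " * 400
inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=8192)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```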

For training data, mmBERT draws on 3 trillion tokens from multiple sources covering 1,833 languages, with English making up only 10% to 34% of the corpus. Training is divided into three stages: pre-training, mid-training, and a decay phase. At each stage the model is exposed to more languages and higher-quality data, which helps improve performance on low-resource languages.
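
The staged schedule works by gradually flattening how languages are sampled, so low-resource languages receive a larger share of the data later in training. The sketch below illustrates the general idea of temperature-based language sampling; the corpus sizes and temperature values are placeholders, not the figures used to train mmBERT.

```python
# Toy illustration of temperature-based language sampling.
corpus_sizes = {"en": 1_000_000, "de": 200_000, "fo": 2_000, "ti": 1_000}

def sampling_probs(sizes, tau):
    # p_i is proportional to n_i ** tau; a smaller tau flattens the
    # distribution, giving low-resource languages a larger share.
    weights = {lang: n ** tau for lang, n in sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

# Lowering the temperature in later stages shifts probability mass toward
# low-resource languages (placeholder schedule, not mmBERT's actual values).
for stage, tau in [("pre-training", 0.7), ("mid-training", 0.5), ("decay", 0.3)]:
    probs = sampling_probs(corpus_sizes, tau)
    print(stage, {lang: round(p, 3) for lang, p in probs.items()})
```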

mmBERT delivers strong results across multiple benchmarks. On English natural language understanding (GLUE), the base model scores 86.3, surpassing XLM-R's 83.3. On multilingual natural language understanding (XTREME), mmBERT scores 72.8, again ahead of XLM-R's 70.4. It also performs well on embedding and code retrieval tasks, underscoring its potential across a range of application scenarios.
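
For embedding and retrieval use cases, an encoder like this is typically used by pooling its token representations into a single vector. The sketch below shows one common recipe (masked mean pooling plus cosine similarity); it again assumes the hypothetical model id `jhu-clsp/mmBERT-base` and is not necessarily the exact setup behind the reported scores.

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "jhu-clsp/mmBERT-base"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

def embed(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state            # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()     # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # masked mean pooling
    return torch.nn.functional.normalize(pooled, dim=-1)     # unit-length vectors

# Example: code retrieval by cosine similarity between query and documents.
query = embed(["def quicksort(arr): ..."])
docs = embed(["Sorts a list with quicksort.", "Reads a CSV file from disk."])
print(query @ docs.T)  # higher score = better match
```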

mmBERT also pays particular attention to low-resource languages, ensuring they are well represented during training. Across several benchmarks, its performance on low-resource languages such as Faroese and Tigrinya surpasses that of other large models, showing that carefully trained encoder models can handle low-resource scenarios effectively.

mmBERT not only improves the speed and efficiency of multilingual processing but also lays a solid foundation for the next generation of multilingual natural language processing systems. It redefines the potential of multilingual encoders in an efficient and open manner, marking the arrival of a new era.

GitHub: https://github.com/JHU-CLSP/mmBERT

Key Points:

🌍 The mmBERT model outperforms XLM-R on multiple tasks, setting a new benchmark for multilingual NLP.

⚡ The model runs 2 to 4 times faster than previous models and supports inputs of up to 8192 tokens.

📊 mmBERT pays special attention to low-resource languages during training, demonstrating strong adaptability and broad application potential.