RedNote (Xiaohongshu) recently announced the open-sourcing of its first large language model, dots.llm1, a mixture-of-experts (MoE) model with 142 billion parameters. A notable feature of its design is that only 14 billion parameters are activated during inference, allowing the model to maintain high performance while significantly reducing training and inference costs.


The dots.llm1 model was trained on 11.2 trillion tokens of non-synthetic, high-quality data, which is very rare among current open-source large models and reflects RedNote's strength in language data processing. The model performs excellently on Chinese benchmarks, with an average score of 91.3, surpassing competitors such as DeepSeek-V2, DeepSeek-V3, and Alibaba's Qwen2.5 series.

In terms of technical architecture, dots.llm1 adopts a decoder-only Transformer structure and replaces the traditional feedforward network with an MoE layer. Unlike a conventional dense model, MoE splits the feedforward layer into multiple expert networks, each specializing in different features of the input data, so that only a small fraction of these networks is activated for each token during inference, greatly reducing compute requirements.
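To make the savings concrete, here is a back-of-the-envelope sketch. The hidden dimensions are hypothetical placeholders; only the 128-expert and top-6 counts come from the configuration described below.

```python
# Toy parameter arithmetic (illustrative dimensions, not dots.llm1's real ones):
# a dense feedforward layer uses all of its parameters for every token, while an
# MoE layer of the same total size activates only top_k of its experts per token.
dim, hidden = 4096, 8192                     # hypothetical hidden sizes
params_per_expert = 2 * dim * hidden         # up-projection + down-projection
n_experts, top_k = 128, 6                    # routed-expert counts quoted below

total_expert_params = n_experts * params_per_expert
active_expert_params = top_k * params_per_expert

print(f"total expert parameters: {total_expert_params / 1e9:.1f}B")
print(f"active per token:        {active_expert_params / 1e9:.2f}B "
      f"({100 * top_k / n_experts:.1f}% of the expert parameters)")
```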

Specifically, dots.llm1 includes 128 routed experts and 2 shared experts. Each expert is a two-layer feedforward network that uses the SwiGLU activation function to capture complex relationships in the data. When processing an input token, the model dynamically selects the 6 most relevant routed experts and always applies the 2 shared experts, so only that subset participates in the computation; a minimal sketch of this layout follows.
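The PyTorch sketch below illustrates the expert layout described above. The dimensions, the router's score normalization, and the loop-based dispatch are simplified placeholders rather than dots.llm1's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """Two-layer feedforward expert with a SwiGLU gate: down(silu(gate(x)) * up(x))."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

class MoELayer(nn.Module):
    """128 routed experts (top-6 per token) plus 2 shared experts that always run."""
    def __init__(self, dim: int, hidden: int,
                 n_routed: int = 128, n_shared: int = 2, top_k: int = 6):
        super().__init__()
        self.routed = nn.ModuleList(SwiGLUExpert(dim, hidden) for _ in range(n_routed))
        self.shared = nn.ModuleList(SwiGLUExpert(dim, hidden) for _ in range(n_shared))
        self.router = nn.Linear(dim, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim). Score every routed expert, keep the top-k per token.
        scores = self.router(x)                          # (tokens, n_routed)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        weights = top_scores.softmax(dim=-1)             # mix the chosen experts
        out = torch.zeros_like(x)
        for t in range(x.size(0)):                       # per-token loop for clarity, not speed
            for w, e in zip(weights[t], top_idx[t]):
                out[t] += w * self.routed[int(e)](x[t])
        for expert in self.shared:                       # shared experts see every token
            out = out + expert(x)
        return out

layer = MoELayer(dim=64, hidden=128)   # toy sizes; the real model is far larger
tokens = torch.randn(4, 64)
print(layer(tokens).shape)             # torch.Size([4, 64])
```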

In addition, dots.llm1 introduces an improved RMSNorm normalization operation to stabilize training and model outputs. In the MoE module, a load-balancing strategy ensures that all expert networks are used evenly, avoiding over-reliance on a few experts.
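The article does not detail what the improved variant changes, so for reference the sketch below shows the standard RMSNorm formulation, which rescales activations by their root mean square without mean-centering.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Standard RMSNorm: scale activations by their root mean square, no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1 / sqrt(mean(x^2) + eps) over the feature dimension
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * x * inv_rms

norm = RMSNorm(64)
print(norm(torch.randn(4, 64)).shape)   # torch.Size([4, 64])
```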

To improve training efficiency, dots.llm1 also uses the AdamW optimizer, whose decoupled weight decay helps curb overfitting and keep gradient updates stable.
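As an illustration of this kind of setup, the fragment below pairs AdamW with gradient-norm clipping, a common combination for training stability; the toy model, loss, and hyperparameter values are placeholders, not dots.llm1's published settings.

```python
import torch
import torch.nn as nn

# Toy stand-in model and data; the point is the optimizer pattern, not the architecture.
model = nn.Linear(64, 64)
data = [torch.randn(8, 64) for _ in range(10)]

# AdamW applies weight decay separately from the gradient update (regularization).
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.9, 0.95), weight_decay=0.1)

for x in data:
    loss = (model(x) - x).pow(2).mean()      # dummy reconstruction loss
    loss.backward()
    # Clip the global gradient norm to guard against exploding gradients.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
```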

Data processing is crucial for training large models. dots.llm1 went through a rigorous three-stage data processing pipeline to ensure high-quality training data; after a series of filtering and cleaning steps, this yields the 11.2 trillion tokens of high-quality training data. Moreover, RedNote has also open-sourced intermediate training checkpoints saved every 1 trillion tokens to support further academic research.

Open-source address: https://huggingface.co/rednote-hilab/dots.llm1.base/tree/main
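The open-sourced weights can presumably be loaded through the standard Hugging Face transformers auto classes, as sketched below; the exact arguments (such as trust_remote_code or the dtype and device settings) may differ, so check the repo's model card for the recommended usage.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "rednote-hilab/dots.llm1.base"   # repo linked above

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # use the checkpoint's native precision
    device_map="auto",       # spread the 142B-parameter model across available devices
    trust_remote_code=True,
)

inputs = tokenizer("人工智能的未来是", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```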

Key points:

🌟 dots.llm1 is RedNote's first open-source large model, built on a mixture-of-experts architecture with 142 billion parameters.

📊 The model is trained on 11.2 trillion tokens of non-synthetic data and performs excellently on Chinese benchmarks.

🔍 A strict data processing pipeline ensures the quality and reliability of the training data.