Ant Group and Renmin University have jointly developed LLaDA-MoE, a diffusion language model (dLLM) with a native MoE architecture, trained from scratch on about 20T tokens of data. The work verifies the scalability and stability of industrial-scale training. The model outperforms the previously released dense diffusion language models LLaDA 1.0/1.5 and Dream-7B, matches autoregressive models of comparable scale, and retains a significant advantage in inference speed. It will be fully open-sourced soon to advance dLLM research across the global AI community.

On September 11, at the 2025 Inclusion·Bund Conference, Ant Group and Renmin University jointly released the industry's first diffusion language model (dLLM) with a native MoE architecture, "LLaDA-MoE." Li Chongxuan, Assistant Professor at the Guangqi Institute of Artificial Intelligence, Renmin University, and Lan Zhenzhong, Director of the General Artificial Intelligence Research Center at Ant Group, Adjunct Researcher at Westlake University, and founder of Westlake Xinchen, attended the launch ceremony.


(Renmin University and Ant Group jointly launched the first MoE architecture diffusion model LLaDA-MoE)

According to the introduction, the new model uses a non-autoregressive masked diffusion mechanism and, for the first time, achieves language intelligence comparable to Qwen2.5 (including in-context learning, instruction following, and code and math reasoning) in a natively trained MoE large language model, challenging the mainstream view that "language models must be autoregressive."
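
To make the mechanism concrete, the sketch below shows how a masked diffusion model can decode in parallel: generation starts from a fully masked response and repeatedly reveals the positions the model is most confident about, with context visible on both sides at every step. This is a minimal illustration under assumed interfaces (`model`, `mask_id`, and the confidence-based schedule are placeholders), not LLaDA-MoE's actual API.

```python
import torch

def masked_diffusion_generate(model, prompt_ids, gen_len=128, steps=32, mask_id=0):
    """Illustrative masked-diffusion decoding loop (not LLaDA-MoE's released API).

    Start from a fully masked response and, over a fixed number of steps,
    fill in the positions the model is most confident about. The model sees
    the whole sequence at every step, so context flows in both directions.
    """
    device = prompt_ids.device
    x = torch.cat([prompt_ids,
                   torch.full((gen_len,), mask_id, dtype=prompt_ids.dtype, device=device)])
    tokens_per_step = max(1, gen_len // steps)

    for _ in range(steps):
        masked = x == mask_id
        if not masked.any():
            break
        logits = model(x.unsqueeze(0)).squeeze(0)            # [seq_len, vocab_size]
        conf, pred = torch.softmax(logits, dim=-1).max(dim=-1)
        conf = torch.where(masked, conf, torch.full_like(conf, -1.0))
        # Reveal the highest-confidence masked positions in parallel.
        k = min(tokens_per_step, int(masked.sum()))
        top = torch.topk(conf, k).indices
        x[top] = pred[top]
    return x[prompt_ids.numel():]
```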

Performance data show that LLaDA-MoE outperforms diffusion language models such as LLaDA 1.0/1.5 and Dream-7B on code, mathematics, and agent tasks, and approaches or surpasses the autoregressive model Qwen2.5-3B-Instruct, matching the performance of an equivalent 3B dense model while activating only 1.4B parameters.
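
The "activate 1.4B of 7B parameters" figure comes from sparse expert routing: each token is dispatched to only a few experts out of many, so the parameters touched per token are a small fraction of the total. The toy layer below illustrates the idea; the sizes, expert count, and top-k value are assumptions for illustration, not LLaDA-MoE's published configuration.

```python
import torch
import torch.nn as nn

class SparseMoELayer(nn.Module):
    """Toy mixture-of-experts feed-forward layer (illustrative only).

    All experts' parameters exist in the model, but each token is routed to
    only top_k of them, so the compute (activated parameters) per token is a
    small fraction of the total -- the idea behind "7B total, 1.4B activated".
    """
    def __init__(self, d_model=1024, d_ff=2816, n_experts=64, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                                   # x: [n_tokens, d_model]
        scores = self.router(x).softmax(dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)      # per-token expert choice
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):              # naive dispatch, clarity over speed
                sel = idx[:, k] == e
                if sel.any():
                    out[sel] += weights[sel, k, None] * self.experts[e](x[sel])
        return out
```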


(Performance of LLaDA-MoE)

"The LLaDA-MoE model has verified the scalability and stability of industrial-scale large-scale training, meaning we have taken another step forward in scaling up dLLM to larger scales," said Lan Zhenzhong at the launch event.

Li Chongxuan, Assistant Professor at the Guangqi Institute of Artificial Intelligence, Renmin University, explained: "Over the past two years, the capabilities of large AI models have advanced rapidly, yet some problems remain fundamentally unsolved. The root cause is that the autoregressive generation paradigm prevailing in today's large models is inherently unidirectional: tokens are generated one after another, from left to right, which makes it difficult to capture bidirectional dependencies between tokens."
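
Schematically, the contrast can be written as follows (standard formulations from the dLLM literature, not details specific to LLaDA-MoE): an autoregressive model factorizes the sequence strictly left to right, while a masked diffusion model is trained to recover masked tokens from context on both sides.

```latex
% Autoregressive LM: strictly left-to-right factorization
p_\theta(x) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t})

% Masked diffusion LM (schematic): recover the masked positions of a
% partially masked sequence x_t, conditioning on context from both sides
\mathcal{L}(\theta) = -\,\mathbb{E}_{t,\,x_0,\,x_t}\!\left[\frac{1}{t}
    \sum_{i} \mathbf{1}\big[x_t^{\,i} = \texttt{[MASK]}\big]\,
    \log p_\theta\big(x_0^{\,i} \mid x_t\big)\right]
```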

Facing these issues, some researchers have taken a different route, turning to parallel-decoding diffusion language models. However, existing dLLMs are all built on dense architectures, making it hard to replicate the advantage MoE brings to autoregressive models: expanding parameters while keeping computation efficient. Against this backdrop, the joint research team from Ant Group and Renmin University introduced LLaDA-MoE, the first diffusion language model built natively on an MoE architecture.

Lan Zhenzhong also stated, "We will open-source the model weights and our self-developed inference framework to the global community to jointly drive the next breakthrough in AGI."

According to the team, the Ant Group and Renmin University researchers spent about three months rewriting the training code on top of LLaDA-1.0 and using ATorch, Ant's self-developed distributed framework, to provide expert parallelism (EP) and other parallel acceleration techniques. Building on the training data of Ant's Ling 2.0 base model, they overcame core challenges such as load balancing and noise-sampling drift, and completed efficient training on about 20T tokens with the 7B-A1B MoE architecture (7B total parameters, 1.4B activated).
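
The article does not describe how load balancing was solved; as one hypothetical illustration, MoE training commonly adds an auxiliary loss in the style of Switch Transformer that pushes the router toward a uniform spread of tokens across experts, which matters even more under expert parallelism, where an overloaded expert stalls its device.

```python
import torch

def load_balancing_loss(router_probs, expert_idx, n_experts):
    """Switch-Transformer-style auxiliary load-balancing loss (illustrative;
    not LLaDA-MoE's published recipe).

    router_probs: [n_tokens, n_experts] softmax outputs of the router.
    expert_idx:   [n_tokens, top_k] hard expert assignments per token.
    The loss is minimized when tokens are spread uniformly across experts.
    """
    # Fraction of routing decisions dispatched to each expert (hard counts).
    dispatch = torch.zeros(n_experts, device=router_probs.device)
    dispatch.scatter_add_(0, expert_idx.flatten(),
                          torch.ones(expert_idx.numel(), device=router_probs.device))
    dispatch = dispatch / expert_idx.numel()
    # Mean router probability mass assigned to each expert (soft importance).
    importance = router_probs.mean(dim=0)
    return n_experts * torch.sum(dispatch * importance)
```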

Under Ant's self-developed unified evaluation framework, LLaDA-MoE achieved an average improvement of 8.4% across 17 benchmarks including HumanEval, MBPP, GSM8K, MATH, IFEval, and BFCL, led LLaDA-1.5 by 13.2%, and tied with Qwen2.5-3B-Instruct. The experiments once again verified that the "MoE amplifier" effect also holds in the dLLM domain, offering a feasible path toward subsequent 10B–100B sparse models.

According to Lan Zhenzhong, in addition to the model weights, Ant will also open-source its inference engine, which is optimized for the parallel-decoding characteristics of dLLMs and achieves significant acceleration compared with NVIDIA's official fast-dLLM. The related code and technical report will be released soon on GitHub and Hugging Face.