On October 13, Ant Group officially open-sourced dInfer, the industry's first high-performance inference framework for diffusion language models (dLLMs).

In benchmark tests, dInfer delivered a 10.7x speedup in diffusion-language-model inference over NVIDIA's diffusion framework Fast-dLLM. On the HumanEval code-generation benchmark, dInfer reached 1,011 tokens/second in single-batch inference, making it the first open-source framework to run a diffusion language model significantly faster than comparable autoregressive models in single-batch decoding. The results show that the efficiency potential of diffusion language models is substantial and can be unlocked through systematic engineering innovation, making them a highly competitive option on the architectural path toward AGI.

Diffusion language models are a new paradigm that treats text generation as a "denoising" process: a complete sequence is gradually recovered from random noise. This gives them three major advantages: high parallelism, a global view of the sequence, and structural flexibility. Building on these advantages, models such as LLaDA-MoE, released by Ant Group and Renmin University of China, have reached accuracy comparable to top autoregressive (AR) models across multiple benchmarks. In inference efficiency, however, the strong theoretical potential of dLLMs has long been held back by harsh realities. Efficient dLLM inference faces three major challenges: high computational cost, the breakdown of standard KV caching, and the difficulty of parallel decoding. These bottlenecks have kept the inference speed of diffusion language models unsatisfactory, and breaking free of them to unlock the efficiency potential of dLLMs has become a pressing problem in the field.
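To make the "denoising" picture concrete, here is a minimal sketch of mask-based parallel decoding in the style used by LLaDA-family dLLMs. The random-logits model stand-in, the vocabulary size, and the confidence-based unmasking schedule are illustrative assumptions, not dInfer's actual algorithm.

```python
import torch

# Toy illustration of mask-based diffusion decoding (LLaDA-style).
# The "model" below just returns random logits; a real dLLM predicts
# token distributions for every masked position in parallel.
VOCAB, SEQ_LEN, STEPS = 1000, 32, 8
MASK_ID = VOCAB  # mask token sits outside the normal vocabulary here

def model(tokens):
    # placeholder denoiser (assumption): one logit vector per position
    return torch.randn(tokens.shape[0], VOCAB)

def diffusion_decode(seq_len=SEQ_LEN, steps=STEPS):
    tokens = torch.full((seq_len,), MASK_ID)   # start fully "noised" (all masked)
    for step in range(steps):
        logits = model(tokens)                 # predict all positions in parallel
        probs, preds = logits.softmax(-1).max(-1)
        masked = tokens == MASK_ID
        # commit only the most confident masked positions this round,
        # unmasking roughly 1/steps of the sequence per iteration
        k = max(1, int(masked.sum()) // (steps - step))
        conf = torch.where(masked, probs, torch.full_like(probs, -1.0))
        commit = conf.topk(k).indices
        tokens[commit] = preds[commit]
    return tokens

print(diffusion_decode())
```

Each iteration scores every masked position at once, which is where the parallelism of dLLMs comes from; the price is that every iteration is a full forward pass over the whole sequence, which is exactly the computational-cost and caching challenge described above.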

dInfer is a high-performance inference framework purpose-built for diffusion language models, with deep co-design of algorithms and systems. It supports multiple diffusion language models, including LLaDA, LLaDA-MoE, and LLaDA-MoE-TD.

dInfer comprises four core modules: Model, KV-Cache Manager, Iteration Manager, and Decoder. This plug-and-play architecture lets developers mix and match optimization strategies for each module like LEGO bricks and run standardized evaluations on a unified platform. More importantly, each module integrates targeted solutions to the three challenges above.

(Figure: Architecture of dInfer)
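As a rough illustration of how such a modular layout can be composed, the sketch below wires four stand-in modules into one pipeline. The class names, method signatures, and the toy denoising logic are hypothetical and do not reflect dInfer's actual API.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch of a plug-and-play dLLM pipeline: four modules with
# narrow interfaces, swapped independently. Names and logic are illustrative.

class Model:
    def denoise(self, tokens: List[int]) -> List[int]:
        # toy "denoiser": replace masks (-1) with a dummy token id
        return [t if t != -1 else 42 for t in tokens]

class KVCacheManager:
    def refresh(self, step: int) -> None:
        pass  # a real manager decides when cached KV states can be reused

class IterationManager:
    def positions_to_unmask(self, tokens: List[int], step: int, steps: int) -> List[int]:
        masked = [i for i, t in enumerate(tokens) if t == -1]
        return masked[: max(1, len(masked) // (steps - step))]

class Decoder:
    def commit(self, tokens: List[int], proposal: List[int], positions: List[int]) -> List[int]:
        for i in positions:
            tokens[i] = proposal[i]
        return tokens

@dataclass
class Pipeline:
    model: Model
    cache: KVCacheManager
    iterator: IterationManager
    decoder: Decoder

    def generate(self, length: int = 16, steps: int = 4) -> List[int]:
        tokens = [-1] * length                      # fully masked sequence
        for step in range(steps):
            self.cache.refresh(step)
            proposal = self.model.denoise(tokens)   # parallel prediction
            positions = self.iterator.positions_to_unmask(tokens, step, steps)
            tokens = self.decoder.commit(tokens, proposal, positions)
        return tokens

print(Pipeline(Model(), KVCacheManager(), IterationManager(), Decoder()).generate())
```

Because each module only talks to the others through a narrow interface, a new cache policy or decoding rule can be benchmarked by swapping in a single component, which is the "LEGO brick" property the framework emphasizes.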

On a node equipped with 8 NVIDIA H800 GPUs, dInfer's performance is impressive:

Compared with Fast-dLLM, the previous dLLM inference solution, dInfer delivers a 10.7x higher average inference speed (681 vs. 63.6 avg TPS) at comparable model quality; on the HumanEval code-generation task, it reaches 1,011 tokens/second in single-batch inference. Against Qwen2.5-3B, an AR model of similar size and quality running on the industry-leading inference framework vLLM, dInfer's average inference speed is 2.5x higher (681 vs. 277 avg TPS).

Ant Group stated that dInfer connects cutting-edge research with industrial application, marking a crucial step for diffusion language models from "theoretical feasibility" to "practical efficiency." With this open-source release, it also invites developers and researchers worldwide to jointly explore the potential of diffusion language models and build a new, more efficient, and more open AI ecosystem.