Recently, Ant Group officially open-sourced dInfer, the industry's first high-performance inference framework for diffusion language models. The release marks a significant breakthrough in the inference speed of diffusion language models (dLLMs) and an important step toward practical deployment of this emerging technology.
In the latest benchmark tests, dInfer runs 10.7 times faster than NVIDIA's Fast-dLLM framework. On the HumanEval code generation benchmark, dInfer reached 1,011 tokens per second in a single inference run, the first time in the open-source community that a diffusion language model's inference speed has clearly surpassed that of traditional autoregressive models. This progress has raised expectations that diffusion language models could become an important technological path toward Artificial General Intelligence (AGI).

What sets diffusion language models apart is that they treat text generation as a denoising process: a complete sequence is gradually recovered from random noise, which gives them high parallelism, a global view of the sequence, and a flexible generation structure. Despite this theoretical potential, dLLMs have been held back in practical inference by high computational cost, the breakdown of KV caching, and the difficulty of efficient parallel decoding. These obstacles have kept diffusion language models from realizing their full potential, and breakthroughs are urgently needed.
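To make the contrast with token-by-token autoregressive decoding concrete, the toy sketch below imitates the mask-and-denoise loop such models use: start from a fully masked sequence, predict every position in parallel, and commit only the most confident predictions at each step. The model here is a random stand-in, not dInfer or any real dLLM, and the confidence-based commit rule and all names are illustrative assumptions.

```python
import torch

# Toy sketch of the mask-and-denoise generation loop used by diffusion
# language models. The "model" below returns random logits and is purely
# illustrative; a real dLLM predicts all masked positions in one forward pass.
MASK_ID = 0
VOCAB_SIZE = 32

def toy_model(tokens: torch.Tensor) -> torch.Tensor:
    # Stand-in for a dLLM forward pass: logits of shape (seq_len, vocab_size).
    return torch.randn(tokens.shape[0], VOCAB_SIZE)

def denoise_generate(seq_len: int = 16, steps: int = 8) -> torch.Tensor:
    # Start from pure "noise": every position holds the mask token.
    tokens = torch.full((seq_len,), MASK_ID, dtype=torch.long)
    for _ in range(steps):
        masked = tokens == MASK_ID
        if not masked.any():
            break
        logits = toy_model(tokens)              # predict every position in parallel
        logits[:, MASK_ID] = float("-inf")      # a real model never emits the mask token
        conf, pred = torch.softmax(logits, dim=-1).max(dim=-1)
        # Commit only the most confident masked positions; the rest stay masked
        # and are re-predicted ("denoised") in the next iteration.
        k = max(1, int(masked.sum()) // 2)
        conf = conf.masked_fill(~masked, -1.0)
        commit = conf.topk(k).indices
        tokens[commit] = pred[commit]
    return tokens

print(denoise_generate())
```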
To address these challenges, dInfer is designed specifically for diffusion language models and comprises four core modules: model access, the KV cache manager, the diffusion iteration manager, and the decoding strategy. This modular design lets developers combine and optimize each module flexibly, like assembling LEGO bricks, and evaluate the results in a standardized way on a unified platform.
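dInfer's actual interfaces are not reproduced here, so the sketch below only illustrates what such a four-module composition could look like in Python: each module sits behind a small interchangeable interface, and a pipeline wires them together, one denoising pass per step. All class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Protocol

# Hypothetical interfaces mirroring the four modules described above.
# These names do NOT come from dInfer's real API; they only show how a
# modular, LEGO-style pipeline can be assembled and swapped piece by piece.

class ModelAdapter(Protocol):          # "model access"
    def forward(self, tokens: Any, kv: Any) -> Any: ...

class KVCacheManager(Protocol):        # decides what cached state can be reused
    def refresh(self, tokens: Any) -> Any: ...

class IterationManager(Protocol):      # controls the diffusion iteration schedule
    def should_stop(self, step: int, tokens: Any) -> bool: ...

class DecodingStrategy(Protocol):      # turns logits into committed tokens
    def commit(self, logits: Any, tokens: Any) -> Any: ...

@dataclass
class DiffusionPipeline:
    model: ModelAdapter
    cache: KVCacheManager
    iterator: IterationManager
    decoder: DecodingStrategy

    def generate(self, tokens: Any, max_steps: int = 32) -> Any:
        for step in range(max_steps):
            kv = self.cache.refresh(tokens)            # reuse or rebuild cached states
            logits = self.model.forward(tokens, kv)    # one parallel denoising pass
            tokens = self.decoder.commit(logits, tokens)
            if self.iterator.should_stop(step, tokens):
                break
        return tokens
```

Under this kind of layout, swapping, say, the decoding strategy requires no change to the other three modules, which is the standardized-comparison benefit the modular design aims at.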
On a node equipped with eight NVIDIA H800 GPUs, dInfer performs exceptionally well. At comparable model quality, it reaches an average inference speed of 681 tokens per second, versus only 63.6 tokens per second for Fast-dLLM. It is also 2.5 times faster than the autoregressive model Qwen2.5-3B served on vLLM, the industry-leading inference framework.
Ant Group stated that the release of dInfer is an important step in connecting cutting-edge research with industrial applications. The company looks forward to working with developers and researchers around the world to explore the enormous potential of diffusion language models and to build a more efficient and open AI ecosystem.


