NVIDIA, in collaboration with researchers from MIT and the University of Hong Kong, has introduced Fast-dLLM, a framework that speeds up inference for diffusion-based large language models (diffusion LLMs) by up to 27.6x, opening new frontiers for the practical application of these models.

Diffusion models are considered strong competitors to autoregressive models: their bidirectional attention mechanism in principle allows them to generate multiple tokens simultaneously and thus decode faster. In practice, however, they often lag behind autoregressive models in inference speed, because every denoising step redundantly recomputes all attention states, driving up computational cost. Moreover, decoding multiple tokens at once can break the dependencies between them, hurting the quality of the generated text. Together, these issues have limited their practical adoption.

To overcome these bottlenecks, NVIDIA's research team built two core innovations into the Fast-dLLM framework: a block-wise approximate KV cache and a confidence-aware parallel decoding strategy. The KV cache divides the sequence into blocks, precomputes and stores the key-value activations of the blocks outside the one currently being decoded, and reuses them across denoising steps, cutting redundant computation. The DualCache variant goes further, caching both prefix and suffix tokens and exploiting the high similarity of activations between adjacent inference steps.
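To make the caching idea concrete, here is a minimal Python sketch of block-wise decoding with a frozen KV cache. Everything about the model interface is an assumption for illustration: `model.prefill_kv`, the `active=`/`past_kv=` keywords, and `MASK_ID` are hypothetical stand-ins, not Fast-dLLM's actual API.

```python
import torch

MASK_ID = 0  # hypothetical [MASK] token id; the real id depends on the model


def decode_with_block_cache(model, prompt_ids, gen_len, block_size=32):
    """Block-wise decoding with an approximate KV cache (illustrative only).

    The generation canvas is split into blocks. Before denoising a block,
    KV states for every position outside it are computed once and frozen;
    the approximation is that those states change little between adjacent
    denoising steps, so reusing them is nearly lossless.
    """
    canvas = torch.full((1, gen_len), MASK_ID, dtype=torch.long,
                        device=prompt_ids.device)
    seq = torch.cat([prompt_ids, canvas], dim=1)
    start = prompt_ids.shape[1]

    for lo in range(start, start + gen_len, block_size):
        hi = min(lo + block_size, start + gen_len)
        # One forward pass caches KV for all positions outside [lo, hi);
        # a DualCache-style variant stores prefix *and* suffix states.
        past_kv = model.prefill_kv(seq, active=(lo, hi))           # assumed API
        for _ in range(hi - lo):  # here: commit one token per step
            logits = model(seq, active=(lo, hi), past_kv=past_kv)  # assumed API
            probs = logits.softmax(dim=-1)
            conf, pred = probs.max(dim=-1)
            conf = conf.masked_fill(seq != MASK_ID, -1.0)
            pos = conf[0, lo:hi].argmax() + lo  # most confident masked slot
            seq[0, pos] = pred[0, pos]
    return seq
```

The cache is refreshed once per block rather than once per step, so the per-step cost drops roughly in proportion to the fraction of the sequence that sits outside the active block.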

Meanwhile, the confidence-aware decoding strategy decodes, in parallel, only those tokens whose confidence exceeds a set threshold. Deferring the rest avoids the dependency conflicts that arise when several uncertain tokens are sampled simultaneously, keeping the quality of the generated content intact.
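A single confidence-aware step is easy to sketch: commit every masked position whose top-1 probability clears the threshold, and fall back to the single most confident token when none does, so decoding always makes progress. The threshold value and `mask_id` below are placeholders for illustration, not the paper's settings.

```python
import torch


def confidence_parallel_unmask(logits, seq, mask_id=0, threshold=0.9):
    """One confidence-aware parallel decoding step (illustrative sketch)."""
    probs = logits.softmax(dim=-1)
    conf, pred = probs.max(dim=-1)           # per-position confidence / argmax
    masked = seq == mask_id
    conf = conf.masked_fill(~masked, -1.0)   # only consider still-masked slots

    accept = conf > threshold                # commit all confident positions
    if masked.any() and not accept.any():    # fall back: unmask the best one
        accept.view(-1)[conf.view(-1).argmax()] = True
    return torch.where(accept, pred, seq)


# Toy usage: one sequence of length 6 over a vocab of 10, slots 2-4 masked.
torch.manual_seed(0)
logits = torch.randn(1, 6, 10)
seq = torch.tensor([[5, 7, 0, 0, 0, 3]])
print(confidence_parallel_unmask(logits, seq))
```

The thresholding is what preserves quality: naive parallel sampling commits mutually dependent low-confidence tokens in the same step, while this rule only parallelizes decisions the model is already sure about.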

Fast-dLLM delivered impressive results across multiple benchmarks. On GSM8K, it achieved a striking 27.6x speedup when generating 1,024 tokens in the 8-shot setting, while reaching 76.0% accuracy. On the MATH benchmark it delivered a 6.5x speedup at roughly 39.3% accuracy, and on HumanEval and MBPP it reached 3.2x and 7.8x speedups, with accuracy at 54.3% and near-baseline levels, respectively.

Across these benchmarks, Fast-dLLM gives up only 1-2 percentage points of accuracy in exchange for its speedups, showing an excellent balance between speed and quality. The work gives diffusion models a much stronger footing in practical language generation tasks, enabling them to compete with autoregressive models and laying a solid foundation for broader adoption.