Recently, NVIDIA, in collaboration with the Massachusetts Institute of Technology (MIT) and the University of Hong Kong, released Fast-dLLM, a new framework designed to sharply accelerate inference for diffusion-based large language models (diffusion LLMs). It delivers speedups of up to 27.6x, giving these models a firmer footing in practical AI applications.
The Challenges and Opportunities of Diffusion Models
Diffusion models are regarded as strong challengers to traditional autoregressive models. Because they use bidirectional attention, they can in principle generate multiple tokens in parallel and therefore decode faster. In practice, however, they often trail autoregressive models in inference speed: every generation step must recompute attention states for the entire sequence (bidirectional attention rules out the standard causal KV cache), which drives up computational cost. In addition, when several tokens are decoded at once, the dependencies between them are easily broken, hurting the quality of the generated output.
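To make the recomputation cost concrete, here is a minimal toy sketch of a plain diffusion-style decoding loop. The `toy_forward` denoiser is a random stand-in invented purely for illustration (it is not a real model and not Fast-dLLM code); the point is only that every step re-runs a full forward pass over the whole sequence, and that each position is filled in independently of the others.

```python
# Toy sketch of a baseline diffusion-LLM decoding loop (illustration only).
# toy_forward is a random stand-in for a bidirectional denoiser, not a real model.
import torch

VOCAB, MASK_ID = 1000, -1

def toy_forward(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for a bidirectional denoiser: returns logits for every position."""
    torch.manual_seed(abs(int(tokens.sum())))   # deterministic toy behaviour
    return torch.randn(tokens.shape[0], VOCAB)

def naive_decode(prompt: list[int], gen_len: int, steps: int) -> list[int]:
    seq = torch.tensor(prompt + [MASK_ID] * gen_len)
    masked = [i for i, t in enumerate(seq.tolist()) if t == MASK_ID]
    per_step = max(1, len(masked) // steps)
    while masked:
        logits = toy_forward(seq)           # full attention recompute, every step
        for i in masked[:per_step]:         # unmask a fixed budget of positions
            seq[i] = logits[i].argmax()     # each position is filled independently
        masked = masked[per_step:]
    return seq.tolist()

print(naive_decode([5, 17, 42], gen_len=8, steps=4))
```

The independent per-position fill is exactly where the dependency problem comes from, and the repeated full forward pass is the cost that Fast-dLLM's caching targets.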
Innovations in the Fast-dLLM Framework
To address these issues, the NVIDIA team developed the Fast-dLLM framework around two key innovations: a block-wise approximate KV cache and a confidence-aware parallel decoding strategy.
1. **Block-wise Approximate KV Cache**: The sequence is divided into blocks, and the KV activations for each block are precomputed, stored, and reused across subsequent decoding steps, which sharply reduces redundant computation. The DualCache variant goes further and caches both prefix and suffix tokens, exploiting the high similarity between adjacent inference steps.
2. **Confidence-Aware Parallel Decoding**: At each step, only the tokens whose confidence exceeds a set threshold are decoded in parallel. This avoids the dependency conflicts that arise when all positions are sampled at once and preserves the quality of the generated output (see the sketch after this list).
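The sketch below shows one way these two ideas can be combined in a single decoding loop. The model interface (`compute_kv`, `denoise_with_cache`) is a hypothetical stand-in invented for illustration, not Fast-dLLM's actual API, and only the prefix cache is shown (DualCache additionally caches the suffix); treat it as a sketch of the control flow under those assumptions rather than a faithful implementation.

```python
# Sketch: block-wise approximate KV cache + confidence-aware parallel decoding.
# compute_kv / denoise_with_cache are hypothetical stand-ins, not NVIDIA's API.
import torch

VOCAB, MASK_ID = 1000, -1

def compute_kv(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for precomputing and storing KV activations of a context span."""
    torch.manual_seed(abs(int(tokens.sum())))
    return torch.randn(tokens.shape[0], 16)              # fake per-token KV states

def denoise_with_cache(block: torch.Tensor, cache: torch.Tensor) -> torch.Tensor:
    """Stand-in forward pass that attends to the cached context plus the block."""
    torch.manual_seed(abs(int(block.sum()) + int(cache.sum())))
    return torch.randn(block.shape[0], VOCAB).softmax(-1)

def decode(prompt: list[int], gen_len: int, block_size: int = 4, tau: float = 0.9):
    seq = torch.tensor(prompt + [MASK_ID] * gen_len)
    for start in range(len(prompt), len(seq), block_size):
        end = min(start + block_size, len(seq))
        # Block-wise approximate KV cache: computed once per block and reused
        # for every denoising step inside the block (suffix caching omitted).
        cache = compute_kv(seq[:start])
        while (seq[start:end] == MASK_ID).any():
            probs = denoise_with_cache(seq[start:end], cache)
            conf, pred = probs.max(dim=-1)
            masked = seq[start:end] == MASK_ID
            # Confidence-aware parallel decoding: accept every masked position
            # whose confidence clears the threshold tau, but always at least
            # the single most confident one so each step makes progress.
            accept = masked & (conf > tau)
            if not accept.any():
                accept[torch.where(masked, conf, torch.tensor(-1.0)).argmax()] = True
            seq[start:end] = torch.where(accept, pred, seq[start:end])
    return seq.tolist()

print(decode([5, 17, 42], gen_len=8))
```

The cache is deliberately approximate: it is refreshed only at block boundaries rather than at every step, trading a small amount of accuracy for large speed gains, in line with the benchmark trade-off described below.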
Outstanding Performance
Fast-dLLM performed strongly across benchmarks. On GSM8K (8-shot), it reached a 27.6x speedup when generating 1,024-token sequences, with 76.0% accuracy. On the MATH benchmark, it achieved a 6.5x speedup at roughly 39.3% accuracy. On HumanEval and MBPP, the speedups were 3.2x and 7.8x, with accuracy holding at about 54.3% and near baseline, respectively. Overall, Fast-dLLM trades only a 1-2 percentage-point drop in accuracy for these large gains in speed.
By tackling both inference efficiency and decoding quality, Fast-dLLM lets diffusion models compete with autoregressive models on real-world language generation tasks and lays the groundwork for broader adoption. As the technique is taken up more widely, we can expect to see more practical AI applications across a range of fields.
Project: https://nvlabs.github.io/Fast-dLLM/