Tencent's WeChat AI team has released a new diffusion language model framework: WeDLM (WeChat Diffusion Language Model). The model is designed to break through the parallel-inference efficiency limits of traditional large language models (such as the GPT series) and provide more efficient text generation.


WeDLM combines diffusion models with the standard causal attention mechanism through an innovative topological reordering technique. This makes WeDLM compatible with KV caching and removes the inference-speed bottleneck that bidirectional attention imposes on traditional diffusion models. The change not only speeds up inference but also preserves generation quality, with particular strength on complex reasoning tasks.
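The article doesn't detail WeDLM's implementation, but the core idea of topological reordering can be sketched: permute a partially decoded sequence so that all already-known tokens precede the still-masked positions. Under that ordering, a plain causal mask lets every masked position attend to every known token, so the known prefix can be KV-cached just like in an autoregressive model. Below is a minimal, hypothetical PyTorch illustration of the permutation step; the function name and the mask-token convention are assumptions for illustration, not WeDLM's actual API.

```python
import torch

MASK = -1  # hypothetical mask token id, not WeDLM's real convention

def topological_reorder(tokens: torch.Tensor, mask_id: int):
    """Permute a partially decoded sequence so that all known tokens
    come first and all masked positions come last.

    With this ordering, ordinary causal attention already lets every
    masked position attend to every known token, so the known prefix
    can be KV-cached like in an autoregressive model.
    """
    known = (tokens != mask_id).nonzero(as_tuple=True)[0]
    masked = (tokens == mask_id).nonzero(as_tuple=True)[0]
    perm = torch.cat([known, masked])  # new position order
    inverse = torch.argsort(perm)      # permutation that undoes the reorder
    return tokens[perm], perm, inverse

# Toy example: positions 2 and 4 are still masked.
seq = torch.tensor([11, 12, MASK, 14, MASK])
reordered, perm, inverse = topological_reorder(seq, MASK)
print(reordered)           # tensor([11, 12, 14, -1, -1])
print(reordered[inverse])  # original layout restored
```

In a full model, positional encodings (e.g., RoPE) would presumably still be computed from the original indices recorded in `perm`, so the reordering changes the attention structure without distorting positional information.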

In performance testing, WeDLM shows significant speed advantages. On the mathematical reasoning benchmark GSM8K, for example, WeDLM-8B runs inference about three times faster than an optimized autoregressive model (such as Qwen3-8B), and on low-entropy counting tasks the speedup can exceed 10x. At the same time, across multiple benchmarks (such as ARC, MMLU, and HellaSwag), WeDLM's generation quality is comparable to, or better than, traditional autoregressive baselines, indicating that the efficiency gains do not come at the cost of accuracy.

WeDLM's efficient inference makes it suitable for a range of scenarios, including intelligent customer service, code-assisted generation, and real-time Q&A. As it is adopted in practice, WeDLM is expected to reduce computing costs, improve user experience, and broaden the application of AI technology.

GitHub: https://github.com/tencent/WeDLM

Key Points:

- 🚀 WeDLM improves inference speed through topological reordering, removing the bidirectional-attention bottleneck of traditional diffusion models.

- 📊 On tasks such as GSM8K, WeDLM-8B is about three times faster than optimized autoregressive models.

- 💡 Suitable for multiple scenarios such as intelligent customer service and real-time Q&A, reducing computing costs and improving user experience.