At the recent ACL 2025 awards ceremony, a research paper co-authored by Dr. Wenfeng Liang of DeepSeek, together with Peking University and other institutions, won the Best Paper Award. The conference was unprecedented in scale: submissions nearly doubled to 8,360 papers, making the competition especially intense.
The paper introduces a new mechanism called Native Sparse Attention (NSA), which speeds up long-text processing by up to 11 times through joint optimization of the algorithm and the underlying hardware. More encouragingly, the speedup does not come at the cost of quality: the model's performance not only holds up but even surpasses that of traditional full-attention models. With this technique, the research team extended the context length to an impressive 1 million tokens, laying the groundwork for future advanced models.
The core of NSA is a dynamic hierarchical sparsity strategy built on three parallel attention branches that together capture the important information in a text: "compressed attention" summarizes global information, "selected attention" focuses on the most important token blocks, and "sliding-window attention" preserves the integrity of the local context. The design is flexible, is deeply optimized for modern GPU hardware, and is natively trainable end to end. A minimal sketch of how the three branch outputs might be combined is shown below.
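To make the three-branch design concrete, here is a minimal, hypothetical PyTorch sketch of how the outputs of the compressed, selected, and sliding-window branches could be combined through learned gates. The function name, tensor shapes, and gating details are illustrative assumptions for this article, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def nsa_combine(q, branches, gate_logits):
    """Hypothetical sketch of NSA's gated combination of three attention branches.

    q:           (batch, heads, q_len, d)  query states
    branches:    list of three (k, v) pairs, one per branch:
                 compressed (global summary) tokens, selected important
                 blocks, and the local sliding window; each k/v has shape
                 (batch, heads, n_branch, d), with n_branch much smaller
                 than the full sequence length.
    gate_logits: (batch, heads, q_len, 3)  learned per-branch gate scores.
    """
    gates = torch.sigmoid(gate_logits)              # one gate in [0, 1] per branch
    out = 0.0
    for i, (k, v) in enumerate(branches):
        # standard scaled dot-product attention within a single branch
        branch_out = F.scaled_dot_product_attention(q, k, v)
        out = out + gates[..., i:i + 1] * branch_out  # gate-weighted sum
    return out

# Toy usage: one decoding step over three short branch caches.
B, H, L, D = 1, 4, 1, 64
q = torch.randn(B, H, L, D)
branches = [(torch.randn(B, H, n, D), torch.randn(B, H, n, D)) for n in (32, 128, 256)]
gate_logits = torch.randn(B, H, L, 3)
out = nsa_combine(q, branches, gate_logits)          # shape (1, 4, 1, 64)
```

The key point the sketch illustrates is that each branch attends over a much shorter set of keys and values than the full sequence, and a small learned gate decides how much each branch contributes per query, which is where the speedup over full attention comes from.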
In testing on 64k-token sequences, NSA sped up decoding by 11.6 times, and the forward and backward passes by 9 times and 6 times respectively. More importantly, NSA held up across benchmarks: a 27B-parameter model outperformed the full-attention baseline on 7 of 9 evaluations, with clear advantages on complex tasks such as multi-hop question answering and code understanding.
This research opens up new possibilities for long-text processing, delivering gains in both speed and accuracy and pointing to broad application prospects for the NSA mechanism in AI.
Paper URL: https://arxiv.org/pdf/2502.11089