The AntBelle large model team recently announced the open-source release of its new efficient inference model, Ring-mini-sparse-2.0-exp. Built on the Ling 2.0 architecture and optimized for long-sequence decoding, the model adopts an innovative sparse attention mechanism.

The new architecture integrates a high-sparsity-ratio Mixture of Experts (MoE) structure with sparse attention, aiming to improve performance in complex long-sequence reasoning scenarios.
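To make the "high-sparsity MoE" idea concrete, here is a minimal sketch of top-k expert routing, where each token activates only a small fraction of the experts. The router, dimensions, and expert count here are illustrative assumptions, not the Ling 2.0 design.

```python
import torch
import torch.nn.functional as F

def sparse_moe_layer(x, expert_weights, top_k=2):
    """Illustrative top-k MoE routing (assumed, not the released design).

    x: (tokens, d); expert_weights: list of (d, d) matrices, one per expert.
    Each token is routed to its top_k experts; all other experts stay idle,
    which is what makes the layer "sparse".
    """
    num_experts = len(expert_weights)
    # Stand-in random router; in a real model this projection is learned.
    router = torch.randn(x.shape[1], num_experts)
    logits = x @ router                              # (tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    top_p, top_idx = probs.topk(top_k, dim=-1)       # pick top_k experts per token
    top_p = top_p / top_p.sum(dim=-1, keepdim=True)  # renormalize gate weights
    out = torch.zeros_like(x)
    for slot in range(top_k):
        for e in range(num_experts):
            mask = top_idx[:, slot] == e             # tokens routed to expert e
            if mask.any():
                out[mask] += top_p[mask, slot, None] * (x[mask] @ expert_weights[e])
    return out
```

With, say, 2 of 64 experts active per token, the per-token compute stays close to a small dense model while total parameter count is much larger.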


The team stated that, thanks to deep co-optimization of the architecture and the inference framework, Ring-mini-sparse-2.0-exp achieves nearly three times the long-sequence throughput of its predecessor, Ring-mini-2.0.

Across multiple high-difficulty reasoning benchmarks, the model maintained state-of-the-art (SOTA) performance, demonstrating strong long-context handling and efficient reasoning, and offering the open-source community a new lightweight option.

The Ling 2.0 Sparse architecture targets two core trends in the development of large language models: growing context lengths and test-time scaling. The team drew on the design of Mixture of Block Attention (MoBA), adopting block-wise sparse attention: the input keys and values are partitioned into blocks, and each query head selects its top-k blocks.

Softmax attention is then computed only over the selected blocks, significantly reducing computational cost. The team further combined the MoBA design with Grouped Query Attention (GQA), letting query heads within the same group share the top-k block selection, thereby reducing I/O costs.
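The mechanism described above can be sketched as follows: score each KV block per query (here with the block's mean key, a common gating heuristic), average the scores within each GQA group so all heads in a group pick the same top-k blocks, and run softmax attention over only those blocks. Block scoring, shapes, and parameters are illustrative assumptions, not the released implementation.

```python
import torch

def block_sparse_attention(q, k, v, block_size=32, top_k=2, group_size=2):
    """MoBA-style block-sparse attention with GQA-shared block selection (sketch).

    q, k, v: (num_heads, seq_len, d); seq_len must be divisible by block_size.
    K/V are given per query head here for simplicity.
    """
    num_heads, seq_len, d = q.shape
    num_blocks = seq_len // block_size
    # Partition keys/values into blocks: (H, num_blocks, block_size, d)
    k_blocks = k.view(num_heads, num_blocks, block_size, d)
    v_blocks = v.view(num_heads, num_blocks, block_size, d)
    # Gate: score each block by its mean key (assumed heuristic).
    block_keys = k_blocks.mean(dim=2)                   # (H, num_blocks, d)
    gate = torch.einsum('hqd,hbd->hqb', q, block_keys)  # (H, Q, num_blocks)
    # GQA-style sharing: average gate scores within each head group so every
    # head in the group selects the same blocks, cutting KV I/O.
    gate = gate.view(num_heads // group_size, group_size, seq_len, num_blocks)
    gate = gate.mean(dim=1, keepdim=True).expand(-1, group_size, -1, -1)
    gate = gate.reshape(num_heads, seq_len, num_blocks)
    topk = gate.topk(min(top_k, num_blocks), dim=-1).indices  # (H, Q, top_k)
    out = torch.zeros_like(q)
    scale = d ** -0.5
    for h in range(num_heads):
        for t in range(seq_len):
            blocks = topk[h, t]
            # Softmax attention over the selected blocks only.
            ks = k_blocks[h, blocks].reshape(-1, d)
            vs = v_blocks[h, blocks].reshape(-1, d)
            attn = torch.softmax((q[h, t] @ ks.T) * scale, dim=-1)
            out[h, t] = attn @ vs
    return out
```

With top_k blocks out of seq_len / block_size, attention cost per query drops from O(seq_len) to O(top_k × block_size), which is where the long-sequence throughput gain comes from.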

GitHub: https://github.com/inclusionAI/Ring-V2/tree/main/moba

Key Points:   

🌟 The new model Ring-mini-sparse-2.0-exp excels at long-sequence reasoning, with nearly triple the throughput of its predecessor.   

🔍 The model adopts an innovative sparse attention mechanism, balancing efficient reasoning and context processing capabilities.   

📥 The model is open-sourced on multiple platforms, making it convenient for the community to apply and research.