As global artificial intelligence development accelerates, the speed and efficiency of model inference have become focal points. Recently, Huawei's mathematical team introduced a new technology called FlashComm during the DeepSeek open-source cycle. The technology aims to significantly improve large model inference performance through three innovations, delivering speedups of up to 80%.

Firstly, FlashComm optimizes the AllReduce communication operation. Traditional AllReduce is like a container truck that always carries a full load, with no flexibility. Huawei's team decomposes the operation into two stages, a ReduceScatter followed by an AllGather, so the computation in between runs on only a fraction of the data. This reorganization reduces the subsequent communication volume by 35% and cuts the key computation to 1/8 of the original, improving inference performance by 22% to 26%.
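
Below is a minimal PyTorch sketch of the decomposition idea, assuming an already-initialized process group; the function name and the placement of the per-shard work are illustrative, not Huawei's actual implementation:

```python
import torch
import torch.distributed as dist

def decomposed_all_reduce(x: torch.Tensor) -> torch.Tensor:
    """Replace one AllReduce with ReduceScatter + AllGather.

    Assumes x.shape[0] is divisible by the world size.
    """
    world_size = dist.get_world_size()
    # ReduceScatter: each rank receives the reduced sum of
    # one 1/world_size shard of x.
    shard = torch.empty(
        x.shape[0] // world_size, *x.shape[1:],
        dtype=x.dtype, device=x.device,
    )
    dist.reduce_scatter_tensor(shard, x)
    # Per-token work placed here (normalization, quantization, ...)
    # now touches only 1/world_size of the data -- with 8-way
    # parallelism, 1/8 of the original computation.
    out = torch.empty_like(x)
    dist.all_gather_into_tensor(out, shard)
    return out
```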

Secondly, Huawei found during inference that adjusting the parallel dimensions of matrix multiplication can relieve the communication burden. By flattening three-dimensional tensors into two-dimensional matrices while preserving result accuracy, and combining this with INT8 quantization, the volume of transmitted data drops by 86% and overall inference speed increases by 33%. The strategy is akin to repacking bulky goods into smaller containers, making data transmission more efficient.
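
A rough sketch of the flatten-then-quantize step, assuming a symmetric per-tensor INT8 scale (the article does not spell out Huawei's exact quantization scheme):

```python
import torch

def flatten_and_quantize(x: torch.Tensor):
    """Flatten a [batch, seq, hidden] activation to 2-D, then quantize
    to INT8 so each element travels as 1 byte instead of 2 (FP16)."""
    x2d = x.reshape(-1, x.shape[-1])                       # [batch*seq, hidden]
    scale = x2d.abs().amax().clamp(min=1e-8) / 127.0       # per-tensor scale
    q = torch.clamp((x2d / scale).round(), -128, 127).to(torch.int8)
    return q, scale                                        # send q plus one scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an FP16 approximation on the receiving side."""
    return q.to(torch.float16) * scale
```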

Finally, Huawei's multi-stream parallelism technology breaks the limits of traditional serial computation. During MoE model inference, Huawei's team dissects and reorganizes the computation workflow, using the multi-stream engines of Ascend hardware to run three computational streams precisely in parallel. This lets one batch of data enter the expert-computation stage while another batch simultaneously goes through the gating decision, maximizing computational efficiency.
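
Conceptually, the overlap resembles the two-stream sketch below, which uses CUDA streams as a stand-in for Ascend's multi-stream engines; `gate` and `experts` are hypothetical callables for the gating and expert stages:

```python
import torch

def pipelined_moe_step(tokens_a, tokens_b, gate, experts):
    """Overlap batch A's expert computation with batch B's gating
    decision on two hardware streams (CUDA streams here, standing in
    for Ascend's multi-stream engines)."""
    s_expert = torch.cuda.Stream()
    s_gate = torch.cuda.Stream()
    with torch.cuda.stream(s_expert):
        out_a = experts(tokens_a)      # batch A: expert FFNs
    with torch.cuda.stream(s_gate):
        routes_b = gate(tokens_b)      # batch B: gating, overlapped
    torch.cuda.synchronize()           # join both streams before use
    return out_a, routes_b
```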

The release of FlashComm marks a significant technical breakthrough for Huawei in large model inference. It not only accelerates model inference but also propels AI applications forward, opening new opportunities in scientific research and industry.