Recently, Huawei demonstrated an astonishing breakthrough with its newly launched "Ascend + Pangu Ultra MoE" system: a mixture-of-experts (MoE) large model with nearly one trillion parameters was able to understand and answer an advanced mathematics problem in just 2 seconds. All of this was achieved without any GPUs, showcasing Huawei's strength in independently controlled, domestically developed computing power and model training.
On the technical side, Huawei's team raised the overall performance of the training system by intelligently selecting parallelism strategies and optimizing computation and communication, significantly improving cluster training efficiency. In its technical report, Huawei detailed several innovations implemented on the CloudMatrix384 supernode, including improved communication mechanisms and load-balancing strategies. These innovations reduced the expert-parallel communication overhead of large-scale MoE training to nearly zero while keeping computational loads well balanced.
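The report does not disclose Huawei's exact balancing mechanism, but the general idea behind MoE load balancing can be illustrated with a standard auxiliary loss of the kind popularized by Switch Transformer routing. The sketch below is a minimal NumPy illustration only; the function name, token count, and expert count are illustrative assumptions, not details from Huawei's system.

```python
import numpy as np

def load_balancing_loss(router_logits: np.ndarray, num_experts: int) -> float:
    """Switch-Transformer-style auxiliary loss that pushes the router
    toward a uniform token-to-expert distribution.

    router_logits: (num_tokens, num_experts) raw router scores.
    Returns a scalar that is minimized (value 1.0) at perfect balance.
    """
    # Softmax over experts gives each token's routing probabilities.
    probs = np.exp(router_logits - router_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    # f_i: fraction of tokens whose top-1 choice is expert i.
    top1 = probs.argmax(axis=-1)
    f = np.bincount(top1, minlength=num_experts) / len(top1)

    # P_i: mean routing probability assigned to expert i.
    p = probs.mean(axis=0)

    # N * sum(f_i * P_i); penalizes routers that overload a few experts.
    return float(num_experts * np.dot(f, p))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(1024, 8))  # hypothetical: 1024 tokens, 8 experts
    print(f"balance loss: {load_balancing_loss(logits, 8):.3f}")
```

Adding a term like this to the training objective nudges the router away from overloading any single expert, which is what keeps per-device computation balanced when experts are sharded across a cluster.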
Additionally, Huawei made significant progress in raising single-node compute utilization. By optimizing the execution of training operators, the team doubled the micro-batch size and resolved efficiency bottlenecks in operator dispatch. This means the system can make fuller use of existing hardware resources when handling complex computational tasks.
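To make the micro-batch claim concrete: in large-scale training, a fixed global batch is typically split into micro-batches whose gradients are accumulated before each optimizer step, so freeing enough memory to double the micro-batch size halves the number of forward-backward passes per step. The following is a minimal PyTorch sketch of that trade-off; all batch sizes and model dimensions are assumed for illustration and do not come from Huawei's report.

```python
import torch
from torch import nn

# Hypothetical numbers: a fixed global batch of 64 samples can be split
# into 8 micro-batches of 8, or, with more memory headroom, into
# 4 micro-batches of 16. Fewer, larger micro-batches mean fewer passes
# (and fewer kernel launches) per optimizer step for the same gradient.
GLOBAL_BATCH, MICRO_BATCH = 64, 16
ACCUM_STEPS = GLOBAL_BATCH // MICRO_BATCH

model = nn.Linear(128, 10)  # stand-in for a real transformer stack
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(GLOBAL_BATCH, 128)
y = torch.randint(0, 10, (GLOBAL_BATCH,))

opt.zero_grad()
for i in range(ACCUM_STEPS):
    xb = x[i * MICRO_BATCH:(i + 1) * MICRO_BATCH]
    yb = y[i * MICRO_BATCH:(i + 1) * MICRO_BATCH]
    # Scale so the accumulated gradient matches one full-batch backward.
    loss = loss_fn(model(xb), yb) / ACCUM_STEPS
    loss.backward()
opt.step()
```

Because each micro-batch incurs its own round of operator launches, doubling the micro-batch size also cuts dispatch overhead roughly in half per optimizer step, which is consistent with the two gains being reported together.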
Taken together, this series of technological innovations not only markedly improved the training efficiency of MoE models but also opened up new possibilities for training and applying future large-scale AI models.