Cambricon has announced that it completed "Day 0" adaptation of DeepSeek's latest open-source DeepSeek-V4 model series on the vLLM inference framework. The adaptation covers both the 285B-parameter Flash version and the 1.6T-parameter Pro version, ensuring that the models ran stably on Cambricon hardware platforms on release day. The corresponding adaptation code has been open-sourced to the GitHub community.

To handle DeepSeek-V4's distinctive sparse-attention and compression structure, Cambricon accelerated core modules such as the Compressor with its in-house vector fusion operator library, Torch-MLU-Ops. Using its high-performance programming language BangC, the Cambricon team wrote highly optimized kernels for hot operators such as sparse attention and GroupGemm, and added full support in the vLLM framework for five-dimensional hybrid parallelism (TP/PP/SP/DP/EP), low-precision quantization, and prefill-decode (PD) disaggregated deployment. Together, these techniques significantly raise end-to-end inference token throughput while staying within latency constraints.
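DeepSeek-V4's exact sparse-attention scheme is not detailed here, but the general shape of the "hot operator" being optimized can be illustrated with a generic top-k sparse attention sketch in NumPy: each query attends only to its highest-scoring keys, so a fused kernel can skip most of the score matrix. The function name and the top-k formulation are illustrative assumptions, not Cambricon's or DeepSeek's actual implementation.

```python
import numpy as np

def topk_sparse_attention(q, k, v, top_k):
    """Single-head top-k sparse attention (illustrative sketch).

    Each query attends only to its top_k highest-scoring keys; the rest
    are masked out before the softmax. A fused device kernel would avoid
    materializing the full score matrix, which this NumPy version does
    only for clarity.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                      # (n_q, n_k) raw scores
    # Indices of the top_k largest scores per query row.
    idx = np.argpartition(scores, -top_k, axis=-1)[:, -top_k:]
    # Additive mask: 0 at kept positions, -inf elsewhere.
    mask = np.full_like(scores, -np.inf)
    np.put_along_axis(mask, idx, 0.0, axis=-1)
    masked = scores + mask
    # Numerically stable softmax over the surviving entries.
    masked -= masked.max(axis=-1, keepdims=True)
    probs = np.exp(masked)                             # exp(-inf) -> 0
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v
```

When `top_k` equals the number of keys, the mask is all zeros and the result reduces to ordinary dense softmax attention, which is a convenient sanity check for a hand-written kernel.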

On the hardware side, Cambricon leverages the MLU's memory-access and sorting acceleration features to handle DeepSeek-V4's complex indexing structure efficiently. Combined with high interconnect bandwidth and low-latency communication, the solution minimizes communication overhead in both Prefill and Decode scenarios, improving inference utilization.
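One generic way sorting hardware helps with complex indexing is sort-assisted gathering: scattered lookups into a large cache are first sorted so memory is visited in near-sequential order, then the original request order is restored with the inverse permutation. The sketch below shows this pattern in NumPy; it is an assumption-laden illustration of the general technique, not a description of the MLU's actual hardware paths.

```python
import numpy as np

def sorted_gather(cache, indices):
    """Gather rows of `cache` at `indices` via sorted (ascending) addresses.

    Visiting memory in sorted order improves locality for scattered
    lookups; the inverse permutation then restores the caller's original
    request order, so the result equals a direct cache[indices] gather.
    """
    order = np.argsort(indices, kind="stable")   # permutation sorting the indices
    gathered = cache[indices[order]]             # near-sequential reads
    inverse = np.empty_like(order)
    inverse[order] = np.arange(order.size)       # inverse permutation
    return gathered[inverse]                     # back to original order
```

On CPUs this reordering is usually not worth the extra sort, but on accelerators with dedicated sorting units and wide memory transactions, coalescing the reads can pay for the permutation overhead.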

Industry analysis indicates that DeepSeek-V4, with its ultra-long context of one million (1M) words and top-tier logical-reasoning performance, places strict demands on the underlying computing architecture. Cambricon's same-day adaptation not only demonstrates that domestic computing platforms can support ultra-large, structurally complex models, but also signals that the domestic AI industry chain has reached maturity in software-hardware co-design, providing an efficient computing foundation for the broad adoption of large-model applications.