With the same computing power and data, why do some models perform better? Moonshot AI offers a fundamental answer.

On March 16, Kimi released a major technical report, "Attention Residuals." The research rethinks a "foundation" that large models have relied on since 2015: residual connections (Residual Connections). Experiments show that, at equal computing power, the new method matches the performance of a baseline model trained with 1.25 times the compute.


The breakthrough quickly caused a stir in the Silicon Valley AI community, with researchers publicly praising it on social media as "Impressive work from Kimi."

Jerry Tworek (a lead creator of OpenAI's o1) called it the beginning of "Deep Learning 2.0."

Andrej Karpathy (OpenAI founding member) said the industry still has room to deepen its understanding of "Attention Is All You Need."

Why modify the “time-honored foundation”?

Traditional residual connections solved the problem of training deep networks, but their "equal addition" (x + F(x)) is crude: as the network deepens, each layer's new contribution tends to be drowned out by the accumulated signal, leaving many intermediate layers doing little useful work.
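To make the "equal addition" point concrete, here is a minimal NumPy sketch of a standard residual stack. It is purely illustrative (the layer function and dimensions are invented, not from the report): every layer's output is added to the running stream with the same unit weight, so deep in the network a single layer's contribution is small relative to the accumulated sum.

```python
import numpy as np

def sub_layer(x, W):
    """A toy sub-layer F: a linear map followed by a nonlinearity."""
    return np.tanh(x @ W)

def residual_stack(x, weights):
    """Standard residual connections: x_{l+1} = x_l + F(x_l).
    Each layer's output is added with equal (unit) weight, so the
    stream is a plain running sum over depth."""
    for W in weights:
        x = x + sub_layer(x, W)  # "equal addition"
    return x

rng = np.random.default_rng(0)
d = 8  # hidden size (arbitrary for illustration)
weights = [rng.standard_normal((d, d)) * 0.1 for _ in range(4)]
x0 = rng.standard_normal(d)
out = residual_stack(x0, weights)
print(out.shape)  # (8,)
```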


Kimi's “Elegant Rotation”:

The team observed that information loss along the depth dimension closely mirrors forgetting along the time dimension in RNNs. So they took the attention mechanism, originally applied horizontally across a text sequence, and rotated it 90 degrees to operate vertically across the network's depth.

With this change, each layer no longer passively receives the accumulated stream; instead, through a small "query vector," it actively and selectively decides how much information to draw from earlier layers. To control the memory overhead in large-scale training, the team also proposed a Block AttnRes scheme that partitions the network into blocks, preserving the performance gain while keeping the increase in inference latency under 2%.
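The mechanism described above can be sketched as follows. This is an assumption-laden toy in NumPy, not the paper's actual formulation: the query projection, the use of layer outputs as both keys and values, and all dimensions are invented for illustration. The key idea it demonstrates is that each new layer attends over the outputs of all previous layers instead of simply adding to a running sum.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def depth_attention_step(history, q_proj, layer_fn):
    """One layer of attention over the depth dimension (illustrative).
    Instead of x + F(x), the layer forms a small query from its input
    and softmax-weights the outputs of all previous layers, selectively
    deciding how much to extract from each."""
    x = history[-1]
    q = x @ q_proj                       # this layer's query vector
    keys = np.stack(history)             # earlier layers' outputs as keys/values
    scores = keys @ q / np.sqrt(len(q))  # scaled dot-product over depth
    weights = softmax(scores)
    context = weights @ keys             # selectively pooled depth information
    history.append(context + layer_fn(context))
    return history

rng = np.random.default_rng(1)
d = 8  # hidden size (arbitrary for illustration)
q_proj = rng.standard_normal((d, d)) * 0.1
W = rng.standard_normal((d, d)) * 0.1
layer_fn = lambda v: np.tanh(v @ W)

history = [rng.standard_normal(d)]
for _ in range(4):
    history = depth_attention_step(history, q_proj, layer_fn)
print(len(history), history[-1].shape)  # 5 (8,)
```

A block-partitioned variant in the spirit of Block AttnRes would simply restrict `history` to the layers within the current block, bounding the memory cost of storing past layer outputs.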


In experiments, the architecture showed strong generalization: a 7.5% improvement on the GPQA-Diamond science-reasoning benchmark, and gains of 3.6% and 3.1% on math and code-generation tasks, respectively.


As the founder said in his talk at GTC 2026, the industry is running up against the limits of scaling and must rebuild foundational components such as optimizers and residual connections. While most are still busy with "high-level renovation," Kimi chose to dig down to the deepest layer, placing a decisive bet on the future of deep learning.