Traditional Transformer models often seem "wasteful" when handling repetitive knowledge: they must recompute the same patterns every time, which consumes network depth and burns compute. To break through this bottleneck, the DeepSeek research team recently unveiled an innovative module called Engram, which adds an efficient "conditional memory axis" to sparse large language models (LLMs).

Unlike existing Mixture of Experts (MoE) designs, Engram is not meant to replace MoE but to complement it, modernizing the classic N-gram embedding technique into a scalable lookup store with $O(1)$ query complexity. In simple terms, Engram acts like a "quick memory book" for the model, dedicated to storing common phrases, entities, and other static patterns, so the model's core network can focus on more complex reasoning and long-range interactions.
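To make the $O(1)$ lookup idea concrete, here is a minimal sketch of a hashed N-gram embedding table. It is an illustration only, not DeepSeek's actual implementation; the names (`engram_bank`, `hash_ngram`, `engram_lookup`) and the sizes (`NUM_BUCKETS`, `EMBED_DIM`, `NGRAM_ORDER`) are assumptions chosen for the demo.

```python
import numpy as np

# Sketch of a hashed N-gram lookup bank: each N-gram of recent token ids is
# hashed into a fixed-size embedding table, so retrieval cost is O(1) per query
# no matter how many static patterns are stored.

NUM_BUCKETS = 1 << 16   # size of the embedding bank (kept small for the demo)
EMBED_DIM = 256         # embedding width (assumption)
NGRAM_ORDER = 3         # look up trigrams of token ids (assumption)

rng = np.random.default_rng(0)
engram_bank = rng.normal(scale=0.02, size=(NUM_BUCKETS, EMBED_DIM))

def hash_ngram(token_ids: tuple[int, ...]) -> int:
    """Deterministically map an N-gram of token ids to a bucket index (FNV-style mixing)."""
    h = 14695981039346656037  # 64-bit FNV offset basis
    for t in token_ids:
        h ^= t
        h *= 1099511628211     # 64-bit FNV prime
        h &= (1 << 64) - 1
    return h % NUM_BUCKETS

def engram_lookup(context_ids: list[int]) -> np.ndarray:
    """Return the stored embedding for the most recent N-gram in O(1)."""
    ngram = tuple(context_ids[-NGRAM_ORDER:])
    return engram_bank[hash_ngram(ngram)]

# Usage: retrieve a static-pattern embedding for the last 3 tokens of a context.
vec = engram_lookup([101, 2054, 2003, 1996])
print(vec.shape)  # (256,)
```

In a trained system the bank entries would be learned rather than random, but the key property shown here is that the cost of a query does not grow with the number of stored patterns.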
In practical applications, the gains are measurable: under the same compute budget, 27B and 40B models incorporating Engram outperformed comparable MoE baselines on benchmarks such as MMLU, math, and code.
Engram also performs well on long texts. With the context window extended to 32,768 tokens, the Engram model showed stronger accuracy on tasks such as multi-query "needle-in-a-haystack" (NIAH) and variable tracking. By offloading static reconstruction to the lookup module, the design not only enlarges the model's knowledge store but also frees up effective network depth, making the model both smarter and more efficient.
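The "freeing up depth" intuition can be illustrated with a small, hypothetical sketch: a retrieved memory vector is blended into a layer's hidden state through a learned gate, so later layers need not rebuild the static pattern themselves. This is not DeepSeek's published integration scheme; the gate matrix `W_gate` and the `fuse` function are illustrative assumptions.

```python
import numpy as np

# Hypothetical fusion of a retrieved Engram vector into a transformer hidden
# state via an elementwise learned gate. Later layers can then spend their
# capacity on reasoning instead of re-deriving the static pattern.

EMBED_DIM = 256
rng = np.random.default_rng(1)
W_gate = rng.normal(scale=0.02, size=(2 * EMBED_DIM, EMBED_DIM))  # learned in practice

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def fuse(hidden: np.ndarray, memory: np.ndarray) -> np.ndarray:
    """Blend the hidden state with the retrieved memory using an elementwise gate."""
    gate = sigmoid(np.concatenate([hidden, memory]) @ W_gate)
    return gate * memory + (1.0 - gate) * hidden

hidden = rng.normal(size=EMBED_DIM)   # current layer activation (toy data)
memory = rng.normal(size=EMBED_DIM)   # vector retrieved from the Engram bank
print(fuse(hidden, memory).shape)     # (256,)
```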
Key points:
🧠 Innovative Architecture: DeepSeek introduced the Engram module, which efficiently retrieves static knowledge through an $O(1)$ hash lookup, allowing the model's core to focus on logical reasoning.
📈 Performance Leap: With the same computing resources, the 27B and 40B models incorporating Engram outperformed traditional MoE architectures on key benchmarks such as MMLU, math, and code.
📑 Enhanced Long-Context Processing: The technique significantly improves the model's recall in long-context settings, performing well in 32k-length tests while lowering the layer-wise loss incurred during prediction.



