The NVIDIA research team recently announced Jet-Nemotron, a new family of language models (in 2B and 4B parameter versions) that generates text up to 53.6 times faster than leading full-attention language models while matching or exceeding their accuracy. The breakthrough comes not from retraining from scratch, but from a new technique called PostNAS (Post Neural Architecture Search) that retrofits existing pre-trained models.


Modern language models such as Qwen3, Llama3.2, and Gemma3 have set new benchmarks in accuracy and flexibility, but their O(n²) self-attention mechanism incurs steep compute and memory costs, especially on long inputs. That makes large-scale deployment expensive and all but rules out edge devices or other memory-constrained hardware. Efforts to replace the full-attention Transformer with more efficient architectures (such as Mamba2, GLA, and RWKV) had, until now, struggled to match its accuracy.
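To make the quadratic term concrete, here is a tiny PyTorch sketch (the sizes are illustrative and not taken from any model in this article): the attention score matrix alone holds n² entries, so doubling the context length quadruples that cost.

```python
import torch

# Illustrative sizes only: n tokens of context, d channels per head.
n, d = 4096, 64
q, k = torch.randn(n, d), torch.randn(n, d)

scores = q @ k.T        # attention score matrix is (n, n)
print(scores.numel())   # 16,777,216 entries at n = 4096
# At n = 8192 the matrix holds ~67M entries: the O(n^2) growth that
# makes long-context inference expensive for full attention.
```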

PostNAS, the core innovation behind Jet-Nemotron, proceeds in three steps: first, take an advanced full-attention model (such as Qwen2.5) and freeze its multi-layer perceptron (MLP) layers, preserving the model's learned knowledge and sharply reducing training cost; next, replace the computationally expensive full-attention modules with JetBlock, a new hardware-efficient linear attention module; finally, use hypernetwork training and beam search to automatically determine the layer positions where full attention should be kept to preserve accuracy on specific tasks.
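To picture the retrofit, here is a minimal PyTorch sketch under loudly stated assumptions: `Block`, `LinearAttentionStandIn`, `postnas_retrofit`, and `keep_full_attn` are hypothetical names invented for illustration; the stand-in is a generic kernelized linear attention, not the actual JetBlock; and the hypernetwork-plus-beam-search stage is represented only by its output, the set of layer indices kept as full attention.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Toy transformer block: full attention followed by an MLP."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        a, _ = self.attn(x, x, x)
        return x + self.mlp(x + a)

class LinearAttentionStandIn(nn.Module):
    """Hypothetical stand-in for JetBlock (its real interface is not shown
    here): kernelized attention whose cost is linear in sequence length.
    Returns (output, None) to mirror nn.MultiheadAttention's signature."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, xq, xk, xv):
        q = self.q(xq).softmax(dim=-1)   # feature map over channels
        k = self.k(xk).softmax(dim=-2)   # feature map over positions
        kv = torch.einsum("bnd,bne->bde", k, self.v(xv))  # fixed-size state
        return torch.einsum("bnd,bde->bne", q, kv), None

def postnas_retrofit(blocks, keep_full_attn, dim):
    """Freeze every MLP (step 1) and swap in linear attention (step 2),
    except at the layer positions the search kept as full attention (step 3)."""
    for i, blk in enumerate(blocks):
        for p in blk.mlp.parameters():
            p.requires_grad = False
        if i not in keep_full_attn:
            blk.attn = LinearAttentionStandIn(dim)
    return blocks

# Usage: retrofit an 8-layer toy model, keeping layers 2 and 5 full-attention.
blocks = nn.ModuleList([Block(64) for _ in range(8)])
blocks = postnas_retrofit(blocks, keep_full_attn={2, 5}, dim=64)
x = torch.randn(1, 16, 64)
for blk in blocks:
    x = blk(x)
print(x.shape)  # torch.Size([1, 16, 64])
```

The point of the sketch is the division of labor: the frozen MLPs carry the pre-trained knowledge over unchanged, so only the new attention modules need training.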

Jet-Nemotron's performance figures are impressive: the 2B model matches or beats Qwen3-1.7B-Base on major benchmarks while delivering up to 47 times higher generation throughput, and at a 256K context length decoding is 53.6 times faster, cutting inference cost by about 98%. That is a transformative change for deployment on edge devices.
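As a sanity check on those numbers (our assumption, not a calculation from the article): if serving cost scales inversely with decoding throughput on fixed hardware, a 53.6× speedup corresponds to roughly a 98% cost reduction.

```python
# Assumes cost is proportional to GPU-time, i.e. inversely proportional
# to decoding throughput on the same hardware.
speedup = 53.6
print(f"cost reduction: {1 - 1 / speedup:.1%}")  # cost reduction: 98.1%
```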

In addition, Jet-Nemotron means enterprises can achieve a higher return on investment at lower cost. Practitioners can retrofit existing models without changing their data pipelines, strengthening real-time AI services; for researchers, PostNAS lowers the cost of experimenting with language model architectures, accelerating the development of AI technology.

Project: https://github.com/NVlabs/Jet-Nemotron

Key Points:   

🌟 Jet-Nemotron delivers up to a 53.6× generation speedup and a ~98% reduction in inference cost compared to leading full-attention models.   

💻 The PostNAS technology allows efficient retrofitting of existing pre-trained models while maintaining accuracy.   

📈 The release of the new models gives enterprises and researchers gains on both cost and performance.