The NVIDIA research team recently announced Jet-Nemotron, a new family of language models (in 2B and 4B parameter versions) that generates text up to 53.6 times faster than leading full-attention language models while matching or exceeding their accuracy. The breakthrough comes not from retraining from scratch, but from a new technique called PostNAS (Post Neural Architecture Search) that retrofits existing pre-trained models.


Modern language models such as Qwen3, Llama3.2, and Gemma3 have set new benchmarks in accuracy and flexibility, but their O(n²) self-attention mechanism incurs steep compute and memory costs, especially on long inputs. That makes large-scale deployment expensive and all but rules out edge devices or other memory-constrained hardware. Efforts to replace the full-attention Transformer with more efficient architectures (such as Mamba2, GLA, and RWKV) had, until now, struggled to match its accuracy.
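To make the quadratic term concrete, here is a tiny PyTorch sketch (the sizes are illustrative and not taken from any model in this article): the attention score matrix alone holds n² entries, so doubling the context length quadruples that cost.

```python
import torch

# Illustrative sizes only: n tokens of context, d channels per head.
n, d = 4096, 64
q, k = torch.randn(n, d), torch.randn(n, d)

scores = q @ k.T        # attention score matrix is (n, n)
print(scores.numel())   # 16,777,216 entries at n = 4096
# At n = 8192 the matrix holds ~67M entries: the O(n^2) growth that
# makes long-context inference expensive for full attention.
```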

PostNAS, the core innovation behind Jet-Nemotron, proceeds in three steps: first, take an advanced full-attention model (such as Qwen2.5) and freeze its multi-layer perceptron (MLP) layers, preserving the model's learned knowledge and sharply reducing training cost; next, replace the computationally expensive full-attention modules with JetBlock, a new hardware-efficient linear attention module; finally, use hypernetwork training and beam search to automatically determine the layer positions where full attention should be kept to preserve accuracy on specific tasks.
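To picture the retrofit, here is a minimal PyTorch sketch under loudly stated assumptions: `Block`, `LinearAttentionStandIn`, `postnas_retrofit`, and `keep_full_attn` are hypothetical names invented for illustration; the stand-in is a generic kernelized linear attention, not the actual JetBlock; and the hypernetwork-plus-beam-search stage is represented only by its output, the set of layer indices kept as full attention.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Toy transformer block: full attention followed by an MLP."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        a, _ = self.attn(x, x, x)
        return x + self.mlp(x + a)

class LinearAttentionStandIn(nn.Module):
    """Hypothetical stand-in for JetBlock (its real interface is not shown
    here): kernelized attention whose cost is linear in sequence length.
    Returns (output, None) to mirror nn.MultiheadAttention's signature."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, xq, xk, xv):
        q = self.q(xq).softmax(dim=-1)   # feature map over channels
        k = self.k(xk).softmax(dim=-2)   # feature map over positions
        kv = torch.einsum("bnd,bne->bde", k, self.v(xv))  # fixed-size state
        return torch.einsum("bnd,bde->bne", q, kv), None

def postnas_retrofit(blocks, keep_full_attn, dim):
    """Freeze every MLP (step 1) and swap in linear attention (step 2),
    except at the layer positions the search kept as full attention (step 3)."""
    for i, blk in enumerate(blocks):
        for p in blk.mlp.parameters():
            p.requires_grad = False
        if i not in keep_full_attn:
            blk.attn = LinearAttentionStandIn(dim)
    return blocks

# Usage: retrofit an 8-layer toy model, keeping layers 2 and 5 full-attention.
blocks = nn.ModuleList([Block(64) for _ in range(8)])
blocks = postnas_retrofit(blocks, keep_full_attn={2, 5}, dim=64)
x = torch.randn(1, 16, 64)
for blk in blocks:
    x = blk(x)
print(x.shape)  # torch.Size([1, 16, 64])
```

The point of the sketch is the division of labor: the frozen MLPs carry the pre-trained knowledge over unchanged, so only the new attention modules need training.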

Jet-Nemotron's performance figures are impressive: the 2B model matches or beats Qwen3-1.7B-Base on major benchmarks while delivering up to 47 times higher generation throughput, and at a 256K context length decoding is 53.6 times faster, cutting inference cost by about 98%. That is a transformative change for deployment on edge devices.
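As a sanity check on those numbers (our assumption, not a calculation from the article): if serving cost scales inversely with decoding throughput on fixed hardware, a 53.6× speedup corresponds to roughly a 98% cost reduction.

```python
# Assumes cost is proportional to GPU-time, i.e. inversely proportional
# to decoding throughput on the same hardware.
speedup = 53.6
print(f"cost reduction: {1 - 1 / speedup:.1%}")  # cost reduction: 98.1%
```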

In addition, Jet-Nemotron means enterprises can achieve a higher return on investment at lower cost. Practitioners can retrofit existing models without changing their data pipelines, strengthening real-time AI services; for researchers, PostNAS lowers the cost of experimenting with language model architectures, accelerating the development of AI technology.

Project: https://github.com/NVlabs/Jet-Nemotron

Key Points:   

🌟 Jet-Nemotron delivers up to a 53.6× generation speedup and a ~98% reduction in inference cost compared to leading full-attention models.   

💻 The PostNAS technology allows efficient retrofitting of existing pre-trained models while maintaining accuracy.   

📈 The release of the new models gives enterprises and researchers gains on both cost and performance.