NVIDIA released the Nemotron-Labs-TwoTower discrete diffusion language model on July 2nd, aiming to address the slow token-by-token generation of large models. The weights have been open-sourced on Hugging Face. The model builds on the existing Nemotron backbone and reuses its pre-trained weights rather than training from scratch, which significantly reduces development costs.


60B Two-Tower Architecture, Parallel Processing to Improve Generation Efficiency

The model has 60B parameters in total, split into two 30B networks that work together. Each tower activates 3B parameters and is equipped with 128 routable expert modules. The context tower is frozen and is responsible for retaining the overall semantic information; the denoising tower is the part that is trained, and it generates text in parallel using the diffusion mechanism. The two towers exchange data through cross-attention.
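The cross-attention link between the two towers can be illustrated with a toy example. This is a minimal sketch of generic scaled dot-product cross-attention, not NVIDIA's implementation: the function names, the two-dimensional toy vectors, and the choice to use the context states as both keys and values are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Each query vector attends over every key/value position.

    Here the queries stand in for denoising-tower states and the
    keys/values stand in for frozen context-tower states (an assumption
    for illustration, not the published architecture details).
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score of this query against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Hypothetical toy states: 3 context positions, 2 positions denoised in parallel.
context_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
denoiser_states = [[1.0, 0.0], [0.0, 1.0]]
mixed = cross_attention(denoiser_states, context_states, context_states)
```

Every position in `denoiser_states` is processed independently of the others, which is what allows several positions to be updated in the same pass instead of one token at a time.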

Traditional models output tokens sequentially, one at a time, whereas the two-tower architecture writes text in parallel, which greatly increases inference throughput while largely preserving output quality. Benchmark results show that the model retains 98.7% of the original model's overall capability while text generation throughput increases 2.42-fold, with only slight declines on code and math tasks.
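To put the reported numbers in concrete terms, the short calculation below reads the 2.42 figure as a throughput multiplier and applies it to a made-up baseline. The 100 tokens/s baseline and the 1,000-token response length are hypothetical values chosen only to make the arithmetic easy to follow; they are not measurements from the benchmark.

```python
SPEEDUP = 2.42            # throughput multiplier reported in the article
QUALITY_RETAINED = 0.987  # share of the original benchmark score retained

def generation_time(num_tokens, tokens_per_second):
    """Seconds needed to emit num_tokens at a given throughput."""
    return num_tokens / tokens_per_second

baseline_tps = 100.0                      # hypothetical baseline throughput
two_tower_tps = baseline_tps * SPEEDUP    # throughput after the reported speedup

baseline_s = generation_time(1000, baseline_tps)
two_tower_s = generation_time(1000, two_tower_tps)

# Fraction of wall-clock generation time saved: 1 - 1/2.42, roughly 59%.
time_saved_fraction = 1 - two_tower_s / baseline_s
# Fraction of benchmark score given up in exchange: roughly 1.3%.
quality_lost_fraction = 1 - QUALITY_RETAINED
```

Under this reading, generation time falls by a little under three fifths regardless of the baseline chosen, since the saving depends only on the multiplier.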

Open Source Deployment, Suitable for Multi-Scenario Inference

The model is released under NVIDIA's own open-source license, which allows developers to download and test it freely and to deploy it commercially. Full two-tower inference requires two H100 or A100 80GB GPUs working together; a single card supports only pure autoregressive mode. Testing covers tasks such as common sense, mathematics, code, and reading comprehension, and most metrics remain comparable to the original version, balancing generation speed and content quality.
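A rough weight-memory estimate is consistent with the two-card requirement. The sketch below assumes 16-bit weights (2 bytes per parameter), which the article does not state, and it ignores activations, KV cache, and framework overhead, so it gives a lower bound rather than a deployment guide. The helper names are hypothetical.

```python
import math

BYTES_PER_PARAM = 2   # assumption: 16-bit (bf16/fp16) weights
GPU_MEMORY_GB = 80    # per-card memory of the H100/A100 80GB parts

def weight_memory_gb(params_billion, bytes_per_param=BYTES_PER_PARAM):
    """Approximate weight footprint in GB (1 GB taken as 1e9 bytes)."""
    return params_billion * bytes_per_param

def min_gpus_for_weights(params_billion, gpu_memory_gb=GPU_MEMORY_GB):
    """Smallest number of cards whose combined memory holds the weights."""
    return math.ceil(weight_memory_gb(params_billion) / gpu_memory_gb)

one_tower_gb = weight_memory_gb(30)      # a single 30B tower
both_towers_gb = weight_memory_gb(60)    # the full 60B two-tower model

gpus_one_tower = min_gpus_for_weights(30)
gpus_both_towers = min_gpus_for_weights(60)
```

Under these assumptions one 30B tower needs about 60 GB and fits on a single 80GB card, while both towers together need about 120 GB and do not. That is one plausible reading of why a single card is limited to autoregressive mode.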