NVIDIA's AI research team recently released NitroGen, an open vision-action foundation model for generalist gaming agents. NitroGen learns to play commercial games directly from online videos by pairing what appears on screen with the controller inputs shown alongside it. The model was trained on 40,000 hours of gameplay covering more than 1,000 games, and the release also includes an open dataset, a universal simulator, and pre-trained policies.

NitroGen's data pipeline starts from publicly available gameplay videos that include input overlays, that is, on-screen visualizations of the player's controller. The research team collected 71,000 hours of raw video and, after quality filtering, kept 40,000 hours of curated data: 38,739 videos from 818 creators. According to the reported statistics, these videos span 846 games, with 34.9% of gameplay time coming from action role-playing games, 18.4% from platformers, 9.2% from action-adventure games, and the remainder from other genres such as sports, roguelike, and racing games.
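
The figures above imply a few derived numbers that are easy to check. The following is a minimal sketch using only the values quoted in this article; the function and constant names are illustrative assumptions, and the per-genre hours are approximations obtained by applying each percentage to the 40,000-hour total.

```python
# Figures quoted in the article; everything derived below is plain arithmetic.
RAW_HOURS = 71_000
CURATED_HOURS = 40_000

# Share of gameplay time per genre, in percent, as reported in the article.
GENRE_SHARE = {
    "action role-playing": 34.9,
    "platform": 18.4,
    "action-adventure": 9.2,
}


def retention_rate(raw: float, kept: float) -> float:
    """Fraction of raw footage that survived quality filtering."""
    return kept / raw


def remaining_share(shares: dict) -> float:
    """Percentage left over for all genres not listed individually."""
    return 100.0 - sum(shares.values())


def hours_for(share_percent: float, total_hours: float = CURATED_HOURS) -> float:
    """Approximate hours implied by a percentage share of the curated set."""
    return total_hours * share_percent / 100.0


if __name__ == "__main__":
    print(f"retained after filtering: {retention_rate(RAW_HOURS, CURATED_HOURS):.1%}")
    print(f"other genres combined: {remaining_share(GENRE_SHARE):.1f}%")
    for genre, share in GENRE_SHARE.items():
        print(f"{genre}: ~{hours_for(share):,.0f} h")
```

In other words, a little over half of the raw footage was kept, and the three named genres account for roughly 62.5% of gameplay time.
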
To extract the actions for each frame, NitroGen uses a three-stage pipeline. First, the system locates the controller overlay by matching against 300 controller templates. Next, a SegFormer-based segmentation model parses the controller region. Finally, the resulting coordinates are refined. This process is meant to keep the extracted action labels accurate, which in turn makes large-scale behavior cloning practical.
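
To make the three stages concrete, here is a toy sketch in pure Python on a tiny grayscale grid. It is not NitroGen's implementation: stage 1 assumes a sum-of-absolute-differences template match, stage 2 replaces the SegFormer-based model with a simple brightness threshold, and stage 3 assumes refinement means taking the centroid of the lit pixels and normalizing it to [-1, 1]. All function names and the demo frame are hypothetical.

```python
from typing import List, Tuple

Grid = List[List[int]]  # tiny grayscale "image": rows of pixel intensities


def locate_overlay(frame: Grid, template: Grid) -> Tuple[int, int]:
    """Stage 1 (assumed): slide the template over the frame and return the
    top-left (row, col) with the lowest sum of absolute differences."""
    th, tw = len(template), len(template[0])
    best_score, best_pos = None, (0, 0)
    for r in range(len(frame) - th + 1):
        for c in range(len(frame[0]) - tw + 1):
            score = sum(
                abs(frame[r + i][c + j] - template[i][j])
                for i in range(th)
                for j in range(tw)
            )
            if best_score is None or score < best_score:
                best_score, best_pos = score, (r, c)
    return best_pos


def segment_overlay(frame: Grid, top_left: Tuple[int, int],
                    size: Tuple[int, int], threshold: int = 128) -> Grid:
    """Stage 2 (stand-in for the segmentation model): label each pixel of the
    overlay region as lit (1) or unlit (0) with a brightness threshold."""
    r0, c0 = top_left
    h, w = size
    return [
        [1 if frame[r0 + i][c0 + j] >= threshold else 0 for j in range(w)]
        for i in range(h)
    ]


def refine_coordinates(mask: Grid) -> Tuple[float, float]:
    """Stage 3 (assumed): centroid of lit pixels, mapped to (x, y) in [-1, 1]."""
    lit = [(i, j) for i, row in enumerate(mask) for j, v in enumerate(row) if v]
    if not lit:
        return (0.0, 0.0)  # nothing lit: treat the stick as centered
    h, w = len(mask), len(mask[0])
    cy = sum(i for i, _ in lit) / len(lit)
    cx = sum(j for _, j in lit) / len(lit)
    return (2 * cx / max(w - 1, 1) - 1, 2 * cy / max(h - 1, 1) - 1)


def make_demo_frame() -> Grid:
    """5x6 black frame with a 3x3 overlay (intensity 100) at row 1, col 2,
    and one lit pixel (255) on the overlay's right edge."""
    frame = [[0] * 6 for _ in range(5)]
    for i in range(3):
        for j in range(3):
            frame[1 + i][2 + j] = 100
    frame[2][4] = 255
    return frame


if __name__ == "__main__":
    frame = make_demo_frame()
    blank_overlay = [[100] * 3 for _ in range(3)]
    pos = locate_overlay(frame, blank_overlay)
    mask = segment_overlay(frame, pos, (3, 3))
    print(pos, refine_coordinates(mask))
```

On the demo frame the overlay is found at (1, 2) and the refined stick position is (1.0, 0.0), meaning the stick is pushed fully to the right. A real pipeline would run this per frame and per overlay element to produce the action labels used for behavior cloning.
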
NitroGen also comes with a universal simulator that wraps commercial Windows games in a Gymnasium-compatible interface, supporting frame-by-frame interaction without modifying the game code. This lets the same policy be applied directly across multiple games.
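
The sketch below illustrates what a Gymnasium-style wrapper around a frame-steppable game could look like. It mirrors the Gymnasium contract, where `reset` returns `(observation, info)` and `step` returns `(observation, reward, terminated, truncated, info)`, but it does not import `gymnasium` and is not NitroGen's simulator. `FakeGameBackend`, `GameEnv`, the `stick_x` action key, and the reward and termination rules are all hypothetical and exist only for the demo.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Frame = List[List[int]]
Action = Dict[str, float]


class FakeGameBackend:
    """Hypothetical stand-in for a game process that can be stepped one frame
    at a time. A real backend would inject inputs and capture the screen."""

    def __init__(self, goal: int = 3) -> None:
        self.goal = goal
        self.position = 0

    def restart(self) -> None:
        self.position = 0

    def send_input(self, action: Action) -> None:
        # Move right when the (hypothetical) stick_x axis is pushed past 0.5.
        if action.get("stick_x", 0.0) > 0.5:
            self.position += 1

    def advance_one_frame(self) -> Frame:
        # A one-row "screen" with the player drawn as a single bright pixel.
        return [[255 if x == self.position else 0 for x in range(self.goal + 1)]]


class GameEnv:
    """Exposes the backend through Gymnasium-style reset/step methods."""

    def __init__(self, backend: FakeGameBackend, max_steps: int = 10) -> None:
        self.backend = backend
        self.max_steps = max_steps
        self.steps = 0

    def reset(self, *, seed: Optional[int] = None,
              options: Optional[dict] = None) -> Tuple[Frame, Dict[str, Any]]:
        self.backend.restart()
        self.steps = 0
        return self.backend.advance_one_frame(), {}

    def step(self, action: Action) -> Tuple[Frame, float, bool, bool, Dict[str, Any]]:
        self.backend.send_input(action)
        obs = self.backend.advance_one_frame()  # the game advances exactly one frame
        self.steps += 1
        terminated = self.backend.position >= self.backend.goal
        truncated = (not terminated) and self.steps >= self.max_steps
        reward = 1.0 if terminated else 0.0  # demo-only reward
        return obs, reward, terminated, truncated, {"steps": self.steps}


def run_episode(env: GameEnv, policy: Callable[[Frame], Action]) -> Tuple[float, int]:
    """Run one episode; the same loop works for any env with this interface."""
    obs, info = env.reset()
    total = 0.0
    while True:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total += reward
        if terminated or truncated:
            return total, info["steps"]


if __name__ == "__main__":
    env = GameEnv(FakeGameBackend(goal=3))
    print(run_episode(env, lambda obs: {"stick_x": 1.0}))
```

Because the policy only sees frames and emits controller actions, swapping in a different backend leaves `run_episode` unchanged, which is the property that lets one policy be evaluated across many games.
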
NitroGen's policy architecture is based on a Diffusion Transformer and operates on RGB images at a resolution of 256×256. After pre-training, NitroGen shows good zero-shot performance across multiple tasks, with task completion rates between 45% and 60%. Pre-training also brings significant gains when transferring to new games, with an improvement of up to 52% over training from scratch.
Hugging Face: https://huggingface.co/nvidia/NitroGen
Key points:
📊 NitroGen is an open vision-action foundation model that learns game controls directly from online videos.
🎮 The dataset contains 40,000 hours of gameplay video covering more than 1,000 games.
🚀 When transferred to new games, pre-trained NitroGen improves by up to 52% over training from scratch.


