NitroGen Gaming Agent AI Achieves 52% Improvement in Unseen Game Task Success

Loïc Magne, Anas Awadalla, and Guanzhi Wang, along with collaborators from NVIDIA, Stanford, Caltech, UChicago, and UT Austin, introduced NitroGen, an open foundation model designed for generalist gaming agents. This model is trained on an internet-scale dataset comprising 40,000 hours of gameplay videos from over 1,000 different games, utilizing a large-scale behavior cloning approach. NitroGen demonstrates strong competence across diverse game domains and achieves up to a 52% relative improvement in task success rates when transferred to unseen games, addressing limitations in existing embodied AI research and offering an open-source framework for advancing generalist agent development.

NitroGen: An Open Foundation Model for Gaming Agents

NitroGen is an open foundation model designed for generalist gaming agents, trained on a massive dataset of 40,000 hours of gameplay videos spanning over 1,000 different games. This model utilizes three key components: an internet-scale video-action dataset, a multi-game benchmark environment, and a unified vision-action model trained through large-scale behavior cloning. The aim is to create agents capable of operating in unknown game environments, addressing a current limitation in embodied AI research due to a lack of diverse, labeled action data.

The project introduces a novel approach to data collection by leveraging publicly available gaming videos featuring on-screen displays of player input commands – “input overlays.” An action extraction pipeline uses keypoint matching and a hybrid classification-segmentation network to accurately reconstruct player inputs, circumventing the need for costly manual data collection. This curated dataset provides diverse demonstrations for large-scale training, supporting the development of adaptable gaming agents across a wide range of titles and genres.
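To make the reconstruction step concrete, here is a minimal sketch of how per-frame overlay detections might be decoded into a gamepad action. NitroGen's actual pipeline uses a hybrid classification-segmentation network whose details are not given in this summary, so the function, its arguments, and the 0.5 threshold below are all illustrative assumptions; the network outputs are stubbed as plain numbers.

```python
# Hypothetical decoding of a gamepad action from overlay detections.
# button_probs stands in for the classifier head's per-button outputs;
# stick_px/center_px/radius_px stand in for the keypoint/template geometry.

def decode_action(button_probs, stick_px, center_px, radius_px):
    """Turn raw overlay detections into a gamepad action dict."""
    # A button counts as pressed when its predicted probability is >= 0.5.
    buttons = {name: p >= 0.5 for name, p in button_probs.items()}
    # Normalize the joystick-cap offset from the template center to [-1, 1].
    ax = (stick_px[0] - center_px[0]) / radius_px
    ay = (stick_px[1] - center_px[1]) / radius_px
    clamp = lambda v: max(-1.0, min(1.0, v))
    return {"buttons": buttons, "stick": (clamp(ax), clamp(ay))}

action = decode_action(
    button_probs={"A": 0.93, "B": 0.12},
    stick_px=(110.0, 90.0),
    center_px=(100.0, 100.0),
    radius_px=20.0,
)
# action: A pressed, B released, stick deflected to (0.5, -0.5)
```

The clamp keeps noisy keypoint detections from producing out-of-range stick values, which matters when the extracted actions are later replayed in a real game.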

Demonstrating the effectiveness of this approach, NitroGen achieves up to a 52% relative improvement in task success rates when fine-tuned on unseen games, compared to models trained from scratch. The researchers have open-sourced the dataset, a universal Gymnasium API for the evaluation suite, and the pre-trained model weights, intending NitroGen to serve as a foundational resource for accelerating research into more generalist embodied agents and new applications in the field.

Internet-Scale Video-Action Dataset Construction

NitroGen addresses the lack of large, labeled action datasets for embodied AI by constructing an internet-scale video-action dataset. This dataset comprises 40,000 hours of gameplay videos spanning over 1,000 games, sourced from publicly available content. A key innovation is automatically extracting action labels from these videos, specifically from “input overlays” showing players’ gamepad commands in real-time, eliminating the need for costly manual data collection and capturing diverse player behaviors.

The construction of this dataset relies on a pipeline that localizes gamepads within videos using keypoint matching with curated templates. A hybrid classification-segmentation network then predicts joystick positions and button states from cropped controller images, accurately reconstructing player inputs. This approach enables the creation of a large-scale dataset capturing a broad spectrum of real player behaviors across a diverse range of games, forming a crucial component of the NitroGen framework.

NitroGen’s dataset, alongside a multi-game benchmark and vision-action model, aims to accelerate research into generalist gaming agents. The dataset’s diverse composition – encompassing over 1,000 games – and automated labeling process represent a significant step forward in addressing the data scarcity that has hindered progress in embodied AI. Researchers can leverage this resource to develop and evaluate algorithms for building agents capable of operating in unknown game environments.

The authors report that replayed action sequences begin visually diverging from the source video after about one minute for games with continuous actions and after about three minutes for games with only discrete actions, suggesting the extracted labels reproduce gameplay faithfully over short horizons.
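Measuring that divergence point can be sketched as finding the first frame pair whose difference exceeds a threshold. The frames here are toy lists of floats and the mean-absolute-difference metric and threshold are illustrative assumptions, not NitroGen's actual procedure.

```python
# Hypothetical divergence check between a source video and a replayed
# sequence, frame by frame.

def first_divergence(frames_a, frames_b, threshold):
    """Return the index of the first frame pair whose mean absolute
    difference exceeds `threshold`, or None if they never diverge."""
    for t, (fa, fb) in enumerate(zip(frames_a, frames_b)):
        mad = sum(abs(a - b) for a, b in zip(fa, fb)) / len(fa)
        if mad > threshold:
            return t
    return None

# Two toy "videos" that match for three frames, then drift apart.
src    = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]
replay = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2], [0.9, 0.9]]
print(first_divergence(src, replay, threshold=0.05))  # -> 3
```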

Multi-Game Benchmark Environment Design

NitroGen introduces a multi-game benchmark environment designed to assess generalization in realistic gaming scenarios. This suite comprises 30 varied tasks across 10 commercial games, challenging agents with combat, navigation, and puzzle-solving. Crucially, a universal Gymnasium API is provided, allowing researchers to wrap any game for testing, creating a flexible and expandable evaluation platform. This benchmark reflects the complexities of modern game environments requiring adaptive agents.
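The shape of such a universal wrapper can be sketched against the Gymnasium reset/step contract. Everything below is a stand-in: `FakeGame` replaces a real game process, and NitroGen's actual wrapper internals are not described in this summary; only the five-tuple `step` signature follows the public Gymnasium convention.

```python
# Hypothetical environment wrapper mirroring Gymnasium's reset()/step()
# contract, with a toy game process in place of a real commercial title.

class FakeGame:
    """Toy stand-in for a running game: consumes inputs, produces frames."""
    def __init__(self):
        self.frame = 0
    def send_input(self, action):
        self.frame += 1
    def screenshot(self):
        return ("frame", self.frame)

class GameEnv:
    """Minimal env exposing the Gymnasium-style interface."""
    def __init__(self, game, max_steps=100):
        self.game, self.max_steps, self.t = game, max_steps, 0

    def reset(self):
        self.game.frame, self.t = 0, 0
        return self.game.screenshot(), {}           # (observation, info)

    def step(self, action):
        self.game.send_input(action)
        self.t += 1
        obs = self.game.screenshot()
        terminated = False                          # task-success check would go here
        truncated = self.t >= self.max_steps        # time-limit cutoff
        return obs, 0.0, terminated, truncated, {}  # Gymnasium 5-tuple

env = GameEnv(FakeGame(), max_steps=2)
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step({"A": True})
```

Because every game is exposed through the same interface, an agent's rollout loop needs no per-game code, which is what makes the benchmark expandable.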

The research team curated an internet-scale dataset of 40,000 hours of gameplay videos spanning over 1,000 games. This dataset is unique due to the extraction of frame-level action labels from publicly available videos featuring on-screen gamepad input overlays. This method avoids costly manual data collection, capturing a wide range of real player behaviors and providing diverse demonstrations for large-scale training of gaming agents.

NitroGen’s approach demonstrates significant performance gains; models fine-tuned using pre-trained weights achieved up to a 52% relative improvement in task success rates compared to models trained from scratch—given a fixed data and compute budget. The open-sourcing of the dataset, simulator, and pre-trained weights aims to accelerate research toward more generalist embodied agents and foster innovation in AI gaming.

Large-Scale Behavior-Cloning Pre-Training

NitroGen introduces a new approach to training gaming agents through large-scale behavior-cloning pre-training. Using 40,000 hours of gameplay videos from over 1,000 games, the system bypasses costly manual data collection by extracting action labels from publicly available videos in which player input commands are displayed on screen as "gamepad overlays." This curated dataset fuels a vision-action transformer model, aiming to create a generalist policy capable of adapting to diverse game environments and challenges.
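A behavior-cloning objective for mixed gamepad outputs can be sketched as binary cross-entropy on button presses plus squared error on stick axes. The paper's exact loss is not given in this summary, so this split, the equal weighting, and all numbers below are illustrative assumptions.

```python
# Hypothetical per-frame behavior-cloning loss for a vision-action model
# with binary button outputs and continuous stick outputs.
import math

def bc_loss(pred_button_probs, true_buttons, pred_sticks, true_sticks):
    eps = 1e-7
    # Binary cross-entropy over the button heads.
    bce = 0.0
    for p, y in zip(pred_button_probs, true_buttons):
        p = min(max(p, eps), 1 - eps)  # clamp for numerical safety
        bce += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    # Squared error over the stick axes.
    mse = sum((a - b) ** 2 for a, b in zip(pred_sticks, true_sticks))
    return bce / len(true_buttons) + mse / len(true_sticks)

# Near-perfect predictions give a near-zero loss; poor ones a large loss.
good = bc_loss([0.99, 0.01], [1, 0], [0.5, -0.5], [0.5, -0.5])
bad  = bc_loss([0.10, 0.90], [1, 0], [1.0,  1.0], [0.5, -0.5])
```

Minimizing such a loss over the 40,000-hour dataset is what "large-scale behavior cloning" amounts to: supervised imitation of the extracted player actions.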

The benefits of this behavior-cloning pre-training are demonstrated through post-training experiments. Fine-tuning the model, initialized with the pre-trained NitroGen weights, achieved up to a 52% relative improvement in task success rates compared to a model trained from scratch – given the same data and compute budget. This highlights the effectiveness of leveraging large-scale internet data for building robust and adaptable gaming agents.

NitroGen’s architecture centers on a vision-action transformer model trained on the extracted video-action dataset. This model takes game observations as input and generates gamepad actions, enabling zero-shot gameplay across multiple titles. Alongside the model, the researchers provide a universal Gymnasium API, an environment wrapper designed to interface with any commercial game, and have released the dataset and pre-trained weights to encourage further research into generalist embodied agents.

Addressing Limitations in Embodied AI

Addressing limitations in embodied AI, the researchers introduced NitroGen, an open foundation model trained on 40,000 hours of gameplay videos from over 1,000 games. A key issue previously was the lack of large, labeled action datasets; NitroGen overcomes this by automatically extracting player actions from publicly available videos displaying on-screen gamepad commands. This innovative approach eliminates the need for costly manual data collection and captures diverse player behaviors, offering a significantly larger dataset than previously available.

To evaluate generalization, a multi-game benchmark environment was created comprising 30 tasks across 10 commercial games. A universal Gymnasium API was developed so that any game can be wrapped and used to test diverse agent capabilities. This benchmark covers challenges like combat, navigation, and puzzle-solving, mirroring the complexities of modern game environments and enabling rigorous assessment of an agent’s ability to adapt across varying game mechanics and objectives.

The model demonstrates the benefits of large-scale behavior-cloning pre-training, achieving up to a 52% relative improvement in task success rates when fine-tuned on unseen games compared to models trained from scratch. This improvement was achieved with a fixed data and compute budget, highlighting the effectiveness of leveraging the 40,000-hour dataset. NitroGen’s components—dataset, simulator, and pre-trained weights—are open-sourced to accelerate research in generalist embodied agents.


Extracting Actions from Gameplay Videos

NitroGen addresses the lack of large, labeled action datasets for embodied AI by creating an internet-scale dataset sourced from 40,000 hours of publicly available gameplay videos spanning over 1,000 games. This dataset is built by automatically extracting player actions from videos where content creators display their gamepad inputs – known as “input overlays.” An annotation model then accurately reconstructs player inputs, removing the need for expensive manual data collection and capturing a broad spectrum of player behaviors.

To assess and benchmark generalist gaming agents, NitroGen introduces a multi-game evaluation suite comprising 30 tasks across 10 commercial games. These tasks cover diverse challenges like combat, navigation, and puzzle-solving, reflecting the demands of modern game environments. A universal Gymnasium API acts as a wrapper, allowing any game to be controlled and tested, providing a standardized environment for evaluating agent capabilities and fostering research.

The project demonstrates that large-scale behavior-cloning pre-training significantly improves performance. Fine-tuning a model from pre-trained NitroGen weights achieved up to a 52% relative improvement in task success rates compared to training from scratch—given a fixed data and compute budget. NitroGen releases its dataset, simulator, and pre-trained weights to accelerate research into more generalist embodied agents.

Gamepad Overlay Data Challenges

NitroGen addresses the lack of large, labeled action datasets hindering embodied AI progress by leveraging publicly available gaming videos featuring “gamepad overlays.” These overlays display player inputs in real-time, enabling the creation of an internet-scale dataset of 40,000 hours spanning over 1,000 games. An annotation model extracts frame-level actions, removing the need for costly manual data collection and capturing diverse player behaviors—a key component for training generalist gaming agents.

To assess generalization, NitroGen incorporates a multi-game benchmark environment comprising 30 tasks across 10 commercial games. A “universal simulator” utilizes a Gymnasium API to wrap any game, allowing for testing agent capabilities across varied mechanics and objectives—including combat, navigation, and puzzle-solving. This benchmark reflects modern game environments demanding adaptability, and facilitates evaluation of cross-game performance.

The research demonstrates a 52% relative improvement in task success rates when fine-tuning a model pre-trained on NitroGen’s data, compared to training from scratch with the same data and compute budget. This highlights the benefits of large-scale behavior-cloning pre-training. NitroGen’s dataset, simulator, and pre-trained weights are open-sourced to encourage further research into generalist embodied agents.

Action Extraction Pipeline Details

The NitroGen project addresses the lack of large, labeled action datasets for embodied AI by creating an internet-scale dataset sourced from 40,000 hours of publicly available gameplay videos spanning over 1,000 games. This dataset is built using a novel action extraction pipeline, automatically identifying player actions from “input overlays” – on-screen displays of gamepad commands. Keypoint matching with SIFT and XFeat features, alongside a hybrid classification-segmentation network, accurately reconstructs player inputs, eliminating costly manual data collection.
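The keypoint-matching step can be illustrated with the nearest-neighbour ratio test commonly paired with SIFT-style descriptors. The toy 2-D descriptors below are stand-ins (real SIFT descriptors are 128-D, and a real pipeline would use OpenCV or XFeat); only the ratio-test logic itself is standard.

```python
# Hypothetical template-to-frame descriptor matching with Lowe's ratio
# test: keep a match only when the best candidate is clearly closer
# than the second best.

def match_ratio_test(template_desc, frame_desc, ratio=0.75):
    matches = []
    for i, d in enumerate(template_desc):
        # Euclidean distance from this template descriptor to every
        # frame descriptor, sorted nearest-first.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(d, f)) ** 0.5, j)
            for j, f in enumerate(frame_desc)
        )
        best, second = dists[0], dists[1]
        if best[0] < ratio * second[0]:
            matches.append((i, best[1]))
    return matches

template = [(0.0, 0.0), (1.0, 1.0)]
frame    = [(0.05, 0.0), (5.0, 5.0), (1.0, 1.1)]
# Template point 0 matches frame point 0; template point 1 matches frame point 2.
```

Filtering ambiguous matches this way is what lets the pipeline localize the gamepad overlay reliably despite varied overlay styles and video compression.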

To assess and benchmark generalist gaming agents, NitroGen includes a multi-game environment comprising 30 tasks from 10 commercial games. A universal Gymnasium API serves as a wrapper, enabling control of any game and facilitating testing of diverse agent capabilities. This benchmark covers challenges like combat, navigation, and puzzle-solving, reflecting the complexities of modern game environments and allowing for rigorous evaluation of agent adaptability across heterogeneous mechanics.

The core of NitroGen is a vision-action transformer model trained on the curated dataset using large-scale behavior cloning. This model demonstrates strong performance on the benchmark suite, achieving up to 52% relative improvement in task success rates compared to models trained from scratch, given the same data and compute budget. The project releases the dataset, simulator, and pre-trained weights, aiming to accelerate research into generalist embodied agents and foster advancements in the field.

Dataset Diversity Across Games and Genres

NitroGen addresses the lack of large, diverse datasets for training generalist gaming agents by introducing an internet-scale dataset comprised of 40,000 hours of gameplay videos spanning over 1,000 different games. This dataset is uniquely sourced from publicly available videos featuring overlaid player commands, allowing for automated extraction of frame-level actions. This approach bypasses the need for costly manual data collection and captures a broad spectrum of real player behaviors across numerous titles and genres, crucial for robust agent training.

The NitroGen project also includes a multi-game benchmark environment designed to assess generalization capabilities. Comprising 30 tasks from 10 commercial games, the benchmark covers challenges like combat, navigation, and puzzle-solving. A key component is the universal Gymnasium API, enabling a standardized interface for wrapping any game to facilitate testing and evaluation of agent performance. This comprehensive environment allows researchers to measure how well agents adapt to diverse game mechanics and objectives.

Demonstrating the efficacy of this approach, a vision-action transformer model trained on the NitroGen dataset achieved up to a 52% relative improvement in task success rates when fine-tuned on unseen games, compared to models trained from scratch. This highlights the benefits of large-scale behavior-cloning pre-training using readily available internet data. The open-sourcing of the dataset, simulator, and model weights aims to foster further research and development in generalist embodied AI.

Universal Simulator with Gymnasium API

NitroGen introduces a “universal simulator” designed to facilitate the training and evaluation of generalist gaming agents. This simulator functions as an environment wrapper, enabling control of any commercial game through a Gymnasium API. This standardized interface allows researchers to test diverse agent capabilities across a broad range of titles, overcoming limitations of specialized simulators previously required for embodied AI research. The API is a key component in the NitroGen framework, as illustrated in Figure 1.

The NitroGen project centers around an internet-scale dataset of 40,000 hours of gameplay videos, spanning over 1,000 games with extracted action labels. This dataset is built by automatically extracting player actions from publicly available videos where content creators display their gamepad inputs in real time – termed “input overlays”. This approach circumvents the need for costly manual data collection and captures a wide spectrum of real player behaviors, offering diverse demonstrations for large-scale training.

A multi-game benchmark suite is also central to the NitroGen framework. Comprising 30 tasks from 10 commercial games, the benchmark covers challenges like combat, navigation, and puzzle-solving. Importantly, the suite is designed to work with the universal Gymnasium API, allowing for standardized evaluation of agent performance across heterogeneous game mechanics and objectives. This benchmark is intended to reflect the demands of modern game environments.

NitroGen Components Overview

NitroGen is presented as an open foundation model designed for generalist gaming agents, trained on a substantial dataset of 40,000 hours of gameplay videos spanning over 1,000 different games. A key component is the creation of an internet-scale dataset of action-labeled videos, achieved by extracting player actions from publicly available gaming videos where content creators display their input commands. This approach bypasses the need for costly manual data collection and captures a broad range of player behaviors across diverse game titles.

The system utilizes a multi-game benchmark environment comprised of 30 tasks across 10 commercial games, covering challenges like combat, navigation, and puzzle-solving. A “universal simulator” employing a Gymnasium API allows any game to be wrapped for testing agent capabilities, providing a standardized environment for evaluation. This benchmark aims to assess the generalization of the NitroGen model in realistic and varied gaming scenarios, pushing the boundaries of embodied AI.

Training utilizes large-scale behavior cloning with a vision-action transformer model. Post-training on unseen games demonstrated up to a 52% relative improvement in success rates compared to models trained from scratch, highlighting the benefits of pre-training with the extensive dataset. NitroGen’s components – the dataset, simulator, and pre-trained weights – are released open-source to foster research and development in generalist embodied agents.

Benefits of Pre-Training NitroGen

NitroGen offers significant benefits through large-scale pre-training, demonstrated by up to a 52% relative improvement in task success rates when fine-tuned on unseen games. This improvement is achieved even with a fixed data and compute budget, highlighting the effectiveness of leveraging the pre-trained weights. The system utilizes a vision-action transformer model trained on a dataset of 40,000 hours of gameplay videos spanning over 1,000 games, establishing a strong foundation for multi-game policy learning.
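For readers parsing the headline number, a relative improvement is computed against the from-scratch baseline. The success rates in the example below are made up purely to show the arithmetic; only the formula is standard, and the paper's actual per-task rates are not given in this summary.

```python
# What "52% relative improvement" means numerically.

def relative_improvement(finetuned, scratch):
    return (finetuned - scratch) / scratch

# e.g. a from-scratch success rate of 25% lifted to 38% by pre-training:
print(round(relative_improvement(0.38, 0.25), 2))  # -> 0.52
```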

The core of NitroGen’s approach lies in its three novel components: an internet-scale video dataset with action labels, a multi-game benchmark environment utilizing a Gymnasium wrapper, and a vision-action model. This system bypasses the need for costly manual data collection by extracting action labels from publicly available gameplay videos featuring overlaid gamepad inputs. This curated dataset, combined with a universal simulator, facilitates training and evaluating generalist gaming agents across diverse challenges.

NitroGen’s multi-game benchmark suite comprises 30 tasks from 10 commercial games, covering areas like combat, navigation, and puzzle-solving. The benchmark reflects the demands of modern game environments, requiring agents to adapt to heterogeneous mechanics. The framework is designed to be open-source, releasing the dataset, simulator, and pre-trained weights to accelerate research and development of more generalist embodied agents within the AI community.

Quantum News

As the Official Quantum Dog (or hound), my role is to dig out the latest nuggets of quantum goodness. There is so much happening right now in the field of technology, whether AI or the march of the robots. But quantum occupies a special space. Quite literally a special space: a Hilbert space, in fact, haha! Here I try to provide some of the news that might be considered breaking in the quantum computing space.
