Researchers are pushing the boundaries of artificial intelligence with VibeTensor, a complete deep learning system software stack built entirely through the work of AI agents. Led by Bing Xu, Terry Chen, and Fengzhe Zhou from NVIDIA, together with Tianqi Chen, Yangqing Jia, Vinod Grover, and others, the project demonstrates a significant leap in AI-assisted software engineering: a fully functional runtime, from language bindings to CUDA memory management, generated and validated primarily by automated builds and tests. Unlike simple bindings, VibeTensor boasts its own tensor system and autograd engine, offering a compelling proof of concept that AI can create complex, coherent software with minimal human intervention, potentially revolutionising how we approach system development.
AI agents build complete deep learning stack rapidly
Scientists have unveiled VIBETENSOR, a fully generated, open-source deep learning system software stack created by LLM-powered coding agents under high-level human guidance. This breakthrough demonstrates that AI agents can construct a coherent runtime, spanning from language bindings down to CUDA memory management, validated primarily through automated builds and tests. Unlike typical thin bindings, VIBETENSOR incorporates its own complete tensor and storage system, a schema-lite dispatcher, reverse-mode autograd engine, and a CUDA runtime featuring streams, events, and graphs, all generated by AI. The research team views this release as a significant milestone in AI-assisted software engineering, proving the feasibility of automated system software development at scale.
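On the Python side, usage plausibly resembles PyTorch. Here is a minimal sketch, assuming PyTorch-style names (`vibetensor.randn`, `Tensor.backward`, `Tensor.grad`) that are illustrative rather than confirmed VIBETENSOR API:

```python
# Illustrative sketch only: the module name and every call below are
# assumed PyTorch-style names, not confirmed VIBETENSOR API.
import vibetensor as vt

# Eager tensors on the CUDA device, routed through the schema-lite
# dispatcher to CUDA kernels.
x = vt.randn((4, 4), device="cuda", requires_grad=True)
w = vt.randn((4, 4), device="cuda", requires_grad=True)

# Reverse-mode autograd: the engine records the matmul/sum graph during
# the forward pass and replays it backwards.
loss = (x @ w).sum()
loss.backward()

print(w.grad.shape)  # the gradient has the same shape as w
```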
The core of VIBETENSOR is a PyTorch-style eager tensor library implemented with a modern C++20 core for both CPU and CUDA, complemented by a Python overlay utilising nanobind and an experimental Node.js/TypeScript interface. Crucially, the “fully generated” designation refers to code provenance; all implementation changes were proposed and applied as agent-driven diffs, with validation relying on agent-executed builds, tests, and differential checks rather than manual diff review. This approach represents a novel methodology for system software creation, shifting the focus from manual coding to agent-driven generation and automated validation. The team meticulously documented the architecture and workflow used to produce and validate the system, providing a detailed evaluation of the resulting artifact.
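A differential check of the kind the agents run might look like the sketch below, which uses PyTorch as the reference oracle; `vt.from_numpy` and `vt.matmul` are assumed names:

```python
# A minimal sketch of an agent-run differential check: compare a
# generated kernel's output against PyTorch as a reference oracle.
# vt.from_numpy, vt.matmul, and Tensor.numpy are assumed names.
import numpy as np
import torch
import vibetensor as vt

def diff_check(shape=(64, 64), rtol=1e-4, atol=1e-5):
    a = np.random.randn(*shape).astype(np.float32)
    b = np.random.randn(*shape).astype(np.float32)

    ref = torch.from_numpy(a) @ torch.from_numpy(b)        # reference
    out = vt.matmul(vt.from_numpy(a), vt.from_numpy(b))    # candidate

    # A failing tolerance check rejects the agent's proposed diff.
    np.testing.assert_allclose(out.numpy(), ref.numpy(), rtol=rtol, atol=atol)

diff_check()
```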
Researchers report repository scale and detailed test-suite composition, alongside reproducible microbenchmarks from an AI-generated kernel suite, including a performance comparison of fused attention against PyTorch’s SDPA and FlashAttention implementations. End-to-end training sanity checks were successfully completed on three small workloads: sequence reversal, a ViT model on CIFAR-10, and a miniGPT-style model, utilising NVIDIA H100 (Hopper, SM90) and Blackwell-class GPUs. Multi-GPU results, achieved exclusively on Blackwell GPUs, leveraged an optional CUTLASS-based ring-allreduce plugin, contingent on CUDA 13+ and the sm103a toolchain. This comprehensive evaluation demonstrates the functional correctness and potential performance of the AI-generated deep learning runtime.
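A microbenchmark in this style could be written as below; the PyTorch SDPA side and CUDA-event timing are standard, while the commented-out `vt.fused_attention` call stands in for VIBETENSOR’s kernel, whose actual name is not given here:

```python
# Sketch of the style of microbenchmark reported: fused attention timed
# against torch SDPA with CUDA events. vt.fused_attention is an assumed
# name, hence commented out; the SDPA baseline runs as-is.
import torch
import torch.nn.functional as F

def time_ms(fn, iters=100):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(10):              # warm-up iterations
        fn()
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

q, k, v = (torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))
print("SDPA:", time_ms(lambda: F.scaled_dot_product_attention(q, k, v)), "ms")
# print("VibeTensor:", time_ms(lambda: vt.fused_attention(q, k, v)), "ms")
```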
Furthermore, the study delves into the failure modes observed in generated system software, highlighting a “Frankenstein” composition effect where locally correct subsystems can interact to yield globally suboptimal performance. The open-source release of VIBETENSOR, accompanied by the evaluation artifacts, serves as a valuable resource for empirical investigation into AI-assisted system software development. The team’s work establishes a foundation for future research into “vibe-coded” system software, offering insights into the challenges and opportunities of leveraging AI agents for complex software engineering tasks and pushing the boundaries of automated code generation.
Streamlined VIBETENSOR Development via LLM-Driven Code Generation
Scientists engineered VIBETENSOR, an open-source deep learning system, utilising LLM-powered coding agents under high-level human guidance. The research team defined “fully generated” code as implementation changes proposed and applied as agent-created diffs, validated through agent-run builds, tests, and differential checks, eschewing manual diff review for each change. VIBETENSOR implements a PyTorch-style eager tensor library with a C++20 core, supporting both CPU and CUDA, alongside a Python overlay via nanobind and an experimental Node.js/TypeScript interface for broader accessibility. Unlike simple bindings, this system incorporates a bespoke tensor/storage system, a schema-lite dispatcher, reverse-mode autograd, and a CUDA runtime managing streams, events, and graphs for optimised performance.
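Conceptually, a “schema-lite” dispatcher keys kernels by operator name and device rather than by a full operator-schema language. The toy model below illustrates the idea only; it is not VIBETENSOR’s actual C++ implementation:

```python
# Conceptual model of a schema-lite dispatcher: kernels are registered
# under (op name, device) keys, with no full schema language.
_registry = {}

def register(op: str, device: str):
    """Decorator that files a kernel under its (op, device) key."""
    def deco(fn):
        _registry[(op, device)] = fn
        return fn
    return deco

def dispatch(op: str, device: str, *args):
    """Look up and invoke the kernel for this op on this device."""
    return _registry[(op, device)](*args)

@register("add", "cpu")
def add_cpu(a, b):
    return a + b

print(dispatch("add", "cpu", 2, 3))  # -> 5
```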
The study pioneered a stable C ABI, enabling dynamically loaded operator plugins and facilitating interoperability with existing deep learning ecosystems. Researchers developed a stream-ordered caching allocator, complete with integrated diagnostics, to monitor and optimise memory management within the system, a crucial component for large-scale deep learning models. This approach enables the creation of a coherent deep learning runtime, spanning language bindings down to CUDA memory management, validated primarily through automated builds and tests, marking a significant milestone in AI-assisted software engineering. The team reports repository scale and detailed test-suite composition, demonstrating the system’s breadth and robustness.
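The value of a stable C ABI is that a plugin can be loaded without C++ name mangling or compiler-version matching. The sketch below shows the mechanics with Python’s standard `ctypes`; the library path and the `vt_plugin_register` symbol are hypothetical:

```python
# Sketch of dynamically loading an operator plugin over a stable C ABI.
# Only the ctypes mechanics are standard; the .so name and the
# vt_plugin_register entry point are hypothetical stand-ins.
import ctypes

lib = ctypes.CDLL("./libvibetensor_ring_allreduce.so")  # assumed plugin name

# A C-ABI entry point is callable across compiler versions, which is
# what makes the ABI "stable".
lib.vt_plugin_register.restype = ctypes.c_int
status = lib.vt_plugin_register()
assert status == 0, "plugin registration failed"
```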
Experiments employed reproducible microbenchmarks from an AI-generated kernel suite, directly comparing fused attention performance against established PyTorch implementations such as SDPA and FlashAttention and revealing areas for potential optimisation. End-to-end training sanity checks were conducted on three small workloads: sequence reversal, ViT, and miniGPT, utilising NVIDIA H100 (Hopper, SM90) and Blackwell-class GPUs to assess functional correctness and basic performance. Multi-GPU results, obtained on Blackwell only, harnessed an optional CUTLASS-based ring-allreduce plugin, gated on CUDA 13+ and the sm103a toolchain, to accelerate communication between GPUs.

Furthermore, the work details a practical workflow for generating and validating system-scale software with agents, leveraging builds, tests, differential checks, and multi-agent code review as essential guardrails. Scientists observed “Frankenstein” composition effects, where locally correct subsystems interacted to yield globally suboptimal performance, highlighting the challenges of integrating AI-generated components and the need for holistic system-level optimisation. The team documented DLPack interop, a stable C plugin ABI, and hooks for custom kernels authored in Triton and CUDA template libraries such as CUTLASS, fostering extensibility and customisation.
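DLPack interop means tensors can cross framework boundaries without copying the underlying buffer. In the sketch below, the PyTorch side (`torch.from_dlpack`) is standard API; the `vt.from_dlpack` name on the VIBETENSOR side is an assumption mirroring the protocol:

```python
# DLPack interop sketch: zero-copy tensor exchange with PyTorch.
# torch.from_dlpack is real; vt.from_dlpack is an assumed name.
import torch
import vibetensor as vt

t = torch.arange(6, device="cuda", dtype=torch.float32)

# Hand a PyTorch tensor to VibeTensor without copying the buffer.
v = vt.from_dlpack(t)

# And back: torch.from_dlpack accepts any object implementing __dlpack__.
t2 = torch.from_dlpack(v)
assert t2.data_ptr() == t.data_ptr()  # same underlying CUDA allocation
```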
VIBETENSOR: an LLM-generated deep learning stack
Scientists have unveiled VIBETENSOR, a fully generated deep learning system software stack created with the assistance of LLM-powered coding agents under high-level human guidance. The research demonstrates that coding agents can construct a coherent runtime, extending from language bindings down to CUDA memory management, validated primarily through automated builds and tests. This achievement represents a milestone in AI-assisted software engineering, showcasing the potential for automated creation of complex systems. Experiments documented the repository’s scale and composition as crafted by the agents, alongside a comprehensive test suite designed to ensure functionality and stability.
Reproducible microbenchmarks, including fused attention, were measured against established implementations such as PyTorch’s SDPA and FlashAttention, providing quantifiable performance data. The team measured performance on NVIDIA H100 (Hopper, SM90) and Blackwell-class GPUs, with multi-GPU results, obtained on Blackwell only, leveraging an optional CUTLASS-based ring-allreduce plugin requiring CUDA 13+ and the sm103a toolchain. Results demonstrate successful end-to-end training sanity checks across three small workloads: sequence reversal, a CIFAR-10 ViT model, and a miniGPT-style model. These tests confirm the system’s ability to handle basic deep learning tasks and validate the integration of its various components.
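For a sense of scale, a sequence-reversal sanity check of the kind reported can be written in a few lines. The sketch below uses PyTorch for concreteness, since VIBETENSOR’s exact training API is not shown; the paper’s version runs on VIBETENSOR’s own tensor/autograd stack:

```python
# Toy sequence-reversal workload: a bidirectional GRU learns to emit
# each input sequence reversed. Written in PyTorch for concreteness.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, seq_len = 16, 8
emb = nn.Embedding(vocab, 32)
rnn = nn.GRU(32, 64, batch_first=True, bidirectional=True)
head = nn.Linear(128, vocab)
params = [*emb.parameters(), *rnn.parameters(), *head.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)

for step in range(300):
    x = torch.randint(0, vocab, (32, seq_len))
    y = x.flip(dims=[1])                 # target: the reversed sequence
    h, _ = rnn(emb(x))                   # (32, seq_len, 128)
    loss = F.cross_entropy(head(h).reshape(-1, vocab), y.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()

# Sanity criterion: loss falls well below chance, log(16) ≈ 2.77.
print(loss.item())
```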
The Blackwell-only multi-GPU scaling benchmark further highlights the system’s potential for parallel processing and increased computational throughput. Measurements confirm the system’s adherence to a stable C ABI, enabling the dynamic loading of operator plugins and facilitating extensibility. Scientists recorded a “Frankenstein” composition effect, where locally correct subsystems interacted to yield globally suboptimal performance, highlighting a key challenge in AI-assisted system software development. The work details an AI-assisted development methodology, utilizing builds, tests, differential checks, and multi-agent code review as guardrails for system-scale generation. VIBETENSOR includes DLPack interop, a stable C plugin ABI, and hooks for custom kernels authored in Triton and CUTLASS, offering interoperability and extension points for future development. The open-source release, available at https://github.com/NVLabs/vibetensor0.1, serves as a technical artifact for empirical study and further innovation in AI-assisted software engineering.
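The hooks for Triton-authored kernels suggest extension code shaped like the sketch below. The kernel itself is standard Triton (a Python DSL); how it would be registered with VIBETENSOR is not shown in this summary, so that step is omitted:

```python
# Standard Triton vector-add kernel of the kind the custom-kernel hooks
# would accept. Registration with VibeTensor is not shown here.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n                       # guard the ragged final block
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    add_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK=1024)
    return out

a = torch.randn(4096, device="cuda")
print(torch.allclose(add(a, a), a * 2))  # -> True
```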
AI Agents Build Complete Deep Learning Runtime
Scientists have created VIBETENSOR, an open-source deep learning system generated with the assistance of large language model-powered coding agents under human supervision. This system features a PyTorch-style eager tensor library with a C++20 core, alongside Python and Node.js interfaces, and importantly, includes its own tensor system, autograd, and CUDA runtime. The researchers validated the system through automated builds, tests, and differential checks, eschewing manual review of individual code changes. This work represents a significant milestone in AI-assisted software engineering, demonstrating that coding agents can construct a complete deep learning runtime, from language bindings to CUDA memory management, primarily through automated validation processes.
Microbenchmarks show performance comparable to established libraries like PyTorch, and end-to-end training tests were successfully completed on small workloads using both H100 and Blackwell GPUs. However, the authors acknowledge limitations, including potential performance issues arising from the “Frankenstein” effect, where locally correct subsystems interact suboptimally, and caution against using VIBETENSOR in production environments. The authors observed that system software generated in this manner can be susceptible to bugs related to stateful interactions and uninitialized buffers. Consequently, they emphasise the importance of regression tests that thoroughly exercise repeated execution and cross-stream behaviour. Future work could focus on addressing these failure modes and exploring how to better encode global performance objectives during the generation process. VIBETENSOR is released as a research and educational tool, intended to facilitate the study of large-scale software generation and benefit the broader GPU software community with hackable implementations and discussions around AI-assisted engineering methods.
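A regression test of the recommended shape might look like the following, written with PyTorch’s stream APIs for concreteness; the same pattern would apply to VIBETENSOR’s own stream interface:

```python
# Sketch of a regression test exercising repeated execution and
# cross-stream behaviour, the two areas the authors flag as bug-prone.
import torch

def test_repeated_cross_stream(iters=50):
    x = torch.randn(1024, 1024, device="cuda")
    s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
    s1.wait_stream(torch.cuda.current_stream())  # x was made on the default stream
    with torch.cuda.stream(s1):
        ref = x @ x                    # reference result on stream 1
    s2.wait_stream(s1)                 # order stream 2 after stream 1
    with torch.cuda.stream(s2):
        for _ in range(iters):         # repeated execution on stream 2
            out = x @ x
            # Bugs in stateful caches or uninitialized buffers tend to
            # surface as drift across iterations or streams.
            assert torch.equal(out, ref)
    torch.cuda.synchronize()

test_repeated_cross_stream()
```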
👉 More information
🗞 VibeTensor: System Software for Deep Learning, Fully Generated by AI Agents
🧠 ArXiv: https://arxiv.org/abs/2601.16238
