The challenge of creating truly autonomous vehicles demands increasingly efficient and accurate methods of processing complex visual data. Ellington Kirby, Alexandre Boulch, and Yihong Xu, alongside Yuan Yin, Gilles Puy, Éloi Zablocki, and colleagues at valeo.ai in Paris, France, address this need with their new research into a streamlined approach to end-to-end driving. Their work introduces DrivoR, a system leveraging pretrained Vision Transformers and a novel ‘register token’ mechanism to compress information from multiple cameras into a concise scene representation. This innovation significantly reduces computational demands while maintaining driving accuracy, and crucially, allows the system to adapt its behaviour based on desired characteristics like safety and comfort. By demonstrating superior performance on established benchmarks including NAVSIM and HUGSIM, DrivoR proves that a focused, token-based approach can deliver a viable pathway towards robust and adaptable autonomous driving systems.
These register tokens significantly reduce downstream computation without sacrificing accuracy, allowing for more efficient processing of visual data. They drive two lightweight transformer decoders that generate and then score candidate trajectories, providing a framework for path planning. The scoring decoder learns to mimic an oracle and predicts interpretable sub-scores representing aspects such as safety, comfort, and efficiency, enabling behaviour-conditioned driving at inference.
Camera-Aware Tokens for Efficient Driving Models
The research team developed DrivoR, a novel transformer-based architecture for end-to-end autonomous driving, designed for both efficiency and accuracy. Central to this work is the implementation of camera-aware register tokens, which compress multi-camera visual features into a compact scene representation, directly addressing the computational bottleneck inherent in processing high-resolution or multi-camera sensor setups. By reducing the length of the visual representation, DrivoR enables faster processing without compromising performance. Scientists engineered a system where these register tokens drive two lightweight transformer decoders, one generating candidate trajectories and the other scoring them.
The trajectory decoder produces potential paths, while the scoring decoder learns to mimic an ‘oracle’ and assigns interpretable sub-scores reflecting safety, comfort, and efficiency. This scoring mechanism enables behaviour-conditioned driving, letting the vehicle adapt its actions to desired driving characteristics. The scoring and generation modules are disentangled, improving stability and performance. Experiments employed a pure transformer architecture, eschewing intermediate representations such as Bird’s Eye View (BEV) maps. The perception transformer encoder processes raw image tokens, compressing them into a fixed set of register tokens that preserves planning-relevant context while significantly reducing the sequence length fed into the downstream decoders.
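The compression step described above can be sketched in a few lines of PyTorch. The module and parameter names here (`RegisterCompressor`, `registers_per_cam`, the token counts) are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class RegisterCompressor(nn.Module):
    """Sketch of camera-aware register-token compression (names and sizes
    are assumptions). Each camera gets its own learnable register bank; the
    encoder attends over image tokens plus registers, and only the registers
    are kept, shrinking the sequence fed to the downstream decoders."""

    def __init__(self, dim=256, num_cameras=3, registers_per_cam=16, depth=2):
        super().__init__()
        # One learnable register bank per camera ("camera-aware")
        self.registers = nn.Parameter(
            torch.randn(num_cameras, registers_per_cam, dim) * 0.02
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.registers_per_cam = registers_per_cam

    def forward(self, image_tokens):
        # image_tokens: (batch, num_cameras, tokens_per_cam, dim)
        B, C, T, D = image_tokens.shape
        outs = []
        for c in range(C):
            reg = self.registers[c].unsqueeze(0).expand(B, -1, -1)
            x = torch.cat([reg, image_tokens[:, c]], dim=1)
            x = self.encoder(x)
            outs.append(x[:, : self.registers_per_cam])  # keep registers only
        # Scene tokens: grouped registers from all cameras
        return torch.cat(outs, dim=1)

comp = RegisterCompressor()
tokens = torch.randn(2, 3, 196, 256)   # e.g. 14x14 patches per camera
scene = comp(tokens)
print(scene.shape)  # 3 x 196 = 588 image tokens reduced to 48 scene tokens
```

The key design point is that the image tokens are discarded after encoding: only the small register set is passed on, which is where the downstream savings come from.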
This delivers a streamlined pipeline, mapping raw sensor data and ego-state directly to driving decisions, reducing the need for costly and time-consuming intermediate labeling. The study pioneered the repurposing of Vision Transformer (ViT) register tokens for visual token reduction in end-to-end planning. Performance was rigorously evaluated across NAVSIM-v1, NAVSIM-v2, and the photorealistic closed-loop HUGSIM environment. DrivoR consistently outperformed or matched strong contemporary baselines, demonstrating that a pure transformer architecture, combined with targeted token compression, is sufficient for accurate, efficient, and adaptive autonomous driving. This establishes a new state-of-the-art in end-to-end planning, offering a computationally efficient and interpretable solution for future autonomous vehicles.
Camera-Aware Tokens for Efficient Autonomous Driving
Scientists have developed DrivoR, a transformer-based system for end-to-end autonomous driving that prioritises simplicity and efficiency. The research team achieved this by leveraging pretrained Vision Transformers (ViTs) and introducing camera-aware register tokens, which compress multi-camera features into a compact scene representation, demonstrably reducing computational demands without compromising accuracy. This compression maintains crucial planning-relevant context within a reduced visual representation length. The core of DrivoR lies in its two lightweight decoder modules, one generating candidate trajectories and the other scoring them based on learned criteria.
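The LoRA-style finetuning of the pretrained backbone can be illustrated with a minimal adapter that freezes a pretrained linear layer and learns only a low-rank update. The class name, rank, and scaling below are assumptions for the sketch, not the paper's configuration:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: the pretrained weight is frozen and a low-rank
    update B @ A is learned instead, keeping the trainable parameter count
    small (rank and alpha here are illustrative choices)."""

    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze pretrained weights
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus scaled low-rank correction
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(256, 256))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(trainable, total)  # 4096 trainable out of 69888 total parameters
```

Initialising `lora_b` to zero makes the adapted layer start out identical to the pretrained one, so finetuning begins from the backbone's original behaviour.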
The scoring decoder effectively mimics an idealised ‘oracle’, predicting interpretable sub-scores for safety, comfort, and efficiency, thereby enabling behaviour-conditioned driving. The team measured performance across NAVSIM-v1, NAVSIM-v2, and HUGSIM, consistently achieving results that either outperform or match strong contemporary baseline systems. A pure transformer architecture, combined with targeted token compression, is sufficient for accurate, efficient, and adaptive end-to-end driving. Specifically, the system relies solely on scoring annotations, eliminating the need for explicit 3D supervision, and still attains state-of-the-art results on all tested benchmarks.
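Behaviour conditioning through the predicted sub-scores can be shown with a small selection rule. The scores, weights, and function below are hypothetical, chosen only to illustrate how re-weighting interpretable sub-scores changes which trajectory is selected at inference:

```python
import numpy as np

# Hypothetical sub-scores for 3 candidate trajectories, as a scoring
# decoder might predict them: columns = (safety, comfort, efficiency).
sub_scores = np.array([
    [0.60, 0.90, 0.70],   # comfortable but less safe
    [0.95, 0.40, 0.60],   # very safe, less comfortable
    [0.75, 0.80, 0.95],   # efficient all-rounder
])

def select_trajectory(sub_scores, weights):
    """Behaviour conditioning: re-weight the interpretable sub-scores at
    inference and pick the highest-scoring candidate (illustrative rule)."""
    weights = np.asarray(weights, dtype=float)
    combined = sub_scores @ (weights / weights.sum())
    return int(np.argmax(combined))

# A safety-first profile picks candidate 1; a comfort-oriented profile
# shifts the choice to candidate 0 without retraining anything.
print(select_trajectory(sub_scores, [0.8, 0.1, 0.1]))  # 1
print(select_trajectory(sub_scores, [0.1, 0.8, 0.1]))  # 0
```

Because the sub-scores are exposed rather than collapsed into a single opaque value, the driving profile becomes an inference-time knob.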
The architecture comprises a perception encoder and two decoders, trajectory and scoring, all built upon standard transformer blocks. The perception encoder compresses perceptual information into camera-aware registers, forming scene tokens for subsequent processing. Technical accomplishments include the disentanglement of the scoring and trajectory-generation pathways, achieved by re-embedding decoded trajectories and detaching them from the gradient computation graph, which yields stronger performance and controllable behaviour. The team finetuned the ViT with LoRA, introduced sensor registers specific to each camera, and grouped these registers to form scene tokens.

The resulting lightweight decoders generate and evaluate candidate driving paths, with the scoring decoder predicting interpretable sub-scores for safety, comfort, and driving efficiency. Across NAVSIM-v1, NAVSIM-v2, and HUGSIM, DrivoR consistently matched or exceeded the performance of established contemporary systems. Ablation studies revealed the importance of LoRA finetuning, an optimal number of camera tokens (between 16 and 32), and separate branches for trajectory generation and scoring. The authors acknowledge limitations related to learning-rate scheduling, suggesting that further refinement could close the performance gap between full finetuning and LoRA.
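The gradient disentanglement between generation and scoring can be sketched in a few lines; the shapes and module names here are illustrative, not the paper's actual layers:

```python
import torch
import torch.nn as nn

# Sketch of the disentangled scoring path: decoded trajectories are
# re-embedded and detached, so scoring gradients never flow back into
# the trajectory decoder. Shapes and modules are illustrative.
dim, horizon = 256, 8

# (batch, K candidates, timesteps, xy) — stands in for the decoder output
traj_decoder_out = torch.randn(2, 4, horizon, 2, requires_grad=True)
re_embed = nn.Linear(horizon * 2, dim)
score_head = nn.Linear(dim, 3)  # safety, comfort, efficiency sub-scores

# detach() cuts the gradient path between generation and scoring
traj_tokens = re_embed(traj_decoder_out.detach().flatten(2))
scores = score_head(traj_tokens)  # (batch, K, 3)

scores.sum().backward()
print(traj_decoder_out.grad is None)  # True: scoring loss leaves the generator untouched
```

Keeping the two pathways gradient-isolated like this lets the scorer be trained against its oracle targets without destabilising the trajectory generator, matching the stability benefit described above.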
👉 More information
🗞 Driving on Registers
🧠 ArXiv: https://arxiv.org/abs/2601.05083
