RayRoPE Achieves 15% Improvement in Multi-View Attention Positional Encoding

Scientists are tackling the challenge of accurately representing spatial information within multi-view transformers, a crucial step for applications like 3D scene understanding and novel view synthesis. Yu Wu, Minsik Jeon, and Shubham Tulsiani from Carnegie Mellon University, together with Jen-Hao Rick Chang and Oncel Tuzel from Apple, introduce RayRoPE, a novel positional encoding scheme that uniquely identifies image patches using predicted ray positions and achieves crucial SE(3) invariance. Unlike existing methods, RayRoPE incorporates geometry-aware encoding and analytically handles positional uncertainty, leading to significant performance gains, demonstrated by a 15% relative improvement in LPIPS on the CO3D dataset, and the ability to effectively utilise RGB-D data. This research represents a substantial advance in multi-view perception, paving the way for more robust and accurate 3D reconstruction and rendering techniques.

To achieve SE(3) invariance, RayRoPE computes query-frame projective coordinates, enabling the multi-frequency similarity calculations that effective attention mechanisms rely on. The core innovation lies in representing patch positions using rays but employing a predicted point along each ray rather than its direction alone, allowing the encoding to adapt to the scene’s geometry. The resulting gains highlight the effectiveness of combining geometry-aware encoding with analytical uncertainty handling.
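As a concrete illustration of this idea, the following minimal numpy sketch (not code from the paper) computes a predicted point along a patch ray and maps it into a query camera’s frame; the ray, the predicted depth, and the pose convention (world-to-camera, x_cam = Rx + t) are assumptions for the example. It also checks the SE(3)-invariance property: moving the whole scene and the query camera by the same rigid transform leaves the query-frame coordinates unchanged.

```python
import numpy as np

def point_along_ray(origin, direction, depth):
    """Predicted 3D point p = o + t * d along a patch ray (world coords)."""
    return origin + depth * direction

def to_query_frame(p_world, R_q, t_q):
    """Map a world point into the query camera frame (x_cam = R x + t)."""
    return R_q @ p_world + t_q

# A key patch's ray with a (hypothetical) network-predicted depth.
p = point_along_ray(np.zeros(3), np.array([0.0, 0.0, 1.0]), depth=2.0)
R_q, t_q = np.eye(3), np.array([0.1, -0.2, 0.3])  # query pose (world-to-cam)

# SE(3)-invariance check: apply one rigid transform (G, g) to the scene and
# update the query pose accordingly; query-frame coordinates are unchanged.
rng = np.random.default_rng(0)
G, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(G) < 0:        # ensure a proper rotation, not a reflection
    G[:, 0] *= -1
g = rng.normal(size=3)
p_moved = G @ p + g
R_moved, t_moved = R_q @ G.T, t_q - R_q @ G.T @ g
print(np.allclose(to_query_frame(p, R_q, t_q),
                  to_query_frame(p_moved, R_moved, t_moved)))   # True
```

Because the predicted point and the query pose transform together, only relative geometry enters the encoding, which is exactly the invariance the authors target.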

The team achieved this by formulating a mechanism that leverages the rays corresponding to each patch to define position embeddings, ensuring uniqueness and multi-frequency encoding. RayRoPE overcomes the limitations of prior methods by using relative ray positions transformed to the query token’s camera frame, achieving SE(3) invariance, while an analytical solution for handling uncertainty in predicted depths further refines the encoding, ensuring robustness in diverse scenarios. Because the encoding can also consume known depths, the framework extends naturally to scenarios where depth information is readily available, further enhancing performance and accuracy. The research establishes a new standard for positional encoding in multi-view transformers, offering a robust and adaptable solution for various 3D vision tasks, and opens avenues for more accurate and efficient multi-view reconstruction, scene understanding, and robotic perception.
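For context, per-patch rays are typically obtained by back-projecting patch centres through a pinhole camera model. The sketch below shows that standard construction (intrinsics K, world-to-camera extrinsics R, t); the exact ray parameterisation RayRoPE uses is not detailed in this summary.

```python
import numpy as np

def patch_rays(K, R, t, pixels):
    """Back-project pixel centres into world-space rays (pinhole model).

    K: 3x3 intrinsics; R, t: world-to-camera extrinsics;
    pixels: (N, 2) array of patch-centre pixel coordinates.
    Returns the ray origin (camera centre) and unit directions, world frame.
    """
    origin = -R.T @ t                      # camera centre: solve R x + t = 0
    ones = np.ones((pixels.shape[0], 1))
    dirs_cam = (np.linalg.inv(K) @ np.hstack([pixels, ones]).T).T
    dirs_world = dirs_cam @ R              # row form of R^T @ d per ray
    dirs_world /= np.linalg.norm(dirs_world, axis=1, keepdims=True)
    return origin, dirs_world

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
R, t = np.eye(3), np.zeros(3)
origins, dirs = patch_rays(K, R, t, np.array([[320.0, 240.0]]))
print(origins, dirs)   # centre patch looks straight down the optical axis
```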

Ray-Based Positional Encoding for Multi-View Transformers improves 3D vision

The team engineered a system to compute query-frame projective coordinates, ensuring SE(3) invariance for multi-frequency similarity calculations within the attention mechanism. The technique reveals that encoding patch positions via predicted ray points allows the attention mechanism to learn geometry without explicit supervision. Researchers harnessed this capability to effectively integrate known geometry at inference, such as utilising RGB-D reference views for novel view synthesis. The work details how the team built on RoPE (rotary positional encoding) as the foundation for their multi-view spatial-relationship encoding, extending its existing translational-invariance properties.
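For readers unfamiliar with RoPE, the following sketch shows the standard one-dimensional rotary encoding that RayRoPE builds on: pairs of feature channels are rotated by position-dependent angles at multiple frequencies, so query-key dot products depend only on relative position. RayRoPE generalises this mechanism from scalar token indices to projective ray coordinates.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Standard rotary positional encoding (Su et al., 2021).

    x: (dim,) feature vector with even dim; pos: scalar position.
    Channel pairs are rotated by angles pos * base**(-2i/dim), so dot
    products between rotated queries and keys depend only on relative
    position -- the property RayRoPE extends to ray coordinates.
    """
    dim = x.shape[0]
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,) frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

q, k = np.ones(8), np.ones(8)
# Shifting both positions by the same offset leaves the score unchanged.
print(np.dot(rope(q, 3.0), rope(k, 1.0)))
print(np.dot(rope(q, 13.0), rope(k, 11.0)))   # same value: relative encoding
```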

This research pioneered a departure from traditional methods that concatenate camera information with input features, instead focusing on relative ray positions transformed to the query token’s camera coordinate system. The study’s analytical solution for handling uncertainty in predicted depths represents a significant methodological innovation, allowing for robust performance even with imperfect 3D point estimations. Ultimately, the development of RayRoPE enables more accurate and efficient multi-view transformers, advancing the state-of-the-art in 3D vision tasks.
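This summary does not reproduce the paper’s exact formula, but a standard closed form conveys what an analytical treatment of depth uncertainty can look like: if an encoded coordinate is Gaussian, x ~ N(mu, sigma^2), then E[cos(wx)] = cos(w*mu) * exp(-(w*sigma)^2 / 2), and likewise for sine, so uncertain coordinates smoothly attenuate their high-frequency rotary components instead of injecting noisy phases. A hedged numpy sketch of that idea:

```python
import numpy as np

def expected_rope_rotation(mu, sigma, freqs):
    """Expected cos/sin rotary components when the encoded coordinate is
    Gaussian-distributed, coord ~ N(mu, sigma^2).

    Uses E[cos(w*x)] = cos(w*mu) * exp(-(w*sigma)**2 / 2) (and likewise
    for sin): uncertain coordinates damp high-frequency terms toward zero.
    """
    atten = np.exp(-0.5 * (freqs * sigma) ** 2)
    return np.cos(freqs * mu) * atten, np.sin(freqs * mu) * atten

freqs = 10000.0 ** (-np.arange(0, 8, 2) / 8)
# Confident depth: rotations keep nearly full magnitude.
print(expected_rope_rotation(mu=2.0, sigma=0.01, freqs=freqs))
# Uncertain depth: high-frequency components are damped analytically.
print(expected_rope_rotation(mu=2.0, sigma=1.0, freqs=freqs))
```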

RayRoPE outperforms existing multi-view transformer encodings across several benchmarks

On the CO3D dataset, RayRoPE achieved a PSNR of 20.47, an LPIPS of 0.284, and an SSIM of 0.692, surpassing baseline methods. Tests show that RayRoPE’s performance advantage widens with increasing camera pose variation, highlighting its improved capability in reasoning about spatial geometry across disparate views. On the Objaverse dataset, RayRoPE attained a PSNR of 25.19, an LPIPS of 0.067, and an SSIM of 0.929. The method also delivers superior high-frequency detail in generated views, particularly for target views that overlap significantly with reference views.

Furthermore, when trained with known depths at reference views, RayRoPE significantly outperformed PRoPE on both Objaverse and CO3D, achieving gains in PSNR and LPIPS. Measurements confirm that RayRoPE enhances stereo depth estimation accuracy when applied to UniMatch, a multi-view transformer-based model. On the RGBD dataset, RayRoPE achieved an Absolute Relative Difference of 0.106, a Squared Relative Difference of 0.197, an RMSE of 0.574, and an RMSE log of 0.177, consistently outperforming PRoPE. Ablation studies revealed that removing the uncertainty prediction component resulted in a slight performance decrease, demonstrating its contribution to the overall accuracy of the system. These findings underscore the effectiveness of RayRoPE for multi-view attention and its potential for broader applications in 3D scene understanding.

RayRoPE encodes geometry via predicted 3D points

This new approach uniquely encodes patches, facilitates SE(3)-invariant representations with multi-frequency similarity, and adapts to scene geometry. Ablation studies confirmed the importance of uncertainty modeling, geometric adaptiveness, and multi-frequency similarities for robust performance, particularly on challenging datasets like CO3D. The authors acknowledge that the “Sq Rel” metric is less reliable for the RGBD dataset due to imperfections in depth and camera pose annotations. Future work could explore extending RayRoPE to other multi-view applications and investigating more sophisticated methods for predicting 3D point locations along rays. These findings establish a significant advancement in positional encoding for multi-view transformers, offering enhanced accuracy and adaptability for tasks requiring 3D scene understanding.

👉 More information
🗞 RayRoPE: Projective Ray Positional Encoding for Multi-view Attention
🧠 ArXiv: https://arxiv.org/abs/2601.15275

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
