Zero-shot 3D Alignment Achieves Object-Object Relations with Vision-Language and Geometry

Scientists are tackling the complex problem of aligning 3D meshes without prior training, a crucial step for virtual content creation and scene building. Rotem Gatenyo and Ohad Fried, both from Reichman University, alongside their colleagues, present a novel approach that directly optimises the relative pose of objects using text prompts describing their spatial relationship. This research significantly advances the field by moving beyond traditional geometric alignment and 2D-conditioned methods, instead employing CLIP-driven gradients and a differentiable renderer to update translation, rotation and scale at test time. By combining language supervision with geometry-aware objectives, including soft-ICP and penetration loss, and introducing a new benchmark dataset, the team demonstrate superior performance in achieving semantically accurate and physically realistic alignments.

Text prompts guide zero-shot 3D mesh alignment effectively

Scientists have achieved a breakthrough in zero-shot 3D alignment, successfully aligning two meshes using only a text prompt describing their intended spatial relationship. This capability is essential for advancements in content creation and scene assembly, offering a new level of control and automation in 3D workflows. The research team directly optimises the relative pose (translation, rotation, and isotropic scale) at test time, using CLIP-driven gradients through a differentiable renderer, crucially without requiring any new model training. This approach bypasses the need for extensive training datasets or pre-trained alignment models, addressing a significant limitation in the field.
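To make the test-time optimisation concrete, here is a minimal PyTorch sketch of such a loop. The callables `render_views` (a differentiable renderer producing images of the two meshes under the current pose) and `clip_similarity` (a differentiable text-image score) are illustrative stand-ins supplied by the user, not the paper's released code.

```python
import torch

def optimise_relative_pose(render_views, clip_similarity, prompt,
                           steps=2000, lr=1e-2):
    """Test-time optimisation of translation, rotation and isotropic scale.

    render_views(t, r, s)        -> differentiable batch of rendered images (user-supplied)
    clip_similarity(imgs, text)  -> differentiable scalar similarity (user-supplied)
    Both callables are assumptions standing in for the paper's renderer and CLIP head.
    """
    t = torch.zeros(3, requires_grad=True)       # translation of the source mesh
    r = torch.zeros(3, requires_grad=True)       # rotation as an axis-angle vector
    log_s = torch.zeros(1, requires_grad=True)   # isotropic scale, optimised in log-space

    opt = torch.optim.Adam([t, r, log_s], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        images = render_views(t, r, log_s.exp())     # differentiable rendering of both meshes
        loss = -clip_similarity(images, prompt)      # maximise text-image agreement
        loss.backward()
        opt.step()
    return t.detach(), r.detach(), log_s.exp().detach()
```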
The study unveils a framework that intelligently combines language supervision with geometry-aware objectives, enhancing the realism and accuracy of the alignments. A novel fractional soft-Iterative Closest Point (ICP) term encourages surface attachment, ensuring the meshes connect convincingly, while a penetration loss actively discourages interpenetration, preventing unrealistic overlaps. This dual approach ensures both semantic correctness and physical plausibility in the resulting arrangement. Furthermore, a phased schedule progressively strengthens these contact constraints over time, refining the alignment with each iteration, and camera control focuses the optimisation on the critical interaction region.
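The geometry-aware terms could be combined with the language loss along these lines. The fractional soft-ICP and penetration formulations below are illustrative interpretations rather than the paper's exact definitions, and `signed_dist_tgt` is an assumed callable returning a signed distance to the target surface (negative inside).

```python
import torch

def fractional_soft_icp(src_pts, tgt_pts, fraction=0.2):
    """Mean distance over the closest `fraction` of source points to the target points.

    Encouraging only a fraction of points to attach lets surfaces touch without
    forcing full overlap; `fraction` is an illustrative attachment ratio.
    """
    d = torch.cdist(src_pts, tgt_pts).min(dim=1).values      # nearest-neighbour distances
    k = max(1, int(fraction * d.numel()))
    return torch.topk(d, k, largest=False).values.mean()

def penetration_loss(src_pts, signed_dist_tgt):
    """Penalise source points that fall inside the target mesh.

    signed_dist_tgt(points) -> signed distance (negative inside); this callable is
    an assumption standing in for whatever inside/outside test the method uses.
    """
    sd = signed_dist_tgt(src_pts)
    return torch.relu(-sd).mean()                             # only interior points contribute

def total_loss(clip_term, icp_term, pen_term, w_clip=1.0, w_icp=0.1, w_pen=0.1):
    # Weighted sum of the language and geometry objectives; the contact weights
    # follow the phased schedule described below.
    return w_clip * clip_term + w_icp * icp_term + w_pen * pen_term
```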

To rigorously evaluate their method, researchers curated a new benchmark comprising 50 diverse mesh-prompt pairs, enabling standardised assessment of object-object alignment (OOA). Comparisons against existing geometric and language model-based baselines consistently demonstrate superior performance, yielding alignments that are both semantically faithful to the text prompt and physically plausible in their arrangement. The team’s method outperforms all alternatives, establishing a new state-of-the-art in zero-shot 3D alignment and opening doors to more intuitive and efficient 3D content creation pipelines. This work establishes two key contributions: a text-guided, test-time optimisation framework for estimating relative pose and scale, and the aforementioned benchmark for standardised evaluation.

Experiments show that the framework effectively couples differentiable-rendering-based language supervision with explicit contact and penetration objectives, achieving higher semantic agreement and lower intersection volume than competing methods. The experiments also illustrate how the phased optimisation schedule strengthens contact constraints over time, transitioning from broad exploration to focused refinement. The fractional soft-ICP weight increases across phases, preventing premature sticking before consolidating attachment, while the penetration-loss weight progressively suppresses interpenetration; for a rooster-comb pair, both weights were increased by a factor of ten between the three phases.
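A minimal sketch of such a phased schedule, with the contact weights ramped geometrically (a tenfold step per phase, as in the rooster-comb example); the base weights here are placeholders, not the paper's values.

```python
def phased_weights(step, total_steps=2000, phases=3,
                   base_icp=1e-3, base_pen=1e-3, factor=10.0):
    """Geometric (x10 per phase) ramp of the contact weights across the phases.

    Early phases keep attachment weak so the language signal can explore; later
    phases consolidate contact and suppress interpenetration.
    """
    phase = min(phases - 1, step * phases // total_steps)   # current phase index, 0..phases-1
    scale = factor ** phase
    return base_icp * scale, base_pen * scale

# Example: weights at the start of each of the three phases for 2,000 steps.
for s in (0, 700, 1400):
    print(s, phased_weights(s))
```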

Camera control was also implemented, concentrating the optimisation on the interaction region by interpolating the look-at target from the target mesh centroid to the source mesh centroid. This progressive zoom-in and look-at progression across phases provides both global context and detailed refinement. The method outperforms all alternatives, yielding semantically faithful and physically plausible alignments. The team curated a benchmark containing 50 mesh pairs and text prompts, covering diverse object-object relations such as “A sundae with a cherry on top” and “A candle sits inside a candle holder”.
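The camera progression described above can be sketched as follows; the interpolation from the target centroid to the source centroid matches the paper's description, while the zoom distances are illustrative values.

```python
import numpy as np

def camera_focus(phase, phases, centroid_tgt, centroid_src,
                 start_dist=3.0, end_dist=1.5):
    """Interpolate the look-at point and zoom in as optimisation progresses.

    Early phases look at the target mesh centroid for global context; later
    phases shift the look-at point toward the source mesh centroid and move
    the camera closer to focus on the interaction region.
    """
    alpha = phase / max(1, phases - 1)                  # 0 -> 1 across phases
    look_at = (1 - alpha) * np.asarray(centroid_tgt) + alpha * np.asarray(centroid_src)
    distance = (1 - alpha) * start_dist + alpha * end_dist
    return look_at, distance
```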

The optimisation runs for 2,000 steps per pair with a batch of 8 camera views, using P = 3 phases with a logarithmic increase in the fractional soft-ICP and penetration weights. Semantic alignment is assessed using three vision-language encoders, CLIP, ALIGN, and SigLIP, with higher values indicating stronger text alignment. Physical plausibility is quantified by the intersection ratio, Inter. = Vol(mesh1 ∩ mesh2) / Vol(mesh1 ∪ mesh2), which ranges from 0 to 1, with lower values indicating less interpenetration. Furthermore, the team employed N = 5 randomised initial poses and selected the best-scoring pose according to the total objective value, mitigating local minima and demonstrating robust performance. An LLM-guided hyperparameter selection process adjusts parameters such as the penetration policy, initial scale, and attachment ratio based on the text prompt and object names.
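For reference, the intersection ratio defined above can be computed with mesh boolean operations, for example via trimesh; this is one plausible way to evaluate the metric (it requires watertight meshes and a boolean backend such as manifold3d or Blender), not necessarily the authors' exact implementation.

```python
import trimesh

def intersection_ratio(mesh1, mesh2):
    """Inter. = Vol(mesh1 ∩ mesh2) / Vol(mesh1 ∪ mesh2), in [0, 1].

    Lower is better (less interpenetration). Assumes both meshes are watertight
    and that a trimesh boolean backend is available.
    """
    inter = trimesh.boolean.intersection([mesh1, mesh2])
    union = trimesh.boolean.union([mesh1, mesh2])
    if union.volume == 0:
        return 0.0
    return float(inter.volume) / float(union.volume)
```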

Text prompts align 3D meshes directly

Scientists have developed a new framework for aligning two 3D meshes using only a text prompt describing their spatial relationship. The method directly optimises the relative pose (translation, rotation, and isotropic scale) at test time, using CLIP-driven gradients through a differentiable renderer, without requiring additional training. The framework combines language supervision with geometry-aware objectives, including a soft-Iterative Closest Point (ICP) term for surface attachment and a penetration loss to prevent interpenetration, enhanced by a phased schedule that strengthens contact constraints and camera control that focuses on the interaction region. Researchers curated a benchmark dataset of text-mesh-mesh triplets to facilitate standardised evaluation of text-guided object alignment, demonstrating superior performance compared to existing approaches.
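For illustration, a benchmark entry in such a dataset could be represented as a simple text-mesh-mesh record; this schema is hypothetical, not the released format.

```python
from dataclasses import dataclass

@dataclass
class OOAExample:
    """One benchmark triplet: a text prompt plus a source and a target mesh."""
    prompt: str        # e.g. "A sundae with a cherry on top"
    source_mesh: str   # path to the mesh being placed (e.g. the cherry)
    target_mesh: str   # path to the mesh it is placed relative to (e.g. the sundae)
```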

Extensive experiments confirm the method’s robustness and highlight the importance of both geometric and language-based objectives in achieving semantically faithful and physically plausible alignments. The authors acknowledge limitations when objects exhibit significant size differences or heavy occlusion, as these scenarios can lead to unreliable vision-language gradients, consistent with known CLIP biases towards more prominent objects. Future work will focus on integrating stronger vision-language models, ensuring multi-view consistency, and incorporating physics-based reasoning to further enhance the realism of the alignments.

👉 More information
🗞 Copy-Transform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints
🧠 ArXiv: https://arxiv.org/abs/2601.14207

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
