Scientists are increasingly focused on understanding how vision-language models (VLMs) represent the relationship between images and text. Grégoire Dhimoïla (ENS Paris-Saclay), Thomas Fel and Victor Boutin (Brown University), together with Agustin Picard and colleagues, have investigated the underlying geometry of these embedding spaces, revealing a surprising degree of structure. Their work centres on the Iso-Energy Assumption, which leverages cross-modal redundancy to enforce consistency between visual and textual representations via an Aligned Sparse Autoencoder. The research is significant because it not only provides a framework for analysing VLM alignment, but also demonstrates how a carefully chosen inductive bias can improve interpretability and enable controlled manipulation of the latent space, potentially leading to more robust and effective multimodal AI systems.
Geometric alignment of multimodal embeddings via the Iso-Energy Assumption
Researchers have developed a new framework for analysing how vision-language models organise and align semantic content across different modalities. This work centres on understanding the geometry of shared embedding spaces within these models, which are crucial for applications ranging from visual question answering to autonomous driving.
The study introduces the Iso-Energy Assumption, proposing that concepts genuinely shared between image and text should exhibit consistent average energy regardless of the input modality. Applying this framework to foundational vision-language models reveals a clear structure with significant practical implications for model behaviour and interpretability.
Specifically, the research demonstrates that sparse bimodal atoms carry the entire cross-modal alignment signal, effectively bridging the gap between visual and textual information. Unimodal atoms, conversely, function as modality-specific biases, fully explaining the observed modality gap, and their removal collapses this gap without compromising overall performance.
Furthermore, restricting vector arithmetic to this bimodal subspace enables in-distribution edits and enhances retrieval accuracy, suggesting a pathway towards more controllable and effective vision-language systems. These findings indicate that a carefully chosen inductive bias can simultaneously preserve model fidelity and render the underlying latent geometry both interpretable and actionable.
The work provides a new lens through which to view the internal workings of these complex models, potentially paving the way for the design of more robust and transparent vision-language architectures. The Aligned Sparse Autoencoder (SAE) at the heart of the method was trained to reconstruct inputs while simultaneously enforcing sparsity and, crucially, energy consistency between visual and textual modalities.
This consistency was achieved by penalising discrepancies in the average squared activation, termed ‘energy’, of learned concepts across image and text data. The study operationalised the Iso-Energy Assumption, positing that genuinely shared concepts should exhibit equivalent energy levels regardless of input modality.
To implement this, the SAE’s training objective incorporated an alignment penalty, directly encouraging the model to learn representations where bimodal concepts possess similar energies. Reconstruction loss ensured the preservation of information, while the alignment penalty guided the emergence of a geometrically interpretable latent space.
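The combined objective described above can be sketched in a few lines. This is a minimal illustration, assuming 'energy' is the mean squared activation of each concept over a batch; the function names, penalty form, and weight `lam` are illustrative choices, not the paper's exact formulation:

```python
import numpy as np

def concept_energy(codes):
    """Energy of each concept: mean squared activation over a
    batch of sparse codes with shape (batch, n_concepts)."""
    return (np.asarray(codes, dtype=float) ** 2).mean(axis=0)

def iso_energy_penalty(img_codes, txt_codes):
    """Alignment penalty: per-concept energy discrepancies
    between the image batch and the text batch."""
    return ((concept_energy(img_codes) - concept_energy(txt_codes)) ** 2).sum()

def aligned_sae_loss(x, x_hat, img_codes, txt_codes, lam=1.0):
    """Reconstruction error plus the energy-alignment penalty."""
    rec = ((np.asarray(x, dtype=float) - np.asarray(x_hat, dtype=float)) ** 2).mean()
    return rec + lam * iso_energy_penalty(img_codes, txt_codes)
```

The reconstruction term preserves information, while the penalty pushes concepts that fire on both modalities toward equal average energy, as the text describes.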
Validation of the Aligned SAE involved controlled experiments utilising synthetic data with known alignment properties. These tests confirmed that the framework accurately identifies and enhances alignment when the Iso-Energy principle holds, and conversely, maintains separation when it does not. Application of this method to foundational VLMs revealed a distinct geometric decomposition of the embedding space.
Specifically, the research demonstrated that sparse bimodal atoms encapsulate the entire cross-modal alignment signal, while unimodal atoms function as modality-specific biases, fully accounting for the observed modality gap; removing these unimodal components collapsed the gap without compromising overall performance.
The study characterises the latent space through novel metrics, establishing that the shared concept space, denoted Γ, is equivalent to the cone spanned by the dictionary atoms selected by a binary mask.
Conversely, the modality-specific subspaces, Ω_I and Ω_T, are likewise defined as cones, each selected by its own binary mask. This characterisation rests on analysing four complementary aspects of the dictionary, validating a concept-based approach to understanding VLM encoders. The work also formalises a multimodal concept generative process, modelling data as originating from a latent concept vector passed through a domain-specific generator.
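To make the cone-and-mask characterisation concrete, here is a toy sketch. The dictionary, the masks, and the helper names are all hypothetical; a non-negative sparse code lies in a cone when its only active coordinates are those the mask selects:

```python
import numpy as np

# Hypothetical dictionary of unit-norm atoms (n_atoms, dim).
rng = np.random.default_rng(0)
D = rng.normal(size=(6, 8))
D /= np.linalg.norm(D, axis=1, keepdims=True)

# Binary masks partitioning the atoms, mirroring Γ, Ω_I, Ω_T.
m_shared = np.array([1, 1, 1, 0, 0, 0], dtype=bool)  # bimodal atoms
m_image  = np.array([0, 0, 0, 1, 1, 0], dtype=bool)  # image-only atoms
m_text   = np.array([0, 0, 0, 0, 0, 1], dtype=bool)  # text-only atoms

def in_cone(codes, mask):
    """True when every active coordinate of a non-negative sparse
    code is selected by the mask (i.e. the code lies in that cone)."""
    codes = np.asarray(codes, dtype=float)
    return bool(np.all(codes >= 0) and np.all(codes[~mask] == 0))

def embed(codes):
    """Map a sparse code to an embedding via the dictionary."""
    return np.asarray(codes, dtype=float) @ D
```

A code active only on the first three atoms then belongs to the shared cone but not to the image-specific one.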
The Iso-Energy Assumption, central to this research, posits that genuinely multimodal concepts should exhibit consistent average energy across modalities. This assumption narrows the solution space during dictionary recovery, promoting stable and plausible solutions. Specifically, the framework satisfies Iso-Energy when the second moment of each coordinate of the learned encoder is domain-invariant across all domains.
Through the use of a Matching Pursuit sparse autoencoder, the research demonstrates that high-energy unimodal features correspond to modality-specific biases. Restricting vector arithmetic to the bimodal subspace yields in-distribution edits and improved retrieval performance, highlighting the practical value of this interpretable latent geometry. These findings suggest that a carefully chosen inductive bias can preserve model fidelity while simultaneously rendering the latent space interpretable and actionable.
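Restricting an edit direction to the bimodal subspace amounts to projecting it onto the span of the bimodal atoms. A minimal sketch, with hypothetical atoms standing in for the learned dictionary:

```python
import numpy as np

# Hypothetical bimodal atoms (rows) spanning the shared subspace
# of a 4-dimensional embedding space.
bimodal_atoms = np.array([[1.0, 0.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0, 0.0]])

def restrict_to_subspace(v, atoms):
    """Orthogonal projection of a vector onto the span of the
    given atoms (rows of `atoms`), computed via least squares."""
    coeffs, *_ = np.linalg.lstsq(atoms.T, v, rcond=None)
    return atoms.T @ coeffs

edit = np.array([0.5, -0.2, 0.7, 0.1])                 # raw edit direction
safe_edit = restrict_to_subspace(edit, bimodal_atoms)  # bimodal part only
```

The projected edit keeps only the components that carry cross-modal meaning, which is the intuition behind the reported in-distribution edits.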
Geometric Structure Reveals Cross-Modal Alignment and Bias in Vision-Language Models
Researchers have developed a framework for analysing the geometric structure within the embedding spaces of vision-language models. The analysis reveals that cross-modal alignment signals are entirely carried by sparse bimodal components, while unimodal components function as modality-specific biases that account for the gap between image and text embeddings.
Removing these unimodal components effectively closes this gap without compromising performance. Furthermore, restricting vector arithmetic operations to the bimodal subspace enables in-distribution edits and improves information retrieval. These findings demonstrate that a carefully chosen inductive bias can simultaneously maintain model accuracy and create a latent geometry that is both interpretable and useful.
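The gap-collapse finding can be illustrated on synthetic data: removing each embedding's component along modality-specific axes eliminates the mean offset between modalities. The axes and data below are invented for illustration:

```python
import numpy as np

def remove_component(X, atoms):
    """Remove each row's projection onto the span of the given
    orthonormal atoms (rows of `atoms`)."""
    P = atoms.T @ atoms          # projector onto the unimodal span
    return X - X @ P

# Toy setup: one shared axis, plus one modality-specific axis each.
rng = np.random.default_rng(1)
shared = rng.normal(size=(100, 1))
img = np.hstack([shared, np.ones((100, 1)), np.zeros((100, 1))])
txt = np.hstack([shared, np.zeros((100, 1)), np.ones((100, 1))])

unimodal = np.array([[0.0, 1.0, 0.0],   # image-only axis
                     [0.0, 0.0, 1.0]])  # text-only axis

gap_before = np.linalg.norm(img.mean(0) - txt.mean(0))
gap_after = np.linalg.norm(
    remove_component(img, unimodal).mean(0)
    - remove_component(txt, unimodal).mean(0))
```

On this toy data the gap between modality centroids vanishes after removal, while the shared (semantic) axis is untouched.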
The authors acknowledge that measuring the modality gap presents challenges, as simple metrics like mean distance or linear separability may not fully capture distributional mismatches after intervention. Future research could explore more robust methods for quantifying these mismatches and further investigate the potential for manipulating embeddings to achieve specific outcomes. The demonstrated ability to isolate and manipulate cross-modal information opens avenues for improved semantic editing and retrieval in vision-language models, offering a path towards more controllable and interpretable multimodal AI systems.
👉 More information
🗞 Cross-Modal Redundancy and the Geometry of Vision-Language Embeddings
🧠 arXiv: https://arxiv.org/abs/2602.06218
