Scientists are increasingly interested in understanding how the brain integrates information from different senses. A new study by Milano and Nolfi, from the Institute of Cognitive Sciences and Technologies, National Research Council, together with colleagues, sheds light on whether language, vision, and action are processed using separate or shared internal representations. Challenging the traditional view of specialised processing, the researchers trained an agent to perform actions based on language and then compared its internal representations with those of large language models such as LLaMA and vision-language models such as CLIP, revealing a surprising alignment between these modalities. The work is significant because it shows that action-based representations correlate strongly with those of certain language models, suggesting a shared semantic structure across modalities and opening up possibilities for more effective cross-domain learning in artificial intelligence and robotics.
Embodied agents learn action representations that align with large language models through multimodal perception
Scientists investigate whether representational convergence extends to embodied action learning by training a transformer-based agent to execute goal-directed behaviors in response to natural language instructions. Using behavioral cloning on the BabyAI platform, they generate action-grounded language embeddings shaped exclusively by sensorimotor control requirements. These embeddings align strongly with those of decoder-only large language models such as LLaMA, Qwen, and DeepSeek; alignment with BERT and the vision–language model CLIP is significantly weaker, yet non-negligible. Together, the results indicate that linguistic, visual, and action representations converge toward partially shared semantic structures. These findings support the hypothesis of modality-independent semantic organization and highlight the potential for cross-domain transfer in embodied artificial intelligence systems.
Internal representations in neural networks mediate the transformation from input to output, and it is traditionally assumed that models exposed to different data modalities or optimized for distinct tasks develop correspondingly specialized representational structures. This assumption arises from the belief that internal feature spaces are shaped jointly by the statistical properties of training data and the constraints imposed by learning objectives. Consequently, differences in modality or task demands are expected to produce task-specific, non-transferable representations rather than shared semantic structures. This view has strongly influenced theories of semantic knowledge, leading to the long-standing belief that systems trained solely on linguistic input cannot acquire grounded or embodied representations comparable to those formed through direct perceptual and motor interaction with the physical world.
From this perspective, language-based models are seen as operating within a self-contained symbolic domain, lacking the experiential grounding necessary to form concepts anchored in sensorimotor experience. Without explicit grounding, such models are thought to capture only syntactic or statistical relationships among symbols, rather than the rich conceptual knowledge derived from interaction with the environment. Under this assumption, semantic representations would necessarily remain modality-specific, with linguistic, visual, and action-based knowledge occupying fundamentally distinct representational spaces. If this view were correct, minimal structural correspondence would be expected among representations learned through different modalities.
However, growing empirical evidence challenges this traditional boundary. Recent studies reveal that systems trained on different modalities, tasks, or learning objectives often converge toward similar internal representational geometries. Large language models have been shown to acquire implicit knowledge of the physical world from text alone, demonstrating abilities such as reasoning about color perception, spatial structure, affordances, and action planning. For example, prior work has shown that representations learned by text-only language models can be linearly mapped onto those of vision–language models, indicating substantial overlap in their semantic structures. Additional studies have found partial isomorphism between language and vision embeddings, further suggesting cross-modal alignment.
These findings collectively imply that linguistic co-occurrence statistics encode substantial information about the physical and functional structure of the world, sufficient to approximate aspects of grounded cognition. Building on this insight, the present study addresses a critical open question: whether representations learned through embodied action—arguably the most grounded form of learning—align with those learned through passive observation of language and vision. Because action representations are shaped by the pragmatic demands of achieving goals through interaction, their alignment with linguistic and visual representations would provide strong evidence that core semantic structures transcend modality.
To investigate this question, the authors train a transformer-based agent on the BabyAI platform using behavioral cloning, enabling a simulated agent to execute action sequences in response to natural language instructions. The resulting internal representations are compared with those extracted from large language models (including LLaMA, Qwen, DeepSeek, and BERT) and vision–language models (including CLIP and BLIP). The analysis reveals partial but meaningful alignment between action-based representations and those learned through language and vision, supporting the hypothesis that semantic organization is fundamentally modality-independent.
The novelty of this work lies in extending cross-modal alignment analysis to include action representations and in contrasting learning through passive observation with learning through embodied interaction. Whereas language and vision models learn from static datasets, the embodied agent actively shapes its sensory experience through its actions and learns from the consequences of those actions. Despite this fundamental difference, the observed alignment suggests the existence of shared semantic structures linking linguistic, visual, and sensorimotor domains within a unified representational framework. These results challenge the assumption that grounded knowledge requires direct sensorimotor experience and instead point toward common representational principles that enable cross-domain integration and transfer.
The experimental framework is based on the BabyAI platform, a simulated 2D partially observable environment designed to study language-conditioned action learning. In each episode, the agent receives a natural language instruction and a visual observation of its local surroundings and must generate an action sequence to accomplish the specified goal. The environment contains objects such as balls, boxes, keys, and doors of various colors, with the agent able to perform six discrete actions: turning left or right, moving forward, picking up, dropping, and opening objects. The precise action sequence depends on the instruction, the agent’s initial position and orientation, and the spatial configuration of objects, allowing for a diverse and challenging set of embodied tasks. This setting provides a controlled yet expressive testbed for studying the emergence and alignment of action-grounded semantic representations.
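The episode structure described above can be sketched as a minimal control loop. Note that the class and function names below are illustrative stand-ins, not the BabyAI API; only the six discrete actions come from the paper.

```python
from dataclasses import dataclass

# The six discrete actions available to the agent, per the paper.
ACTIONS = ["turn_left", "turn_right", "go_forward", "pick_up", "drop", "open"]

@dataclass
class Observation:
    """A partial, egocentric view of the grid plus the episode's instruction."""
    instruction: str   # e.g. "go to the red ball"
    local_view: list   # visible grid cells (stubbed here)

def run_episode(env, policy, max_steps=64):
    """Generic language-conditioned control loop: the agent reads the
    instruction from the observation and acts until the goal is reached
    or the step budget runs out. `env` and `policy` are hypothetical."""
    obs = env.reset()                        # randomized position/objects
    for _ in range(max_steps):
        action = policy(obs)                 # index into ACTIONS
        obs, done = env.step(ACTIONS[action])
        if done:
            return True                      # goal satisfied
    return False
```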
Training a language-conditioned agent within a 2D embodied environment via behavioral cloning
Scientists investigated whether learning across modalities (language, vision, and action) yields distinct or shared internal representations. The study developed a methodology using the BabyAI platform, a Python library for language-conditioned action learning, to train a transformer-based agent. Researchers generated action-grounded language embeddings shaped solely by sensorimotor control requirements within the BabyAI environment.
The BabyAI platform features a 2D, partially observable environment containing objects such as balls, boxes, keys, and doors in colours ranging from yellow to purple. Each evaluation episode begins with a randomised agent position, orientation, and object placement. To complete missions described by natural language instructions, the agent executes six actions: turn left, turn right, go forward, pick up, drop, and open.
Instructions combine phrases like “go to”, “pick up”, and “put next to” with object descriptions and spatial/temporal relationships. The team engineered a visuo-linguistic transformer network controller. Language requests were converted into sequences of 100-dimensional token embeddings, initialised identically but differentiated during learning.
A transformer block with multi-head self-attention and a multi-layer perceptron then processed these embeddings into a single 128-dimensional language embedding, EL. In parallel, the agent received images of its local environment as visual input. The study focused on three tasks, GO-TO, PICK-UP, and ESCAPE-FROM, and incorporated synonymous expressions to increase linguistic variation.
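A simplified, single-head sketch of this language branch is shown below. The actual model uses multi-head attention and learned weights; the 100- and 128-dimensional sizes follow the text, while the layer widths and random initialisation are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D_TOK, D_LANG = 100, 128   # token and language-embedding sizes from the paper

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative weights; in the real model these are learned during training.
Wq, Wk, Wv = (rng.standard_normal((D_TOK, D_TOK)) * 0.05 for _ in range(3))
W_mlp = rng.standard_normal((D_TOK, D_TOK)) * 0.05
W_out = rng.standard_normal((D_TOK, D_LANG)) * 0.05

def encode_instruction(tokens: np.ndarray) -> np.ndarray:
    """tokens: (n, 100) token embeddings -> (128,) language embedding EL.
    Single-head self-attention plus a residual ReLU MLP, then mean-pooled
    over tokens and projected to the 128-dimensional embedding."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(D_TOK)) @ v        # self-attention
    hidden = np.maximum(0.0, (tokens + attn) @ W_mlp)   # residual + MLP
    return hidden.mean(axis=0) @ W_out                  # pool -> EL (128-d)

EL = encode_instruction(rng.standard_normal((6, D_TOK)))  # 6-token instruction
assert EL.shape == (D_LANG,)
```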
Action-grounded learning yields semantic alignment across vision, language and control, fostering robust generalization
Scientists observed a significant convergence of representations across different learning modalities (language, vision, and action), challenging traditional assumptions about specialized internal representations. The research team trained a transformer-based agent on the BabyAI platform, generating action-grounded language embeddings driven by sensorimotor control requirements.
Action-grounded embeddings aligned strongly with decoder-only language models and with BLIP, while alignment with CLIP and BERT proved significantly weaker, indicating that the shared semantic structures are not universal across all models. Overall, the tests show that linguistic, visual, and action representations converge towards partially shared semantic structures, supporting modality-independent semantic organization.
The BabyAI platform, a Python library, was used to investigate language-conditioned action learning, where agents respond to natural language requests by generating action sequences in a 2D environment. The transformer-based neural network controller processed visual and linguistic components separately, creating embedded representations of the input image and language instruction.
Images were mapped into embedding vectors using a residual convolutional neural network pre-trained on ImageNet, while language requests were transformed into 128-dimensional embedding vectors. This work highlights the potential for cross-domain transfer in embodied systems and suggests that grounded knowledge may not always require direct sensorimotor experience.
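The interface between the two branches can be illustrated with a toy fusion step. The paper fuses the modalities inside a transformer controller; the concatenation, linear head, and 256-dimensional image embedding below are stand-in assumptions used only to show how a six-way action decision can depend on both embeddings.

```python
import numpy as np

rng = np.random.default_rng(1)
D_IMG, D_LANG, N_ACTIONS = 256, 128, 6   # D_IMG is an illustrative size

# Illustrative policy-head weights (learned via behavioral cloning in the paper).
W_head = rng.standard_normal((D_IMG + D_LANG, N_ACTIONS)) * 0.05

def action_logits(img_emb: np.ndarray, lang_emb: np.ndarray) -> np.ndarray:
    """Fuse visual and language embeddings and score the six actions.
    The paper fuses inside a transformer; concatenation followed by a
    linear layer is a minimal stand-in for that step."""
    fused = np.concatenate([img_emb, lang_emb])
    return fused @ W_head                 # one logit per discrete action

logits = action_logits(rng.standard_normal(D_IMG), rng.standard_normal(D_LANG))
greedy_action = int(np.argmax(logits))    # index of the highest-scoring action
assert logits.shape == (N_ACTIONS,)
```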
Action-grounded embeddings mirror structure within large language models, revealing connections between perception and representation
Scientists have demonstrated that representations acquired through diverse learning methods (text prediction, visual-linguistic alignment, and situated action) exhibit considerable structural similarity. This research involved training a transformer agent on the BabyAI platform to perform goal-directed actions based on natural language instructions, generating action-grounded language embeddings.
These embeddings were then compared with those from large language models (LLaMA, Qwen, DeepSeek, BERT) and vision-language models (CLIP, BLIP). The findings reveal strong correlations between action representations and those of decoder-only language models and BLIP, with correlation values of 0.70-0.73. This level of correspondence approaches that observed between language models themselves, while alignment with CLIP and BERT proved significantly weaker.
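The article does not spell out the alignment metric, but a common way to quantify this kind of cross-model correspondence is representational similarity analysis (RSA), which asks whether two embedding spaces place the same items in the same relative geometry. The sketch below, using made-up data, illustrates the idea; the authors' exact measure may differ.

```python
import numpy as np

def rsa_alignment(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Representational similarity analysis between two embedding spaces.
    emb_a, emb_b: (n_items, dim_a) and (n_items, dim_b) embeddings of the
    SAME n items (e.g. the same instructions) from two different models.
    Returns the Pearson correlation between their pairwise cosine-similarity
    structures; 1.0 means identical relational geometry."""
    def sim_vector(e):
        e = e / np.linalg.norm(e, axis=1, keepdims=True)   # unit-normalise
        s = e @ e.T                                        # cosine similarities
        return s[np.triu_indices(len(e), k=1)]             # upper triangle only
    return float(np.corrcoef(sim_vector(emb_a), sim_vector(emb_b))[0, 1])

rng = np.random.default_rng(0)
base = rng.standard_normal((20, 64))                       # 20 made-up items
# A copy of the space preserves relational structure exactly:
assert abs(rsa_alignment(base, base.copy()) - 1.0) < 1e-9
```

Note that RSA is dimension-agnostic: it compares a 128-dimensional action-embedding space with, say, a 4096-dimensional LLM space by correlating their item-by-item similarity patterns rather than the vectors themselves.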
These results suggest a convergence towards shared semantic structures across modalities, supporting the idea of modality-independent semantic organisation and potential for cross-domain transfer in embodied systems. The authors acknowledge that the observed correspondences indicate shared representational organisation, but do not necessarily imply identical semantic grounding or understanding.
Future research should investigate whether these effects extend to more complex environments, richer linguistic inputs, continuous action spaces, and alternative learning paradigms. This work establishes that despite differences in training data and objectives, language, vision, and action representations can converge, offering a potential bridge for knowledge transfer and improved adaptability in multimodal agents.
👉 More information
🗞 Alignment among Language, Vision and Action Representations
🧠 arXiv: https://arxiv.org/abs/2601.22948
