Current vision-language models (VLMs) often struggle to fully utilise visual information because of a narrow connection between the vision encoder and the language model. Cheng Chen, Yuyu Guo, and Pengpeng Zeng, from Ant Group and Tongji University, alongside Jingkuan Song, Peng Di, and Hang Yu, address this issue with a new framework called Cross-Layer Injection (CLI). This research introduces a dynamic system that creates a more robust link between visual and linguistic representations, allowing large language models to better interpret complex images. By granting language models access to the complete visual hierarchy, CLI significantly improves their ability to integrate fine-grained detail with broader context, ultimately enhancing multimodal understanding. Extensive testing across multiple benchmarks demonstrates CLI’s effectiveness and establishes it as a scalable solution for deeper vision-language fusion.
Dynamic Visual Hierarchy Access via Cross-Layer Injection
Scientists demonstrate a significant advancement in vision-language models (VLMs) by addressing a critical limitation in how visual information is processed. Current VLMs often suffer from a bottleneck created by a simplistic connection between vision encoders and large language models (LLMs), restricting the LLM’s ability to fully utilise hierarchical visual knowledge. This research introduces Cross-Layer Injection (CLI), a novel framework designed to forge a dynamic, many-to-many connection between these two modalities, enabling more comprehensive multimodal understanding. The team achieved this by moving beyond static architectures that limit access to visual data, instead granting LLMs on-demand access to the full visual hierarchy.
The study reveals that existing VLMs employ a single, one-to-one connection, linking only the final output of the vision encoder to the LLM, which fundamentally limits accurate integration of visual details with broader semantic understanding. To overcome this, the researchers developed CLI, comprising two key components: an Adaptive Multi-Projection (AMP) module and an Adaptive Gating Fusion (AGF) mechanism. AMP efficiently harmonises features extracted from various layers of the vision encoder, while AGF enables the LLM to selectively inject the most pertinent visual information based on its real-time decoding context, creating a dynamic bridge between vision and language processing. Experiments show that this allows the LLM to dynamically query the entire visual hierarchy, accessing both fine-grained details and high-level semantic concepts as needed.
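To make the described pipeline concrete, here is a minimal PyTorch sketch of how a cross-layer injection module of this kind could be wired, assuming per-layer linear projections for AMP and a softmax gate over vision layers for AGF. The class name CrossLayerInjection, the tensor shapes, and the pooled-context gating are illustrative assumptions, not the authors’ implementation.

```python
# Hypothetical sketch of Cross-Layer Injection (CLI): names and shapes are
# assumptions for illustration, not the authors' released implementation.
import torch
import torch.nn as nn

class CrossLayerInjection(nn.Module):
    def __init__(self, vis_dim: int, llm_dim: int, num_vis_layers: int):
        super().__init__()
        # AMP (Adaptive Multi-Projection): one projection per selected vision
        # layer, mapping each layer's features into the LLM embedding space.
        self.projections = nn.ModuleList(
            [nn.Linear(vis_dim, llm_dim) for _ in range(num_vis_layers)]
        )
        # AGF (Adaptive Gating Fusion): scores each vision layer against the
        # LLM's current decoding context and softmax-normalises the scores.
        self.gate = nn.Linear(llm_dim, num_vis_layers)

    def forward(self, vision_feats: list[torch.Tensor], hidden: torch.Tensor):
        # vision_feats: list of [B, N, vis_dim] tensors, one per vision layer
        # hidden:       [B, T, llm_dim] hidden states at the injection point
        projected = torch.stack(
            [proj(f) for proj, f in zip(self.projections, vision_feats)], dim=1
        )  # [B, L, N, llm_dim]
        # Pool the decoding context and turn it into per-layer gate weights.
        context = hidden.mean(dim=1)                       # [B, llm_dim]
        weights = torch.softmax(self.gate(context), dim=-1)  # [B, L]
        # Weighted sum over vision layers -> fused visual tokens for injection.
        fused = (weights[:, :, None, None] * projected).sum(dim=1)  # [B, N, llm_dim]
        return fused, weights
```

In this reading, the gate weights are recomputed on every forward pass, so the mix of vision layers can shift with the text being generated, which matches the paper’s description of on-demand access to the visual hierarchy.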
These results establish CLI as a scalable paradigm for deep vision-language fusion, validated through integration with both the LLaVA-OneVision and LLaVA-1.5 architectures. Extensive experiments across 18 diverse benchmarks demonstrate significant performance improvements, highlighting the framework’s versatility and effectiveness. Specifically, when implemented with LLaVA-OV-7B, the research team observed gains of 6.5, 3.3, and 4.7 points on the LLaVA-in-the-Wild, MME, and OCR-Bench benchmarks respectively, indicating CLI’s ability to unlock deeper multimodal understanding. The work opens new avenues for applications requiring nuanced visual reasoning, such as advanced document analysis, complex chart interpretation, and general visual perception tasks. Visualisation of the gating weights within the CLI framework confirms the efficacy of these “criss-cross connections”, revealing that deeper LLM layers dynamically query features from across the entire spectrum of vision encoder layers. This dynamic interaction moves beyond simple one-to-one mappings, enabling the LLM to integrate local details with global semantics for more accurate and coherent reasoning, ultimately bringing VLMs closer to human-like cognitive abilities.
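The gating-weight analysis described here is straightforward to reproduce once per-layer gate weights are logged. The short script below shows how such a criss-cross pattern could be plotted as a heat map; the layer counts are assumed and random placeholder data stands in for real logged weights.

```python
# Hypothetical visualisation of per-layer gate weights: rows = LLM layers,
# columns = vision-encoder layers. Random data stands in for logged weights.
import numpy as np
import matplotlib.pyplot as plt

num_llm_layers, num_vis_layers = 32, 24          # assumed 7B LLM / ViT-L depths
gate_weights = np.random.dirichlet(np.ones(num_vis_layers), size=num_llm_layers)

fig, ax = plt.subplots(figsize=(6, 6))
im = ax.imshow(gate_weights, aspect="auto", cmap="viridis")
ax.set_xlabel("Vision encoder layer")
ax.set_ylabel("LLM layer")
ax.set_title("CLI gating weights (placeholder data)")
fig.colorbar(im, ax=ax, label="gate weight")
plt.savefig("cli_gate_weights.png", dpi=150)
```

Rows correspond to LLM layers and columns to vision-encoder layers, so weight spread broadly across a row is the signature of a deep LLM layer querying the full spectrum of vision layers that the authors describe.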
Cross-Layer Injection for Enhanced Vision-Language Understanding
Researchers addressed limitations in vision-language models (VLMs) stemming from a constricted connection between vision encoders and large language models (LLMs). The study identified that existing architectures rely on a single, final vision encoder layer, creating a bottleneck and hindering comprehensive visual understanding. To overcome this, the team engineered Cross-Layer Injection (CLI), a framework designed to establish a dynamic, many-to-many connection between visual and linguistic features. This innovative approach allows LLMs to access and integrate information from multiple layers within the vision encoder, unlocking a more nuanced perception of visual data.
The core of CLI comprises two synergistic modules: Adaptive Multi-Projection (AMP) and Adaptive Gating Fusion (AGF). Scientists developed AMP using Low-Rank Adaptation (LoRA) to efficiently harmonize features extracted from diverse vision layers into a unified semantic space. This process ensures that visual information, regardless of its origin within the encoder, is represented in a consistent and comparable format. Crucially, the AGF mechanism acts as a context-sensitive controller, enabling the LLM to selectively integrate the most pertinent visual information at each decoding step. This adaptive gating allows the model to prioritize fine-grained textures from shallow layers or abstract concepts from deeper layers, based on the specific reasoning demands of the task.
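The AMP and AGF descriptions above suggest one natural parameterisation: a shared projector with a small low-rank (LoRA-style) adapter per vision layer, and a gate that scores the vision layers for each decoding token. The sketch below follows that reading; the class names, the rank of 16, and the per-token gating granularity are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical LoRA-style Adaptive Multi-Projection (AMP) head: a shared base
# projector plus a cheap low-rank adapter per vision layer. Names, rank, and
# wiring are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class LoRAProjector(nn.Module):
    def __init__(self, vis_dim: int, llm_dim: int, num_vis_layers: int, rank: int = 16):
        super().__init__()
        self.base = nn.Linear(vis_dim, llm_dim)  # shared projection
        self.lora_a = nn.Parameter(torch.randn(num_vis_layers, vis_dim, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(num_vis_layers, rank, llm_dim))

    def forward(self, feats: torch.Tensor, layer_idx: int) -> torch.Tensor:
        # feats: [B, N, vis_dim] features from one vision-encoder layer.
        delta = feats @ self.lora_a[layer_idx] @ self.lora_b[layer_idx]
        return self.base(feats) + delta          # harmonised into the LLM space


class TokenGate(nn.Module):
    """Per-token AGF-style gate: each decoding position weights the vision layers."""
    def __init__(self, llm_dim: int, num_vis_layers: int):
        super().__init__()
        self.score = nn.Linear(llm_dim, num_vis_layers)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: [B, T, llm_dim] -> weights: [B, T, num_vis_layers]
        return torch.softmax(self.score(hidden), dim=-1)
```

Sharing the base projection keeps the added parameter count roughly linear in the LoRA rank rather than in the full vis_dim × llm_dim product, which is one plausible reason low-rank adapters are an efficient way to harmonise many vision layers.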
To validate CLI’s effectiveness, the research team integrated it into two distinct VLM architectures, LLaVA-OneVision and LLaVA-1.5. Experiments employed a comprehensive suite of 18 diverse benchmarks, encompassing document analysis, chart reasoning, and general visual perception tasks. Results demonstrated significant performance improvements across all benchmarks, with LLaVA-OV-7B achieving gains of 6.5, 3.3, and 4.7 points on the LLaVA-in-the-Wild, MME, and OCR-Bench benchmarks, respectively. This work establishes CLI as a scalable paradigm, granting LLMs on-demand access to the full visual hierarchy and unlocking deeper multimodal understanding.
The study pioneered a method for dynamic visual feature integration, moving beyond static, one-to-one connections. By empowering the LLM to actively query the visual hierarchy, CLI enables a more flexible and context-aware fusion of visual and linguistic information. This approach achieves a substantial leap in performance, establishing a new state-of-the-art on several key visual reasoning tasks and demonstrating the potential for more sophisticated and accurate VLMs.
Cross-Layer Injection Boosts Vision-Language Performance
Scientists achieved substantial performance gains in vision-language models (VLMs) through the implementation of Cross-Layer Injection (CLI), a novel framework designed to improve multimodal understanding. The research team integrated CLI into both the LLaVA-OneVision and LLaVA-1.5 architectures, demonstrating its versatility and broad applicability across diverse model configurations. Experiments revealed that CLI consistently outperformed strong baseline models and competing fusion strategies on a comprehensive suite of 18 benchmarks, establishing a new paradigm for scalable multimodal learning. The core of the improvement lies in CLI’s ability to forge a dynamic, many-to-many connection between the vision encoder and the large language model (LLM), granting the LLM on-demand access to the full visual hierarchy.
Quantitative results show that CLI consistently improves on the Baseline Projector at both the 0.5-billion and 7-billion parameter model scales. Specifically, on the AI2D benchmark, LLaVA-OneVision with CLI achieved a score of 57.2, a 0.7-point increase over the baseline, while on ChartQA the model scored 64.7, a 0.2-point improvement. These gains, though seemingly incremental, indicate a consistent and reliable enhancement in visual understanding. Further measurements confirm CLI’s advantage over alternative deep-fusion strategies: compared with DeepStack, which injects final-layer visual tokens, CLI delivered substantial improvements, whereas DeepStack often degraded performance, falling below the baseline by as much as 50.7 points on the combined benchmark score.
Shallow-Layer Injection, a statically wired method, provided only marginal gains, highlighting the benefits of CLI’s dynamic and adaptive approach. The team recorded a peak score of 686.7 on the combined benchmark using InternVL-2-26B, demonstrating the potential for scaling these improvements to larger models. The experiments indicate that CLI unlocks a deeper level of multimodal understanding by allowing LLMs to iteratively refine their visual perception and “re-examine” visual evidence at varying granularities throughout the generation process. The framework excels in tasks requiring fine-grained perception, such as chart and document understanding (AI2D, ChartQA, DocVQA, InfoVQA), as well as complex reasoning and knowledge integration (MME, MMVet, MathVerse, ScienceQA). CLI delivers particularly robust gains on challenging benchmarks such as LLaVA-in-the-Wild, signalling a substantial advance in real-world understanding and visual chat capabilities.
👉 More information
🗞 From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion
🧠 ArXiv: https://arxiv.org/abs/2601.10710
