Accuracy Achieved in Roadside Infrastructure Perception Using Vision-Language Models

The accurate automated perception of urban roadside infrastructure represents a significant challenge for effective smart city management, as existing computer vision models frequently fail to identify crucial details and adhere to specific engineering standards. Luxuan Fu, Chong Liu, and Bisheng Yang, from the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing at Wuhan University, alongside Zhen Dong from Hubei Luojia Laboratory, present a novel framework designed to unlock the potential of Large Vision-Language Models (VLMs) for this complex task. Their research addresses the limitations of current VLMs in interpreting intricate infrastructure states by developing a domain-adapted system that combines efficient fine-tuning with knowledge-grounded reasoning. This innovative approach leverages open-vocabulary learning and a retrieval-augmented generation module to improve both detection and attribute recognition, achieving a marked improvement in performance on a newly compiled dataset of urban roadside scenes and offering a robust solution for intelligent infrastructure monitoring.

Large Vision-Language Models (VLMs) demonstrate proficiency in open-world recognition, yet often exhibit inaccuracies when interpreting complex facility states according to stringent engineering standards. This limitation results in unreliable performance within practical, real-world applications. To overcome this challenge, the researchers propose a domain-adapted framework designed to refine VLMs into specialised agents for intelligent infrastructure analysis. The proposed approach combines a data-efficient fine-tuning strategy with a knowledge-grounded reasoning mechanism, aiming to improve both accuracy and reliability. Specifically, the team leverages open-vocabulary fine-tuning on Grounding DINO to robustly localise diverse assets with minimal supervision.
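To make the detection step concrete, the sketch below shows how open-vocabulary localisation with Grounding DINO typically works: asset categories are supplied as free-form text queries rather than fixed class indices, so new categories can be added without retraining. This is a minimal sketch, not the authors' code; the checkpoint name, image path, and thresholds are illustrative assumptions, and the post-processing API varies slightly across transformers versions.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

# Public Grounding DINO checkpoint; the paper's fine-tuned weights are not released.
model_id = "IDEA-Research/grounding-dino-base"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("roadside_scene.jpg")  # hypothetical input image
# Grounding DINO expects lowercase text queries separated by periods; the
# vocabulary is "open" because you edit this string, not the model.
text = "traffic sign. street light. guardrail. manhole cover."

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Thresholds are illustrative; argument names differ slightly across
# transformers versions (box_threshold vs. threshold).
results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.35,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],  # (height, width)
)
for label, score, box in zip(
    results[0]["labels"], results[0]["scores"], results[0]["boxes"]
):
    print(label, f"{score:.2f}", box.tolist())
```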

Qwen-VL and RAG for Infrastructure Assessment

The research details a new framework for intelligent monitoring of urban roadside infrastructure, focusing on both detection of assets and recognition of their specific attributes. The methodology centres on leveraging a Qwen-VL model, initially pretrained and then adapted using a Low-Rank Adaptation (LoRA) technique for detailed semantic attribute reasoning. To improve reliability and ensure adherence to professional standards, a dual-modality Retrieval-Augmented Generation (RAG) module was implemented, dynamically accessing relevant industry standards and visual examples during the analysis process. This approach aims to move beyond simple object detection towards a more comprehensive understanding of infrastructure condition.
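A minimal sketch of the LoRA adaptation step, using the Hugging Face peft library, may help picture why this is data-efficient. The paper does not disclose its hyperparameters, so the checkpoint id, rank, scaling factor, and target module names below are all assumptions.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a Qwen-VL checkpoint; the paper does not name the exact variant,
# so this model id is an assumption.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", trust_remote_code=True
)

# LoRA freezes the base weights and trains small low-rank update matrices
# injected into the attention projections, which is what keeps the
# adaptation data-efficient.
config = LoraConfig(
    r=16,                       # rank of the low-rank updates (assumed value)
    lora_alpha=32,              # scaling factor (assumed value)
    lora_dropout=0.05,
    target_modules=["c_attn"],  # Qwen attention projection; an assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```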

The experimental procedure involved evaluating the framework on a newly created dataset of urban roadside scenes, designed to represent the complexity of real-world infrastructure. The system was tested on its ability both to locate infrastructure elements and to accurately identify their attributes, such as specific damage types or condition states. Performance was measured using mean Average Precision (mAP) for detection, achieving a score of 58.9, and attribute recognition accuracy, which reached 95.5%. These results suggest the framework offers a robust solution for automated infrastructure assessment.
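For readers reproducing the evaluation, detection mAP of this kind is conventionally computed COCO-style, averaging precision over IoU thresholds. The example below uses torchmetrics with made-up boxes; it is not necessarily the authors' evaluation code.

```python
import torch
from torchmetrics.detection import MeanAveragePrecision

# One image's worth of made-up predictions and ground truth, in the
# [x_min, y_min, x_max, y_max] format torchmetrics expects.
preds = [{
    "boxes": torch.tensor([[50.0, 40.0, 120.0, 200.0]]),
    "scores": torch.tensor([0.91]),
    "labels": torch.tensor([0]),  # e.g. 0 = "traffic sign"
}]
targets = [{
    "boxes": torch.tensor([[52.0, 38.0, 118.0, 198.0]]),
    "labels": torch.tensor([0]),
}]

metric = MeanAveragePrecision()   # COCO-style: averages over IoU 0.50:0.95
metric.update(preds, targets)
print(metric.compute()["map"])
```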

A key challenge addressed by the research is the limitation of traditional computer vision methods, which require extensive labelled data and struggle with fine-grained attribute recognition. The team proposes that vision-language models (VLMs) and large language models (LLMs) offer a potential solution by combining visual recognition with natural language reasoning. However, existing models often produce unstructured outputs unsuitable for engineering applications and lack the constraints needed for precise, structured state recognition. The framework overcomes these limitations by harnessing the power of large models while grounding them in domain-specific knowledge and schema-level constraints. The RAG module plays a crucial role here, providing the model with access to authoritative information and visual references, thereby mitigating potential inaccuracies and ensuring compliance with industry standards. Concretely, the team employed open-vocabulary fine-tuning on Grounding DINO to accurately localise diverse roadside assets with minimal supervision, followed by LoRA-based adaptation on Qwen-VL for in-depth semantic attribute reasoning.
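The "schema-level constraints" the authors describe can be pictured as validating the model's output against a fixed attribute schema. The sketch below uses pydantic for illustration; every field name and value set in it is a hypothetical stand-in, not the paper's actual schema.

```python
from typing import Literal
from pydantic import BaseModel

# A stand-in attribute schema; all field names and value sets here are
# hypothetical examples, not the paper's actual schema.
class AssetState(BaseModel):
    asset_type: Literal["traffic sign", "street light", "guardrail"]
    condition: Literal["intact", "faded", "tilted", "damaged"]
    occluded: bool

# Validating the VLM's answer against the schema rejects free-form or
# out-of-vocabulary outputs, yielding machine-readable records.
raw = '{"asset_type": "traffic sign", "condition": "faded", "occluded": false}'
state = AssetState.model_validate_json(raw)  # raises ValidationError if off-schema
print(state.condition)
```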

This two-stage pipeline, detection followed by attribute reasoning, effectively bridges perception and structured reasoning, allowing the system to generate outputs that are both interpretable and machine-readable. The fine-tuned Qwen-VL model also enables multimodal dialogue, letting users directly query specific attributes, states, or conditions of roadside infrastructure from images. Further enhancing the system, the team introduced a dual-modality Retrieval-Augmented Generation (RAG) module that dynamically integrates authoritative industry standards and visual exemplars during inference. This mitigates potential hallucinations and ensures compliance with professional engineering standards.
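The text branch of such a dual-modality RAG module can be sketched as embedding clauses from industry standards and injecting the best match into the prompt at inference time; the visual branch would work analogously with an image encoder over exemplar photos. The encoder choice and clause texts below are illustrative assumptions, not the authors' actual knowledge base.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Embed the standards corpus once, then retrieve the closest clause per
# query at inference time. Clause texts are invented for illustration.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
standards = [
    "Warning signs shall retain at least 80% of original retroreflectivity.",
    "Guardrail posts shall not deviate more than 5 cm from vertical.",
    "Luminaires shall be replaced when light output falls below 70%.",
]
corpus = encoder.encode(standards, normalize_embeddings=True)

query = encoder.encode(["faded yellow warning sign"], normalize_embeddings=True)
scores = corpus @ query[0]  # cosine similarity, since embeddings are normalised
best_clause = standards[int(np.argmax(scores))]

# The retrieved clause is prepended to the VLM prompt to ground its answer.
prompt = f"Reference standard: {best_clause}\nAssess the sign in the image."
print(prompt)
```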

Attribute-guided fine-tuning of Qwen-VL with Low-Rank Adaptation (LoRA) improves attribute reasoning and domain adaptation by formulating the task as a supervised visual instruction tuning problem. The system transforms annotated data into instruction-following pairs, structured as image, instruction, and output, so that the model learns the correspondence between visual features and the specific attribute schema for urban infrastructure, as sketched below. The team evaluated three fine-tuning strategies: closed-set, open-set continued pre-training, and open-vocabulary. Only open-vocabulary fine-tuning maintained prior knowledge while enabling scalable detection of unseen objects, which is crucial for real-world systems where infrastructure is constantly evolving. The unified system links visual grounding, attribute reasoning, and human-machine interaction, supporting both automated perception and interactive querying.
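A minimal sketch of converting one annotated asset into an (image, instruction, output) training pair of the kind described above; the annotation fields and instruction wording are assumptions.

```python
import json

# One annotated asset converted into an (image, instruction, output) pair;
# field names and instruction wording are illustrative assumptions.
annotation = {
    "image": "scene_0412.jpg",
    "category": "traffic sign",
    "attributes": {"condition": "faded", "occluded": False},
}

pair = {
    "image": annotation["image"],
    "instruction": (
        f"Describe the state of the {annotation['category']} in this image "
        "using exactly the fields: condition, occluded."
    ),
    # A JSON string as the supervised target keeps outputs machine-readable.
    "output": json.dumps(annotation["attributes"]),
}
print(pair["instruction"])
print(pair["output"])
```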

Roadside Infrastructure Analysis via Domain Adaptation

This research presents a novel domain-adapted framework designed to enhance the performance of Large Vision-Language Models (VLMs) in analysing urban roadside infrastructure. By combining open-vocabulary fine-tuning with a knowledge-grounded reasoning mechanism, the authors have successfully transformed a general-purpose VLM into a specialised agent capable of detailed infrastructure assessment. The framework utilises both textual and visual retrieval-augmented generation to improve accuracy and ensure compliance with industry standards. Evaluations conducted on a newly compiled dataset of urban scenes demonstrate significant improvements in both object detection and attribute recognition.

Specifically, the adapted model achieved a mean Average Precision (mAP) of 58.9 for detection and 95.5% accuracy in attribute recognition, surpassing the performance of existing general-purpose VLMs. This suggests a robust solution for automated infrastructure monitoring and smart city management applications. The authors acknowledge that fine-tuning on new categories may result in a slight reduction in performance on previously known categories, representing a typical trade-off in machine learning. They also note that variations in data acquisition conditions, such as weather and lighting, can impact performance, particularly when comparing results across different cities. Future work could focus on mitigating these effects and further refining the dual-modality retrieval mechanism to enhance reasoning consistency and broaden the applicability of the framework to diverse environmental conditions.

👉 More information
🗞 Unleashing the Capabilities of Large Vision-Language Models for Intelligent Perception of Roadside Infrastructure
🧠 ArXiv: https://arxiv.org/abs/2601.10551

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
