Bibliographic Details
Main Authors: Fu, Luxuan, Liu, Chong, Yang, Bisheng, Dong, Zhen
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2601.10551
_version_ 1866911377692557312
author Fu, Luxuan
Liu, Chong
Yang, Bisheng
Dong, Zhen
contents Automated perception of urban roadside infrastructure is crucial for smart city management, yet general-purpose models often struggle to capture the necessary fine-grained attributes and domain rules. While Large Vision-Language Models (VLMs) excel at open-world recognition, they have difficulty interpreting complex facility states accurately and in compliance with engineering standards, leading to unreliable performance in real-world applications. To address this, we propose a domain-adapted framework that transforms VLMs into specialized agents for intelligent infrastructure analysis. Our approach integrates a data-efficient fine-tuning strategy with a knowledge-grounded reasoning mechanism. Specifically, we leverage open-vocabulary fine-tuning on Grounding DINO to robustly localize diverse assets with minimal supervision, followed by LoRA-based adaptation on Qwen-VL for deep semantic attribute reasoning. To mitigate hallucinations and enforce professional compliance, we introduce a dual-modality Retrieval-Augmented Generation (RAG) module that dynamically retrieves authoritative industry standards and visual exemplars during inference. Evaluated on a comprehensive new dataset of urban roadside scenes, our framework achieves a detection performance of 58.9 mAP and an attribute recognition accuracy of 95.5%, demonstrating a robust solution for intelligent infrastructure monitoring.
format Preprint
id arxiv_https___arxiv_org_abs_2601_10551
institution arXiv
publishDate 2026
record_format arxiv
title Unleashing the Capabilities of Large Vision-Language Models for Intelligent Perception of Roadside Infrastructure
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2601.10551