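Staff View: Unleashing the Capabilities of Large Vision-Language Models for Intelligent Perception of Roadside Infrastructure :: Library Catalog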

Bibliographic Details
Main Authors:	Fu, Luxuan; Liu, Chong; Yang, Bisheng; Dong, Zhen
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2601.10551

_version_	1866911377692557312
author	Fu, Luxuan; Liu, Chong; Yang, Bisheng; Dong, Zhen
author_facet	Fu, Luxuan; Liu, Chong; Yang, Bisheng; Dong, Zhen
contents	Automated perception of urban roadside infrastructure is crucial for smart city management, yet general-purpose models often struggle to capture the necessary fine-grained attributes and domain rules. While Large Vision Language Models (VLMs) excel at open-world recognition, they often struggle to accurately interpret complex facility states in compliance with engineering standards, leading to unreliable performance in real-world applications. To address this, we propose a domain-adapted framework that transforms VLMs into specialized agents for intelligent infrastructure analysis. Our approach integrates a data-efficient fine-tuning strategy with a knowledge-grounded reasoning mechanism. Specifically, we leverage open-vocabulary fine-tuning on Grounding DINO to robustly localize diverse assets with minimal supervision, followed by LoRA-based adaptation on Qwen-VL for deep semantic attribute reasoning. To mitigate hallucinations and enforce professional compliance, we introduce a dual-modality Retrieval-Augmented Generation (RAG) module that dynamically retrieves authoritative industry standards and visual exemplars during inference. Evaluated on a comprehensive new dataset of urban roadside scenes, our framework achieves a detection performance of 58.9 mAP and an attribute recognition accuracy of 95.5%, demonstrating a robust solution for intelligent infrastructure monitoring.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_10551
institution	arXiv
publishDate	2026
record_format	arxiv
title	Unleashing the Capabilities of Large Vision-Language Models for Intelligent Perception of Roadside Infrastructure
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2601.10551