Bibliographic Details
Main Authors: Liu, Chong, Fu, Luxuan, Jia, Yang, Dong, Zhen, Yang, Bisheng
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2601.10535
author Liu, Chong
Fu, Luxuan
Jia, Yang
Dong, Zhen
Yang, Bisheng
contents The automated creation of digital twins and precise asset inventories is a critical task in smart city construction and facility lifecycle management. However, utilizing cost-effective sparse imagery remains challenging due to limited robustness, inaccurate localization, and a lack of fine-grained state understanding. To address these limitations, SVII-3D, a unified framework for holistic asset digitization, is proposed. First, LoRA fine-tuned open-set detection is fused with a spatial-attention matching network to robustly associate observations across sparse views. Second, a geometry-guided refinement mechanism is introduced to resolve structural errors, achieving precise decimeter-level 3D localization. Third, transcending static geometric mapping, a Vision-Language Model agent leveraging multi-modal prompting is incorporated to automatically diagnose fine-grained operational states. Experiments demonstrate that SVII-3D significantly improves identification accuracy and minimizes localization errors. Consequently, this framework offers a scalable, cost-effective solution for high-fidelity infrastructure digitization, effectively bridging the gap between sparse perception and automated intelligent maintenance.
format Preprint
id arxiv_https___arxiv_org_abs_2601_10535
institution arXiv
publishDate 2026
record_format arxiv
title SVII-3D: Advancing Roadside Infrastructure Inventory with Decimeter-level 3D Localization and Comprehension from Sparse Street Imagery
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2601.10535