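Staff View: RailVQA: A Benchmark and Framework for Efficient Interpretable Visual Cognition in Automatic Train Operation :: Library Catalog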

Bibliographic Details
Main Authors:	Zhang, Sen; Li, Runmei; Deng, Shizhuang; Zheng, Zhichao; Zhang, Yuhe; Li, Jiani; Zhang, Kailun; Zhang, Tao; Wu, Wenjun; Wang, Qunbo
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2603.27112

_version_	1866915951422734336
author	Zhang, Sen; Li, Runmei; Deng, Shizhuang; Zheng, Zhichao; Zhang, Yuhe; Li, Jiani; Zhang, Kailun; Zhang, Tao; Wu, Wenjun; Wang, Qunbo
author_facet	Zhang, Sen; Li, Runmei; Deng, Shizhuang; Zheng, Zhichao; Zhang, Yuhe; Li, Jiani; Zhang, Kailun; Zhang, Tao; Wu, Wenjun; Wang, Qunbo
contents	As Automatic Train Operation (ATO) advances toward GoA4 and beyond, it increasingly depends on efficient, reliable cab-view visual perception and decision-oriented inference to ensure safe operation in complex and dynamic railway environments. However, existing approaches focus primarily on basic perception and often generalize poorly to rare yet safety-critical corner cases. They also lack the high-level reasoning and planning capabilities required for operational decision-making. Although recent Large Multi-modal Models (LMMs) show strong generalization and cognitive capabilities, their use in safety-critical ATO is hindered by high computational cost and hallucination risk. Meanwhile, reliable domain-specific benchmarks for systematically evaluating cognitive capabilities are still lacking. To address these gaps, we introduce RailVQA-bench, the first VQA benchmark for cab-view visual cognition in ATO, comprising 20,000 single-frame and 1,168 video-based QA pairs to evaluate cognitive generalization and interpretability in both static and dynamic scenarios. Furthermore, we propose RailVQA-CoM, a collaborative large-small model framework that combines small-model efficiency with large-model cognition via a transparent three-module architecture and adaptive temporal sampling, improving perceptual generalization and enabling more efficient reasoning and planning. Experiments demonstrate that the proposed approach substantially improves performance, enhances interpretability, improves efficiency, and strengthens cross-domain generalization in autonomous driving systems. Code and datasets will be available at https://cybereye-bjtu.github.io/RailVQA.html.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_27112
institution	arXiv
publishDate	2026
record_format	arxiv
title	RailVQA: A Benchmark and Framework for Efficient Interpretable Visual Cognition in Automatic Train Operation
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2603.27112
