Staff View: FSDAM: Few-Shot Driving Attention Modeling via Vision-Language Coupling :: Library Catalog

Bibliographic Details
Main Authors:	Hamid, Kaiser; Cui, Can; Akbar, Khandakar Ashrafi; Wang, Ziran; Liang, Nade
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2511.12708

_version_	1866910051251257344
author	Hamid, Kaiser; Cui, Can; Akbar, Khandakar Ashrafi; Wang, Ziran; Liang, Nade
author_facet	Hamid, Kaiser; Cui, Can; Akbar, Khandakar Ashrafi; Wang, Ziran; Liang, Nade
contents	Understanding not only where drivers look but also why their attention shifts is essential for interpretable human-AI collaboration in autonomous driving. Driver attention is not purely perceptual but semantically structured. Thus, attention shifts can be learned through minimal semantic supervision rather than dense large-scale annotation. We present FSDAM (Few-Shot Driver Attention Modeling), a framework that achieves joint spatial attention prediction and structured explanation generation using 90 annotated examples. Our key insight is to decompose attention into an explicit reasoning representation, including scene context, current focus, anticipated next focus, and causal explanation, and to learn next-focus anticipation through minimal-pair supervision. To address task conflict and large sample requirements of existing models, and to mitigate task interference under limited data, we introduce a novel dual-pathway architecture in which separate modules handle spatial prediction and caption generation. In addition, we use a training-only vision-language alignment mechanism that injects semantic priors into spatial learning without increasing inference complexity, mitigating task interference under few-shot training. Despite extreme data scarcity, FSDAM achieves competitive performance in gaze prediction, and generates coherent, context-aware structural reasoning for improved interpretability. The model further demonstrates strong zero-shot generalization across multiple driving benchmarks.
format	Preprint
id	arxiv_https___arxiv_org_abs_2511_12708
institution	arXiv
publishDate	2025
record_format	arxiv
title	FSDAM: Few-Shot Driving Attention Modeling via Vision-Language Coupling
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2511.12708

Similar Items