Bibliographic Details
Main Authors: Hamid, Kaiser; Cui, Can; Akbar, Khandakar Ashrafi; Wang, Ziran; Liang, Nade
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2511.12708
author Hamid, Kaiser
Cui, Can
Akbar, Khandakar Ashrafi
Wang, Ziran
Liang, Nade
contents Understanding not only where drivers look but also why their attention shifts is essential for interpretable human-AI collaboration in autonomous driving. Driver attention is not purely perceptual but semantically structured, so attention shifts can be learned through minimal semantic supervision rather than dense large-scale annotation. We present FSDAM (Few-Shot Driver Attention Modeling), a framework that jointly performs spatial attention prediction and structured explanation generation using 90 annotated examples. Our key insight is to decompose attention into an explicit reasoning representation, comprising scene context, current focus, anticipated next focus, and causal explanation, and to learn next-focus anticipation through minimal-pair supervision. To address the task conflict and large sample requirements of existing models, we introduce a dual-pathway architecture in which separate modules handle spatial prediction and caption generation. In addition, we use a training-only vision-language alignment mechanism that injects semantic priors into spatial learning without increasing inference complexity, mitigating task interference under few-shot training. Despite extreme data scarcity, FSDAM achieves competitive performance in gaze prediction and generates coherent, context-aware structured reasoning for improved interpretability. The model further demonstrates strong zero-shot generalization across multiple driving benchmarks.
format Preprint
id arxiv_https___arxiv_org_abs_2511_12708
institution arXiv
publishDate 2025
record_format arxiv
title FSDAM: Few-Shot Driving Attention Modeling via Vision-Language Coupling
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2511.12708