Bibliographic Details
Main Authors: Chang, Qing; Dai, Wei; Shuai, Zhihao; Yu, Limin; Yue, Yutao
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2503.04078
author Chang, Qing
Dai, Wei
Shuai, Zhihao
Yu, Limin
Yue, Yutao
contents Naturalistic driving action recognition is essential for vehicle cabin monitoring systems. However, the complexity of real-world backgrounds presents significant challenges for this task, and previous approaches have struggled with practical implementation due to their limited ability to observe subtle behavioral differences and effectively learn inter-frame features from video. In this paper, we propose a novel Spatial-Temporal Perception (STP) architecture that emphasizes both temporal information and spatial relationships between key objects, incorporating a causal decoder to perform behavior recognition and temporal action localization. Without requiring multimodal input, STP directly extracts temporal and spatial distance features from RGB video clips. Subsequently, these dual features are jointly encoded by maximizing the expected likelihood across all possible permutations of the factorization order. By integrating temporal and spatial features at different scales, STP can perceive subtle behavioral changes in challenging scenarios. Additionally, we introduce a causal-aware module to explore relationships between video frame features, significantly enhancing detection efficiency and performance. We validate the effectiveness of our approach using two publicly available driver distraction detection benchmarks. The results demonstrate that our framework achieves state-of-the-art performance.
format Preprint
id arxiv_https___arxiv_org_abs_2503_04078
institution arXiv
publishDate 2025
record_format arxiv
title Spatial-Temporal Perception with Causal Inference for Naturalistic Driving Action Recognition
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2503.04078