Staff View: Cross-Modal Dual-Causal Learning for Long-Term Action Recognition :: Library Catalog

Bibliographic Details
Main Authors:	Shaowu Xu, Xibin Jia, Junyu Gao, Qianmei Sun, Jing Chang, Chao Fan
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2507.06603

_version_	1866909715079888896
author	Xu, Shaowu; Jia, Xibin; Gao, Junyu; Sun, Qianmei; Chang, Jing; Fan, Chao
author_facet	Xu, Shaowu; Jia, Xibin; Gao, Junyu; Sun, Qianmei; Chang, Jing; Fan, Chao
contents	Long-term action recognition (LTAR) is challenging because videos span long temporal ranges, contain complex correlations among atomic actions, and include visual confounders. Although vision-language models (VLMs) have shown promise, they often rely on statistical correlations instead of causal mechanisms. Moreover, existing causality-based methods address modality-specific biases but lack cross-modal causal modeling, which limits their utility in VLM-based LTAR. This paper proposes Cross-Modal Dual-Causal Learning (CMDCL), which introduces a structural causal model to uncover causal relationships between videos and label texts. CMDCL addresses cross-modal biases in text embeddings via textual causal intervention and removes confounders inherent in the visual modality through visual causal intervention guided by the debiased text. These dual-causal interventions enable robust action representations that address the challenges of LTAR. Experimental results on three benchmarks (Charades, Breakfast, and COIN) demonstrate the effectiveness of the proposed model. Our code is available at https://github.com/xushaowu/CMDCL.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_06603
institution	arXiv
publishDate	2025
record_format	arxiv
title	Cross-Modal Dual-Causal Learning for Long-Term Action Recognition
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2507.06603