Bibliographic Details
Main Authors: Jiang, Siyang, Yuan, Mu, Ji, Xiang, Yang, Bufang, Liu, Zeyu, Xu, Lilin, Li, Yang, He, Yuting, Dong, Liran, Lu, Wenrui, Yan, Zhenyu, Jiang, Xiaofan, Gao, Wei, Chen, Hongkai, Xing, Guoliang
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2512.07136
author Jiang, Siyang
Yuan, Mu
Ji, Xiang
Yang, Bufang
Liu, Zeyu
Xu, Lilin
Li, Yang
He, Yuting
Dong, Liran
Lu, Wenrui
Yan, Zhenyu
Jiang, Xiaofan
Gao, Wei
Chen, Hongkai
Xing, Guoliang
contents Multimodal human action recognition (HAR) leverages complementary sensors for activity classification. Beyond recognition, recent advances in large language models (LLMs) enable detailed descriptions and causal reasoning, motivating new tasks: human action understanding (HAU) and human action reasoning (HARn). However, most LLMs, especially large vision-language models (LVLMs), struggle with non-RGB modalities such as depth, IMU, and mmWave due to the lack of large-scale data-caption resources. Existing HAR datasets mainly provide coarse data-label annotations, which are insufficient to capture fine-grained action dynamics needed for HAU and HARn. We consider two ground-truth pair types: (1) data-label (discrete category) and (2) data-caption (textual description). Captions generated naively from labels often lack logical and spatiotemporal consistency. We introduce CUHK-X, a large-scale multimodal dataset and benchmark suite for HAR, HAU, and HARn. CUHK-X contains 58,445 samples covering 40 actions performed by 30 participants across two indoor environments. To improve caption consistency, we propose a prompt-based scene creation method that leverages LLMs to generate logically connected activity sequences, followed by human validation. CUHK-X includes three benchmarks with six evaluation tasks. Experiments report average accuracies of 76.52% (HAR), 40.76% (HAU), and 70.25% (HARn). CUHK-X aims to enable the community to apply and develop data-intensive learning methods for robust, multimodal human activity analysis. Project page and code: https://openaiotlab.github.io/CUHK-X/ and https://github.com/openaiotlab/CUHK-X.
format Preprint
id arxiv_https___arxiv_org_abs_2512_07136
institution arXiv
publishDate 2025
record_format arxiv
title A Large-Scale Multimodal Dataset and Benchmarks for Human Activity Scene Understanding and Reasoning
topic Computer Vision and Pattern Recognition
Artificial Intelligence
url https://arxiv.org/abs/2512.07136