Staff View: EMO-X: Efficient Multi-Person Pose and Shape Estimation in One-Stage :: Library Catalog

Bibliographic Details
Main Authors:	Jian, Haohang; Zhang, Jinlu; Wu, Junyi; Tu, Zhigang
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2504.08718

_version_	1866918220284297216
author	Jian, Haohang; Zhang, Jinlu; Wu, Junyi; Tu, Zhigang
author_facet	Jian, Haohang; Zhang, Jinlu; Wu, Junyi; Tu, Zhigang
contents	Expressive Human Pose and Shape Estimation (EHPS) aims to jointly estimate human pose, hand gesture, and facial expression from monocular images. Existing methods predominantly rely on Transformer-based architectures, which suffer from quadratic complexity in self-attention, leading to substantial computational overhead, especially in multi-person scenarios. Recently, Mamba has emerged as a promising alternative to Transformers due to its efficient global modeling capability. However, it remains limited in capturing fine-grained local dependencies, which are essential for precise EHPS. To address these issues, we propose EMO-X, the Efficient Multi-person One-stage model for multi-person EHPS. Specifically, we explore a Scan-based Global-Local Decoder (SGLD) that integrates global context with skeleton-aware local features to iteratively enhance human tokens. Our EMO-X leverages the superior global modeling capability of Mamba and designs a local bidirectional scan mechanism for skeleton-aware local refinement. Comprehensive experiments demonstrate that EMO-X strikes an excellent balance between efficiency and accuracy. Notably, it achieves a significant reduction in computational complexity, requiring 69.8% less inference time compared to state-of-the-art (SOTA) methods, while outperforming most of them in accuracy.
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_08718
institution	arXiv
publishDate	2025
record_format	arxiv
title	EMO-X: Efficient Multi-Person Pose and Shape Estimation in One-Stage
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2504.08718
