Staff View: Joint Optimization for 4D Human-Scene Reconstruction in the Wild :: Library Catalog

Bibliographic Details
Main Authors:	Liu, Zhizheng; Lin, Joe; Wu, Wayne; Zhou, Bolei
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2501.02158

_version_	1866911470874263552
author	Liu, Zhizheng; Lin, Joe; Wu, Wayne; Zhou, Bolei
author_facet	Liu, Zhizheng; Lin, Joe; Wu, Wayne; Zhou, Bolei
contents	Reconstructing human motion and its surrounding environment is crucial for understanding human-scene interaction and predicting human movements in the scene. While much progress has been made in capturing human-scene interaction in constrained environments, those prior methods can hardly reconstruct the natural and diverse human motion and scene context from web videos. In this work, we propose JOSH, a novel optimization-based method for 4D human-scene reconstruction in the wild from monocular videos. JOSH uses techniques from both dense scene reconstruction and human mesh recovery as initialization, and then leverages human-scene contact constraints to jointly optimize the scene, the camera poses, and the human motion. Experimental results show that JOSH achieves better results on both global human motion estimation and dense scene reconstruction through joint optimization of scene geometry and human motion. We further design a more efficient model, JOSH3R, and directly train it with pseudo-labels from web videos. JOSH3R outperforms other optimization-free methods while training only on labels predicted by JOSH, further demonstrating its accuracy and generalization ability.
format	Preprint
id	arxiv_https___arxiv_org_abs_2501_02158
institution	arXiv
publishDate	2025
record_format	arxiv
title	Joint Optimization for 4D Human-Scene Reconstruction in the Wild
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2501.02158

Similar Items