Staff View: From Frames to Sequences: Temporally Consistent Human-Centric Dense Prediction :: Library Catalog

Bibliographic Details
Main Authors:	Miao, Xingyu; Dong, Junting; Zhao, Qin; Yang, Yuhang; Chen, Junhao; Long, Yang
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2602.01661

_version_	1866914303191285760
author	Miao, Xingyu; Dong, Junting; Zhao, Qin; Yang, Yuhang; Chen, Junhao; Long, Yang
author_facet	Miao, Xingyu; Dong, Junting; Zhao, Qin; Yang, Yuhang; Chen, Junhao; Long, Yang
contents	In this work, we focus on the challenge of temporally consistent human-centric dense prediction across video sequences. Existing models achieve strong per-frame accuracy but often flicker under motion, occlusion, and lighting changes, and they rarely have paired human video supervision for multiple dense tasks. We address this gap with a scalable synthetic data pipeline that generates photorealistic human frames and motion-aligned sequences with pixel-accurate depth, normals, and masks. Unlike prior static synthetic data pipelines, our pipeline provides both frame-level labels for spatial learning and sequence-level supervision for temporal learning. Building on this, we train a unified ViT-based dense predictor that (i) injects an explicit human geometric prior via CSE embeddings and (ii) improves geometry-feature reliability with a lightweight channel reweighting module after feature fusion. Our two-stage training strategy, combining static pretraining with dynamic sequence supervision, enables the model first to acquire robust spatial representations and then refine temporal consistency across motion-aligned sequences. Extensive experiments show that we achieve state-of-the-art performance on THuman2.1 and Hi4D and generalize effectively to in-the-wild videos.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_01661
institution	arXiv
publishDate	2026
record_format	arxiv
title	From Frames to Sequences: Temporally Consistent Human-Centric Dense Prediction
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2602.01661

Similar Items