| Main Authors: | Xu, Tianxing; Wang, Zixuan; Wang, Guangyuan; Hu, Li; Zhang, Zhongyi; Zhang, Peng; Zhang, Bang; Zhang, Song-Hai |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2602.22960 |
| _version_ | 1866911471530672128 |
|---|---|
| author | Xu, Tianxing; Wang, Zixuan; Wang, Guangyuan; Hu, Li; Zhang, Zhongyi; Zhang, Peng; Zhang, Bang; Zhang, Song-Hai |
| author_facet | Xu, Tianxing; Wang, Zixuan; Wang, Guangyuan; Hu, Li; Zhang, Zhongyi; Zhang, Peng; Zhang, Bang; Zhang, Song-Hai |
| contents | World models based on video generation demonstrate remarkable potential for simulating interactive environments but face persistent difficulties in two key areas: maintaining long-term content consistency when scenes are revisited and enabling precise camera control from user-provided inputs. Existing methods based on explicit 3D reconstruction often compromise flexibility in unbounded scenarios and fine-grained structures. Alternative methods rely directly on previously generated frames without establishing explicit spatial correspondence, thereby constraining controllability and consistency. To address these limitations, we present UCM, a novel framework that unifies long-term memory and precise camera control via a time-aware positional encoding warping mechanism. To reduce computational overhead, we design an efficient dual-stream diffusion transformer for high-fidelity generation. Moreover, we introduce a scalable data curation strategy utilizing point-cloud-based rendering to simulate scene revisiting, facilitating training on over 500K monocular videos. Extensive experiments on real-world and synthetic benchmarks demonstrate that UCM significantly outperforms state-of-the-art methods in long-term scene consistency, while also achieving precise camera controllability in high-fidelity video generation. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2602_22960 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2602.22960 |