| Main Authors: | Wang, Qisen; Zhao, Yifan; Shen, Peisen; Li, Jialu; Li, Jia |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2512.01481 |
| _version_ | 1866908683889278976 |
|---|---|
| author | Wang, Qisen; Zhao, Yifan; Shen, Peisen; Li, Jialu; Li, Jia |
| author_facet | Wang, Qisen; Zhao, Yifan; Shen, Peisen; Li, Jialu; Li, Jia |
| contents | Although prevailing camera-controlled video generation models can produce cinematic results, lifting them directly to the generation of 3D-consistent, high-fidelity, time-synchronized multi-view videos, a pivotal capability for taming 4D worlds, remains challenging. Some works resort to data augmentation or test-time optimization, but these strategies are constrained by limited model generalization and scalability issues. To this end, we propose ChronosObserver, a training-free method comprising a World State Hyperspace, which represents the spatiotemporal constraints of a 4D world scene, and Hyperspace Guided Sampling, which uses the hyperspace to synchronize the diffusion sampling trajectories of multiple views. Experimental results demonstrate that our method generates high-fidelity, 3D-consistent, time-synchronized multi-view videos without training or fine-tuning the diffusion models. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2512_01481 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | ChronosObserver: Taming 4D World with Hyperspace Diffusion Sampling |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2512.01481 |