
Bibliographic Details
Main Authors:	Zhao, Haoyu; Zhang, Zihao; Gu, Jiaxi; Chen, Haoran; Zheng, Qingping; Tang, Pin; Jin, Yeyin; Zhang, Yuang; Cheng, Junqi; Lu, Zenghui; Shu, Peng; Wu, Zuxuan; Jiang, Yu-Gang
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2604.09201

Table of Contents:

Camera-controllable video generation aims to synthesize videos with flexible and physically plausible camera movements. However, existing methods either provide imprecise camera control from text prompts or rely on labor-intensive, manually specified camera trajectory parameters, limiting their use in automated scenarios. To address these issues, we propose CT-1 (Camera Transformer 1), a Vision-Language-Camera model designed to transfer spatial reasoning knowledge to video generation by accurately estimating camera trajectories. Built upon vision-language modules and a Diffusion Transformer model, CT-1 employs a Wavelet-based Regularization Loss in the frequency domain to effectively learn complex camera trajectory distributions. The estimated trajectories are then integrated into a video diffusion model to enable spatially aware camera control that aligns with user intentions. To facilitate the training of CT-1, we design a dedicated data curation pipeline and construct CT-200K, a large-scale dataset containing over 47M frames. Experimental results demonstrate that our framework bridges the gap between spatial reasoning and video synthesis, yielding faithful, high-quality camera-controllable videos and improving camera control accuracy by 25.7% over prior methods.
