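Staff View: AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation :: Library Catalog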

Bibliographic Details
Main Authors:	Zhou, Milton; Qin, Sizhong; Li, Yongzhi; Chen, Quan; Jiang, Peng
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2603.28366

_version_	1866912988616720384
author	Zhou, Milton; Qin, Sizhong; Li, Yongzhi; Chen, Quan; Jiang, Peng
author_facet	Zhou, Milton; Qin, Sizhong; Li, Yongzhi; Chen, Quan; Jiang, Peng
contents	Short-form videos have become a primary medium for digital advertising, requiring scalable and efficient content creation. However, current workflows and AI tools remain disjoint and modality-specific, leading to high production costs and low overall efficiency. To address this issue, we propose AutoCut, an end-to-end advertisement video editing framework based on multimodal discretization and controllable editing. AutoCut employs dedicated encoders to extract video and audio features, then applies residual vector quantization to discretize them into unified tokens aligned with textual representations, constructing a shared video-audio-text token space. Built upon a foundation model, we further develop a multimodal large language model for video editing through combined multimodal alignment and supervised fine-tuning, supporting tasks covering video selection and ordering, script generation, and background music selection within a unified editing framework. Finally, a complete production pipeline converts the predicted token sequences into deployable long video outputs. Experiments on real-world advertisement datasets show that AutoCut reduces production cost and iteration time while substantially improving consistency and controllability, paving the way for scalable video creation.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_28366
institution	arXiv
publishDate	2026
record_format	arxiv
title	AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2603.28366

Similar Items