Bibliographic Details
Main Authors: Zhou, Milton; Qin, Sizhong; Li, Yongzhi; Chen, Quan; Jiang, Peng
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2603.28366
author Zhou, Milton
Qin, Sizhong
Li, Yongzhi
Chen, Quan
Jiang, Peng
contents Short-form videos have become a primary medium for digital advertising, requiring scalable and efficient content creation. However, current workflows and AI tools remain disjoint and modality-specific, leading to high production costs and low overall efficiency. To address this issue, we propose AutoCut, an end-to-end advertisement video editing framework based on multimodal discretization and controllable editing. AutoCut employs dedicated encoders to extract video and audio features, then applies residual vector quantization to discretize them into unified tokens aligned with textual representations, constructing a shared video-audio-text token space. Built upon a foundation model, we further develop a multimodal large language model for video editing through combined multimodal alignment and supervised fine-tuning, supporting tasks covering video selection and ordering, script generation, and background music selection within a unified editing framework. Finally, a complete production pipeline converts the predicted token sequences into deployable long video outputs. Experiments on real-world advertisement datasets show that AutoCut reduces production cost and iteration time while substantially improving consistency and controllability, paving the way for scalable video creation.
format Preprint
id arxiv_https___arxiv_org_abs_2603_28366
institution arXiv
publishDate 2026
record_format arxiv
title AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2603.28366