Bibliographic Details
Main Authors: He, Qingdong, Wang, Chaoyi, Tang, Peng, Yang, Yifan, Hu, Xiaobin
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2603.21901
_version_ 1866917480458354688
author He, Qingdong
Wang, Chaoyi
Tang, Peng
Yang, Yifan
Hu, Xiaobin
contents Video subtitle removal aims to distinguish text overlays from background content while preserving temporal coherence. Existing diffusion-based methods require explicit mask sequences during both training and inference, which restricts their practical deployment. In this paper, we present CLEAR (Context-aware Learning for End-to-end Adaptive Video Subtitle Removal), a mask-free framework that achieves end-to-end inference through context-aware adaptive learning. Our two-stage design decouples prior extraction from generative refinement: Stage I learns disentangled subtitle representations via self-supervised orthogonality constraints on dual encoders, while Stage II employs LoRA-based adaptation with generation feedback for dynamic context adjustment. Notably, our method requires only 0.77% of the parameters of the base diffusion model for training. On Chinese subtitle benchmarks, CLEAR outperforms mask-dependent baselines by +6.77 dB PSNR and -74.7% VFID, and it demonstrates superior zero-shot generalization across six languages (English, Korean, French, Japanese, Russian, German). We attribute this performance to our generation-driven feedback mechanism, which enables robust subtitle removal without ground-truth masks during inference.
format Preprint
id arxiv_https___arxiv_org_abs_2603_21901
institution arXiv
publishDate 2026
record_format arxiv
title CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2603.21901