| Main Authors: | Li, Ke, Li, Maoliang, Chen, Jialiang, Chen, Jiayu, Zheng, Zihao, Wang, Shaoqi, Chen, Xiang |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.04875 |
Similar Items
Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing
by: Chen, Yaru, et al.
Published: (2025)
HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning
by: Chen, Liyang, et al.
Published: (2025)
HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval
by: Chen, Zhiwei, et al.
Published: (2025)
Spatiotemporal Graph Guided Multi-modal Network for Livestreaming Product Retrieval
by: Hu, Xiaowan, et al.
Published: (2024)
Learning Brain Representation with Hierarchical Visual Embeddings
by: Zheng, Jiawen, et al.
Published: (2026)
Unbiased Video Scene Graph Generation via Visual and Semantic Dual Debiasing
by: Li, Yanjun, et al.
Published: (2025)
TEn-CATG: Text-Enriched Audio-Visual Video Parsing with Multi-Scale Category-Aware Temporal Graph
by: Chen, Yaru, et al.
Published: (2025)
Language-Guided Diffusion Model for Visual Grounding
by: Chen, Sijia, et al.
Published: (2023)
Taming Flow-based I2V Models for Creative Video Editing
by: Kong, Xianghao, et al.
Published: (2025)
ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large Language Models
by: Qu, Mengxue, et al.
Published: (2024)
Visual Autoregressive Modeling for Instruction-Guided Image Editing
by: Mao, Qingyang, et al.
Published: (2025)
MM-MovieDubber: Towards Multi-Modal Learning for Multi-Modal Movie Dubbing
by: Zheng, Junjie, et al.
Published: (2025)
Region-Constraint In-Context Generation for Instructional Video Editing
by: Zhang, Zhongwei, et al.
Published: (2025)
VCoME: Verbal Video Composition with Multimodal Editing Effects
by: Gong, Weibo, et al.
Published: (2024)
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding
by: Fang, Xinyu, et al.
Published: (2024)
MIRAGE: Runtime Scheduling for Multi-Vector Image Retrieval with Hierarchical Decomposition
by: Li, Maoliang, et al.
Published: (2025)
Turing Patterns for Multimedia: Reaction-Diffusion Multi-Modal Fusion for Language-Guided Video Moment Retrieval
by: Fang, Xiang, et al.
Published: (2026)
Edit As You Wish: Video Caption Editing with Multi-grained User Control
by: Yao, Linli, et al.
Published: (2023)
Hierarchical Action Recognition: A Contrastive Video-Language Approach with Hierarchical Interactions
by: Zhang, Rui, et al.
Published: (2024)
YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls
by: Chen, Zihao, et al.
Published: (2024)
VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting
by: Ilaslan, Muhammet Furkan, et al.
Published: (2024)
Bernini: Latent Semantic Planning for Video Diffusion
by: Bernini Team, et al.
Published: (2026)
EditHF-1M: A Million-Scale Rich Human Preference Feedback for Image Editing
by: Xu, Zitong, et al.
Published: (2026)
AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks
by: Ku, Max, et al.
Published: (2024)
Multi-modal Segment Assemblage Network for Ad Video Editing with Importance-Coherence Reward
by: Tang, Yolo Yunlong, et al.
Published: (2022)
MSVBench: Towards Human-Level Evaluation of Multi-Shot Video Generation
by: Shi, Haoyuan, et al.
Published: (2026)
FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis
by: Liang, Feng, et al.
Published: (2023)
UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation
by: Li, Hebeizi, et al.
Published: (2026)
Bridging Your Imagination with Audio-Video Generation via a Unified Director
by: Zhang, Jiaxu, et al.
Published: (2025)
VC-Bench: Pioneering the Video Connecting Benchmark with a Dataset and Evaluation Metrics
by: Yin, Zhiyu, et al.
Published: (2026)
Multimodal Fake News Video Explanation: Dataset, Analysis and Evaluation
by: Chen, Lizhi, et al.
Published: (2025)
Learning Generalizable and Efficient Image Watermarking via Hierarchical Two-Stage Optimization
by: Liu, Ke, et al.
Published: (2025)
GMFVAD: Using Grained Multi-modal Feature to Improve Video Anomaly Detection
by: Dai, Guangyu, et al.
Published: (2025)
Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing
by: Tian, Zeyue, et al.
Published: (2026)
Med-Banana-50K: A Cross-modality Large-Scale Dataset for Text-guided Medical Image Editing
by: Chen, Zhihui, et al.
Published: (2025)
Generalizable Deepfake Detection Based on Forgery-aware Layer Masking and Multi-artifact Subspace Decomposition
by: Zhang, Xiang, et al.
Published: (2026)
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling
by: Xu, Jiaqi, et al.
Published: (2023)
Follow-Your-MultiPose: Tuning-Free Multi-Character Text-to-Video Generation via Pose Guidance
by: Zhang, Beiyuan, et al.
Published: (2024)
AIS 2024 Challenge on Video Quality Assessment of User-Generated Content: Methods and Results
by: Conde, Marcos V., et al.
Published: (2024)
CoPRS: Learning Positional Prior from Chain-of-Thought for Reasoning Segmentation
by: Lu, Zhenyu, et al.
Published: (2025)