:: Library Catalog

Bibliographic Details
Main Authors:	Cao, Yuxin, Song, Wei, Xue, Jingling, Dong, Jin Song
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2509.20851

Similar Items

Failures to Surface Harmful Contents in Video Large Language Models
by: Cao, Yuxin, et al.
Published: (2025)

VideoSTF: Stress-Testing Output Repetition in Video Large Language Models
by: Cao, Yuxin, et al.
Published: (2026)

FAIRT2V: Training-Free Debiasing for Text-to-Video Diffusion Models
by: Zhong, Haonan, et al.
Published: (2026)

Prompts to Summaries: Zero-Shot Language-Guided Video Summarization with Large Language and Video Models
by: Barbara, Mario, et al.
Published: (2025)

Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models
by: Ma, Bingqi, et al.
Published: (2024)

Multi-Turn Adaptive Prompting Attack on Large Vision-Language Models
by: Choi, In Chong, et al.
Published: (2026)

DeMark: A Query-Free Black-Box Attack on Deepfake Watermarking Defenses
by: Song, Wei, et al.
Published: (2026)

CauCLIP: Bridging the Sim-to-Real Gap in Surgical Video Understanding via Causality-Inspired Vision-Language Modeling
by: He, Yuxin, et al.
Published: (2026)

SAGE: Accelerating Vision-Language Models via Entropy-Guided Adaptive Speculative Decoding
by: Tong, Yujia, et al.
Published: (2026)

Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models
by: Cao, Meng, et al.
Published: (2025)

Visual Attention Drifts, but Anchors Hold: Mitigating Hallucination in Multimodal Large Language Models via Cross-Layer Visual Anchors
by: Yang, Chengxu, et al.
Published: (2026)

VideoLLM-online: Online Video Large Language Model for Streaming Video
by: Chen, Joya, et al.
Published: (2024)

Language-Guided Token Compression with Reinforcement Learning in Large Vision-Language Models
by: Cao, Sihan, et al.
Published: (2026)

TiFRe: Text-guided Video Frame Reduction for Efficient Video Multi-modal Large Language Models
by: Zheng, Xiangtian, et al.
Published: (2026)

Rethinking Token Reduction for Large Vision-Language Models
by: Wang, Yi, et al.
Published: (2026)

Adversarial Prompt Injection Attack on Multimodal Large Language Models
by: Ding, Meiwen, et al.
Published: (2026)

Mitigating Multilingual Hallucination in Large Vision-Language Models
by: Qu, Xiaoye, et al.
Published: (2024)

Large Language Model for Lossless Image Compression with Visual Prompts
by: Du, Junhao, et al.
Published: (2025)

EventSTU: Event-Guided Efficient Spatio-Temporal Understanding for Video Large Language Models
by: Xu, Wenhao, et al.
Published: (2025)

EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling
by: Song, Jiafei, et al.
Published: (2026)

LLM4VG: Large Language Models Evaluation for Video Grounding
by: Feng, Wei, et al.
Published: (2023)

Coherent Video Inpainting Using Optical Flow-Guided Efficient Diffusion
by: Gu, Bohai, et al.
Published: (2024)

TextRefiner: Internal Visual Feature as Efficient Refiner for Vision-Language Models Prompt Tuning
by: Xie, Jingjing, et al.
Published: (2024)

LocalStyleFool: Regional Video Style Transfer Attack Using Segment Anything Model
by: Cao, Yuxin, et al.
Published: (2024)

Understanding the Multi-modal Prompts of the Pre-trained Vision-Language Model
by: Ma, Shuailei, et al.
Published: (2023)

SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing
by: Zhang, Xinyao, et al.
Published: (2026)

Reasoning-Enhanced Domain-Adaptive Pretraining of Multimodal Large Language Models for Short Video Content Governance
by: Wang, Zixuan, et al.
Published: (2025)

Video Anomaly Detection with Motion and Appearance Guided Patch Diffusion Model
by: Zhou, Hang, et al.
Published: (2024)

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
by: Jin, Peng, et al.
Published: (2023)

Do Egocentric Video-Language Models Truly Understand Hand-Object Interactions?
by: Xu, Boshen, et al.
Published: (2024)

DP^2-VL: Private Photo Dataset Protection by Data Poisoning for Vision-Language Models
by: Miao, Hongyi, et al.
Published: (2026)

LogoStyleFool: Vitiating Video Recognition Systems via Logo Style Transfer
by: Cao, Yuxin, et al.
Published: (2023)

StreamingEffect: Real-Time Human-Centric Video Effect Generation
by: Song, Yiren, et al.
Published: (2026)

Using Prompts to Guide Large Language Models in Imitating a Real Person's Language Style
by: Chen, Ziyang, et al.
Published: (2024)

Empowering Large Language Model for Continual Video Question Answering with Collaborative Prompting
by: Cai, Chen, et al.
Published: (2024)

MePT: Multi-Representation Guided Prompt Tuning for Vision-Language Model
by: Wang, Xinyang, et al.
Published: (2024)

Unleashing the Potential of All Test Samples: Mean-Shift Guided Test-Time Adaptation
by: Han, Jizhou, et al.
Published: (2025)

DUAL-VAD: Dual Benchmarks and Anomaly-Focused Sampling for Video Anomaly Detection
by: Jung, Seoik, et al.
Published: (2025)

Context Guided Transformer Entropy Modeling for Video Compression
by: Tong, Junlong, et al.
Published: (2025)

Towards Fine-Grained Robustness: Attention-Guided Test-Time Prompt Tuning for Vision-Language Models
by: Hai, Jia-Wei, et al.
Published: (2026)