| Main Authors: | Cao, Yuxin, Song, Wei, Xue, Jingling, Dong, Jin Song |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.20851 |
Similar Items
Failures to Surface Harmful Contents in Video Large Language Models
by: Cao, Yuxin, et al.
Published: (2025)
VideoSTF: Stress-Testing Output Repetition in Video Large Language Models
by: Cao, Yuxin, et al.
Published: (2026)
FAIRT2V: Training-Free Debiasing for Text-to-Video Diffusion Models
by: Zhong, Haonan, et al.
Published: (2026)
Prompts to Summaries: Zero-Shot Language-Guided Video Summarization with Large Language and Video Models
by: Barbara, Mario, et al.
Published: (2025)
Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models
by: Ma, Bingqi, et al.
Published: (2024)
Multi-Turn Adaptive Prompting Attack on Large Vision-Language Models
by: Choi, In Chong, et al.
Published: (2026)
DeMark: A Query-Free Black-Box Attack on Deepfake Watermarking Defenses
by: Song, Wei, et al.
Published: (2026)
CauCLIP: Bridging the Sim-to-Real Gap in Surgical Video Understanding via Causality-Inspired Vision-Language Modeling
by: He, Yuxin, et al.
Published: (2026)
SAGE: Accelerating Vision-Language Models via Entropy-Guided Adaptive Speculative Decoding
by: Tong, Yujia, et al.
Published: (2026)
Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models
by: Cao, Meng, et al.
Published: (2025)
Visual Attention Drifts, but Anchors Hold: Mitigating Hallucination in Multimodal Large Language Models via Cross-Layer Visual Anchors
by: Yang, Chengxu, et al.
Published: (2026)
VideoLLM-online: Online Video Large Language Model for Streaming Video
by: Chen, Joya, et al.
Published: (2024)
Language-Guided Token Compression with Reinforcement Learning in Large Vision-Language Models
by: Cao, Sihan, et al.
Published: (2026)
TiFRe: Text-guided Video Frame Reduction for Efficient Video Multi-modal Large Language Models
by: Zheng, Xiangtian, et al.
Published: (2026)
Rethinking Token Reduction for Large Vision-Language Models
by: Wang, Yi, et al.
Published: (2026)
Adversarial Prompt Injection Attack on Multimodal Large Language Models
by: Ding, Meiwen, et al.
Published: (2026)
Mitigating Multilingual Hallucination in Large Vision-Language Models
by: Qu, Xiaoye, et al.
Published: (2024)
Large Language Model for Lossless Image Compression with Visual Prompts
by: Du, Junhao, et al.
Published: (2025)
EventSTU: Event-Guided Efficient Spatio-Temporal Understanding for Video Large Language Models
by: Xu, Wenhao, et al.
Published: (2025)
EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling
by: Song, Jiafei, et al.
Published: (2026)
LLM4VG: Large Language Models Evaluation for Video Grounding
by: Feng, Wei, et al.
Published: (2023)
Coherent Video Inpainting Using Optical Flow-Guided Efficient Diffusion
by: Gu, Bohai, et al.
Published: (2024)
TextRefiner: Internal Visual Feature as Efficient Refiner for Vision-Language Models Prompt Tuning
by: Xie, Jingjing, et al.
Published: (2024)
LocalStyleFool: Regional Video Style Transfer Attack Using Segment Anything Model
by: Cao, Yuxin, et al.
Published: (2024)
Understanding the Multi-modal Prompts of the Pre-trained Vision-Language Model
by: Ma, Shuailei, et al.
Published: (2023)
SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing
by: Zhang, Xinyao, et al.
Published: (2026)
Reasoning-Enhanced Domain-Adaptive Pretraining of Multimodal Large Language Models for Short Video Content Governance
by: Wang, Zixuan, et al.
Published: (2025)
Video Anomaly Detection with Motion and Appearance Guided Patch Diffusion Model
by: Zhou, Hang, et al.
Published: (2024)
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
by: Jin, Peng, et al.
Published: (2023)
Do Egocentric Video-Language Models Truly Understand Hand-Object Interactions?
by: Xu, Boshen, et al.
Published: (2024)
DP^2-VL: Private Photo Dataset Protection by Data Poisoning for Vision-Language Models
by: Miao, Hongyi, et al.
Published: (2026)
LogoStyleFool: Vitiating Video Recognition Systems via Logo Style Transfer
by: Cao, Yuxin, et al.
Published: (2023)
StreamingEffect: Real-Time Human-Centric Video Effect Generation
by: Song, Yiren, et al.
Published: (2026)
Using Prompts to Guide Large Language Models in Imitating a Real Person's Language Style
by: Chen, Ziyang, et al.
Published: (2024)
Empowering Large Language Model for Continual Video Question Answering with Collaborative Prompting
by: Cai, Chen, et al.
Published: (2024)
MePT: Multi-Representation Guided Prompt Tuning for Vision-Language Model
by: Wang, Xinyang, et al.
Published: (2024)
Unleashing the Potential of All Test Samples: Mean-Shift Guided Test-Time Adaptation
by: Han, Jizhou, et al.
Published: (2025)
DUAL-VAD: Dual Benchmarks and Anomaly-Focused Sampling for Video Anomaly Detection
by: Jung, Seoik, et al.
Published: (2025)
Context Guided Transformer Entropy Modeling for Video Compression
by: Tong, Junlong, et al.
Published: (2025)
Towards Fine-Grained Robustness: Attention-Guided Test-Time Prompt Tuning for Vision-Language Models
by: Hai, Jia-Wei, et al.
Published: (2026)