| Main Authors: | Liu, Ming, Zhang, Wensheng |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2503.05977 |
Similar Items
Natural Reflection Backdoor Attack on Vision Language Model for Autonomous Driving
by: Liu, Ming, et al.
Published: (2025)
TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility
by: Motamed, Saman, et al.
Published: (2025)
EvoStreaming: Your Offline Video Model Is a Natively Streaming Assistant
by: Wen, Zichen, et al.
Published: (2026)
MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
by: Chen, Dongping, et al.
Published: (2024)
WorldModelBench: Judging Video Generation Models As World Models
by: Li, Dacheng, et al.
Published: (2025)
PRIME: Protect Your Videos From Malicious Editing
by: Li, Guanlin, et al.
Published: (2024)
Fuse Your Latents: Video Editing with Multi-source Latent Diffusion Models
by: Lu, Tianyi, et al.
Published: (2023)
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
by: Wang, Haibo, et al.
Published: (2025)
Temporal Regularization Makes Your Video Generator Stronger
by: Chen, Harold Haodong, et al.
Published: (2025)
AdaptGCD: Multi-Expert Adapter Tuning for Generalized Category Discovery
by: Qu, Yuxun, et al.
Published: (2024)
DeVAn: Dense Video Annotation for Video-Language Models
by: Liu, Tingkai, et al.
Published: (2023)
Pack and Force Your Memory: Long-form and Consistent Video Generation
by: Wu, Xiaofei, et al.
Published: (2025)
LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models
by: Jiang, Zhiyuan, et al.
Published: (2026)
VideoPoet: A Large Language Model for Zero-Shot Video Generation
by: Kondratyuk, Dan, et al.
Published: (2023)
Learning to Decode Against Compositional Hallucination in Video Multimodal Large Language Models
by: Xing, Wenbin, et al.
Published: (2026)
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
by: Li, Yifei, et al.
Published: (2025)
Your One-Stop Solution for AI-Generated Video Detection
by: Ma, Long, et al.
Published: (2026)
Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models
by: Ding, Meidan, et al.
Published: (2025)
GameVerse: Can Vision-Language Models Learn from Video-based Reflection?
by: Zhang, Kuan, et al.
Published: (2026)
To Trust Or Not To Trust Your Vision-Language Model's Prediction
by: Dong, Hao, et al.
Published: (2025)
Multimodal Video Emotion Recognition with Reliable Reasoning Priors
by: Wang, Zhepeng, et al.
Published: (2025)
Can Vision Language Models Judge Action Quality? An Empirical Evaluation
by: Freitas, Miguel Monte e, et al.
Published: (2026)
Don't Judge by the Look: Towards Motion Coherent Video Representation
by: Zhang, Yitian, et al.
Published: (2024)
Your Vision-Language Model Can't Even Count to 20: Exposing the Failures of VLMs in Compositional Counting
by: Guo, Xuyang, et al.
Published: (2025)
When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning
by: Wu, Zhengxian, et al.
Published: (2026)
You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass
by: Yang, Yinuo, et al.
Published: (2026)
SketchJudge: A Diagnostic Benchmark for Grading Hand-drawn Diagrams with Multimodal Large Language Models
by: Su, Yuhang, et al.
Published: (2026)
Explicit Abstention Knobs for Predictable Reliability in Video Question Answering
by: Ortiz, Jorge
Published: (2025)
Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding
by: Gu, Xin, et al.
Published: (2025)
Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations
by: Park, Jungin, et al.
Published: (2025)
SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model
by: Wang, Guankun, et al.
Published: (2025)
VideoCogQA: A Controllable Benchmark for Evaluating Cognitive Abilities in Video-Language Models
by: Li, Chenglin, et al.
Published: (2024)
CopyJudge: Automated Copyright Infringement Identification and Mitigation in Text-to-Image Diffusion Models
by: Liu, Shunchang, et al.
Published: (2025)
TEMPLE: Incentivizing Temporal Understanding of Video Large Language Models via Progressive Pre-SFT Alignment
by: Li, Shicheng, et al.
Published: (2025)
MedProbCLIP: Probabilistic Adaptation of Vision-Language Foundation Model for Reliable Radiograph-Report Retrieval
by: Elallaf, Ahmad, et al.
Published: (2026)
TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models
by: Chen, Ruidong, et al.
Published: (2025)
COLT: Enhancing Video Large Language Models with Continual Tool Usage
by: Liu, Yuyang, et al.
Published: (2025)
AD-EE: Early Exiting for Fast and Reliable Vision-Language Models in Autonomous Driving
by: Huang, Lianming, et al.
Published: (2025)
From Evaluation to Defense: Advancing Safety in Video Large Language Models
by: Sun, Yiwei, et al.
Published: (2025)
ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges
by: Ai, Jiaxin, et al.
Published: (2025)