:: Library Catalog

Bibliographic Details
Main Authors:	Liu, Ming; Zhang, Wensheng
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2503.05977

Similar Items

Natural Reflection Backdoor Attack on Vision Language Model for Autonomous Driving
by: Liu, Ming, et al.
Published: (2025)

TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility
by: Motamed, Saman, et al.
Published: (2025)

EvoStreaming: Your Offline Video Model Is a Natively Streaming Assistant
by: Wen, Zichen, et al.
Published: (2026)

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
by: Chen, Dongping, et al.
Published: (2024)

WorldModelBench: Judging Video Generation Models As World Models
by: Li, Dacheng, et al.
Published: (2025)

PRIME: Protect Your Videos From Malicious Editing
by: Li, Guanlin, et al.
Published: (2024)

Fuse Your Latents: Video Editing with Multi-source Latent Diffusion Models
by: Lu, Tianyi, et al.
Published: (2023)

StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
by: Wang, Haibo, et al.
Published: (2025)

Temporal Regularization Makes Your Video Generator Stronger
by: Chen, Harold Haodong, et al.
Published: (2025)

AdaptGCD: Multi-Expert Adapter Tuning for Generalized Category Discovery
by: Qu, Yuxun, et al.
Published: (2024)

DeVAn: Dense Video Annotation for Video-Language Models
by: Liu, Tingkai, et al.
Published: (2023)

Pack and Force Your Memory: Long-form and Consistent Video Generation
by: Wu, Xiaofei, et al.
Published: (2025)

LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models
by: Jiang, Zhiyuan, et al.
Published: (2026)

VideoPoet: A Large Language Model for Zero-Shot Video Generation
by: Kondratyuk, Dan, et al.
Published: (2023)

Learning to Decode Against Compositional Hallucination in Video Multimodal Large Language Models
by: Xing, Wenbin, et al.
Published: (2026)

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
by: Li, Yifei, et al.
Published: (2025)

Your One-Stop Solution for AI-Generated Video Detection
by: Ma, Long, et al.
Published: (2026)

Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models
by: Ding, Meidan, et al.
Published: (2025)

GameVerse: Can Vision-Language Models Learn from Video-based Reflection?
by: Zhang, Kuan, et al.
Published: (2026)

To Trust Or Not To Trust Your Vision-Language Model's Prediction
by: Dong, Hao, et al.
Published: (2025)

Multimodal Video Emotion Recognition with Reliable Reasoning Priors
by: Wang, Zhepeng, et al.
Published: (2025)

Can Vision Language Models Judge Action Quality? An Empirical Evaluation
by: Freitas, Miguel Monte e, et al.
Published: (2026)

Don't Judge by the Look: Towards Motion Coherent Video Representation
by: Zhang, Yitian, et al.
Published: (2024)

Your Vision-Language Model Can't Even Count to 20: Exposing the Failures of VLMs in Compositional Counting
by: Guo, Xuyang, et al.
Published: (2025)

When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning
by: Wu, Zhengxian, et al.
Published: (2026)

You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass
by: Yang, Yinuo, et al.
Published: (2026)

SketchJudge: A Diagnostic Benchmark for Grading Hand-drawn Diagrams with Multimodal Large Language Models
by: Su, Yuhang, et al.
Published: (2026)

Explicit Abstention Knobs for Predictable Reliability in Video Question Answering
by: Ortiz, Jorge
Published: (2025)

Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding
by: Gu, Xin, et al.
Published: (2025)

Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations
by: Park, Jungin, et al.
Published: (2025)

SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model
by: Wang, Guankun, et al.
Published: (2025)

VideoCogQA: A Controllable Benchmark for Evaluating Cognitive Abilities in Video-Language Models
by: Li, Chenglin, et al.
Published: (2024)

CopyJudge: Automated Copyright Infringement Identification and Mitigation in Text-to-Image Diffusion Models
by: Liu, Shunchang, et al.
Published: (2025)

TEMPLE: Incentivizing Temporal Understanding of Video Large Language Models via Progressive Pre-SFT Alignment
by: Li, Shicheng, et al.
Published: (2025)

MedProbCLIP: Probabilistic Adaptation of Vision-Language Foundation Model for Reliable Radiograph-Report Retrieval
by: Elallaf, Ahmad, et al.
Published: (2026)

TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models
by: Chen, Ruidong, et al.
Published: (2025)

COLT: Enhancing Video Large Language Models with Continual Tool Usage
by: Liu, Yuyang, et al.
Published: (2025)

AD-EE: Early Exiting for Fast and Reliable Vision-Language Models in Autonomous Driving
by: Huang, Lianming, et al.
Published: (2025)

From Evaluation to Defense: Advancing Safety in Video Large Language Models
by: Sun, Yiwei, et al.
Published: (2025)

ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges
by: Ai, Jiaxin, et al.
Published: (2025)