| Main Authors: | Luo, Sha, Prabhu, Yogesh, Ossowski, Timothy, Chen, Kaiping, Hu, Junjie |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2601.03369 |
Similar Items
Prompting Large Vision-Language Models for Compositional Reasoning
by: Ossowski, Timothy, et al.
Published: (2024)
Beyond Words: Enhancing Desire, Emotion, and Sentiment Recognition with Non-Verbal Cues
by: Chen, Wei, et al.
Published: (2025)
PatchCue: Enhancing Vision-Language Model Reasoning with Patch-Based Visual Cues
by: Qi, Yukun, et al.
Published: (2026)
Vision-Language Models Mistake Head Orientation for Gaze Direction: Nonverbal Conversation Cues
by: Zhang, Zory, et al.
Published: (2025)
OLIVE: Object Level In-Context Visual Embeddings
by: Ossowski, Timothy, et al.
Published: (2024)
Video-ToC: Video Tree-of-Cue Reasoning
by: Tan, Qizhong, et al.
Published: (2026)
Learning Multimodal Cues of Children's Uncertainty
by: Cheng, Qi, et al.
Published: (2024)
Speaking Beyond Language: A Large-Scale Multimodal Dataset for Learning Nonverbal Cues from Video-Grounded Dialogues
by: Kim, Youngmin, et al.
Published: (2025)
SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues
by: Hinojosa, Carlos, et al.
Published: (2026)
Towards an Automated Multimodal Approach for Video Summarization: Building a Bridge Between Text, Audio and Facial Cue-Based Summarization
by: Islam, Md Moinul, et al.
Published: (2025)
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs
by: Ma, David, et al.
Published: (2025)
MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models
by: Wang, Shengkang, et al.
Published: (2024)
Enhancing Vision-Language Tracking by Effectively Converting Textual Cues into Visual Cues
by: Feng, X., et al.
Published: (2024)
SpurLens: Automatic Detection of Spurious Cues in Multimodal LLMs
by: Hosseini, Parsa, et al.
Published: (2025)
DIFFER: Disentangling Identity Features via Semantic Cues for Clothes-Changing Person Re-ID
by: Liang, Xin, et al.
Published: (2025)
CueBench: Advancing Unified Understanding of Context-Aware Video Anomalies in Real-World
by: Yu, Yating, et al.
Published: (2025)
VRR-QA: Visual Relational Reasoning in Videos Beyond Explicit Cues
by: Swetha, Sirnam, et al.
Published: (2025)
TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models
by: Zhang, Junyi, et al.
Published: (2025)
CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models
by: Chen, Kesheng, et al.
Published: (2026)
Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs
by: Liu, Xuannan, et al.
Published: (2025)
How Far Are Vision-Language Models from Constructing the Real World? A Benchmark for Physical Generative Reasoning
by: Yang, Luyu, et al.
Published: (2026)
MER-Bench: A Comprehensive Benchmark for Multimodal Meme Reappraisal
by: Nie, Yiqi, et al.
Published: (2026)
CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography
by: Fang, I-Sheng, et al.
Published: (2025)
HyperGVL: Benchmarking and Improving Large Vision-Language Models in Hypergraph Understanding and Reasoning
by: Wei, Yanbin, et al.
Published: (2026)
Heron-Bench: A Benchmark for Evaluating Vision Language Models in Japanese
by: Inoue, Yuichi, et al.
Published: (2024)
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
by: Wu, Haoning, et al.
Published: (2024)
ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation
by: Yuan, Shenghai, et al.
Published: (2024)
SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge
by: Wang, Andong, et al.
Published: (2024)
BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues
by: Sarto, Sara, et al.
Published: (2024)
SAP-Bench: Benchmarking Multimodal Large Language Models in Surgical Action Planning
by: Xu, Mengya, et al.
Published: (2025)
MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly
by: Wang, Zhaowei, et al.
Published: (2025)
RoboTrustBench: Benchmarking the Trustworthiness of Video World Models for Robotic Manipulation
by: Li, Huiqiong, et al.
Published: (2026)
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning
by: Zhao, Bingchen, et al.
Published: (2024)
GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling
by: Li, Siqi, et al.
Published: (2025)
R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation
by: Chen, Kaijie, et al.
Published: (2025)
VideoAnchor: Reinforcing Subspace-Structured Visual Cues for Coherent Visual-Spatial Reasoning
by: Wang, Zhaozhi, et al.
Published: (2025)
VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
by: Li, Lei, et al.
Published: (2024)
Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input
by: Li, Chenxu, et al.
Published: (2025)
AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception
by: Huang, Yipo, et al.
Published: (2024)
VideoVista: A Versatile Benchmark for Video Understanding and Reasoning
by: Li, Yunxin, et al.
Published: (2024)