:: Library Catalog

Cover Image

Saved in:

Bibliographic Details
Main Authors:	Zhang, Michael, Wang, Elise, Whatley, Charlotte, Strickland, Seth, Bannon, Dylan
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition Artificial Intelligence Computation and Language
Online Access:	https://arxiv.org/abs/2605.00977
Tags:	Add Tag No Tags, Be the first to tag this record!

Similar Items

Semantic-Drive: Democratizing Long-Tail Data Curation via Open-Vocabulary Grounding and Neuro-Symbolic VLM Consensus
by: Guillen-Perez, Antonio
Published: (2025)

Impact of Visual Context on Noisy Multimodal NMT: An Empirical Study for English to Indian Languages
by: Gain, Baban, et al.
Published: (2023)

Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts
by: Sharma, Aditya, et al.
Published: (2024)

Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data
by: Zhang, Chenhui, et al.
Published: (2024)

Capybara-OMNI: An Efficient Paradigm for Building Omni-Modal Language Models
by: Ji, Xingguang, et al.
Published: (2025)

From Hallucinations to Jailbreaks: Rethinking the Vulnerability of Large Foundation Models
by: Jin, Haibo, et al.
Published: (2025)

Data Metabolism: An Efficient Data Design Schema For Vision Language Model
by: Zhang, Jingyuan, et al.
Published: (2025)

Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
by: Wang, Haochen, et al.
Published: (2025)

When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis
by: Zhang, Ruixuan, et al.
Published: (2025)

Web World Models
by: Feng, Jichen, et al.
Published: (2025)

CaughtCheating: Is Your MLLM a Good Cheating Detective? Exploring the Boundary of Visual Perception and Reasoning
by: Li, Ming, et al.
Published: (2025)

VisionGraph: Leveraging Large Multimodal Models for Graph Theory Problems in Visual Context
by: Li, Yunxin, et al.
Published: (2024)

MindCube: Spatial Mental Modeling from Limited Views
by: Wang, Qineng, et al.
Published: (2025)

Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents
by: Ma, Tianyi, et al.
Published: (2025)

Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles
by: Chen, Qi, et al.
Published: (2024)

Reasoning Can Hurt the Inductive Abilities of Large Language Models
by: Jin, Haibo, et al.
Published: (2025)

Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2)
by: Saxon, Michael, et al.
Published: (2024)

Probing and Inducing Combinational Creativity in Vision-Language Models
by: Peng, Yongqian, et al.
Published: (2025)

Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
by: Wang, Haochen, et al.
Published: (2025)

OSPC: Detecting Harmful Memes with Large Language Model as a Catalyst
by: Cao, Jingtao, et al.
Published: (2024)

Mavors: Multi-granularity Video Representation for Multimodal Large Language Model
by: Shi, Yang, et al.
Published: (2025)

MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI
by: Yao, Huanjin, et al.
Published: (2025)

Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding
by: Wang, Ziyang, et al.
Published: (2025)

InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training
by: Zhang, Ziyun, et al.
Published: (2026)

TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation
by: Feng, Weixi, et al.
Published: (2024)

MLLM-CL: Continual Learning for Multimodal Large Language Models
by: Zhao, Hongbo, et al.
Published: (2025)

EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
by: Yang, Rui, et al.
Published: (2025)

ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities
by: Zhu, Chenming, et al.
Published: (2024)

MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation
by: Yu, Shoubin, et al.
Published: (2025)

Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs
by: Ji, Yikun, et al.
Published: (2025)

OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
by: Jia, Mengdi, et al.
Published: (2025)

CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models
by: Tang, Zicong, et al.
Published: (2025)

Automatic Layout Planning for Visually-Rich Documents with Instruction-Following Models
by: Zhu, Wanrong, et al.
Published: (2024)

Cost-effective Instruction Learning for Pathology Vision and Language Analysis
by: Chen, Kaitao, et al.
Published: (2024)

Exploring the Potential of Encoder-free Architectures in 3D LMMs
by: Tang, Yiwen, et al.
Published: (2025)

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
by: Wang, Ziyang, et al.
Published: (2026)

Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants
by: Qin, Lixiong, et al.
Published: (2025)

EVA: Efficient Reinforcement Learning for End-to-End Video Agent
by: Zhang, Yaolun, et al.
Published: (2026)

Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs
by: Zhao, Haozhe, et al.
Published: (2026)

Semantically-Prompted Language Models Improve Visual Descriptions
by: Ogezi, Michael, et al.
Published: (2023)