| Main Authors: | Zhang, Michael, Wang, Elise, Whatley, Charlotte, Strickland, Seth, Bannon, Dylan |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.00977 |
Similar Items
Semantic-Drive: Democratizing Long-Tail Data Curation via Open-Vocabulary Grounding and Neuro-Symbolic VLM Consensus
by: Guillen-Perez, Antonio
Published: (2025)
Impact of Visual Context on Noisy Multimodal NMT: An Empirical Study for English to Indian Languages
by: Gain, Baban, et al.
Published: (2023)
Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts
by: Sharma, Aditya, et al.
Published: (2024)
Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data
by: Zhang, Chenhui, et al.
Published: (2024)
Capybara-OMNI: An Efficient Paradigm for Building Omni-Modal Language Models
by: Ji, Xingguang, et al.
Published: (2025)
From Hallucinations to Jailbreaks: Rethinking the Vulnerability of Large Foundation Models
by: Jin, Haibo, et al.
Published: (2025)
Data Metabolism: An Efficient Data Design Schema For Vision Language Model
by: Zhang, Jingyuan, et al.
Published: (2025)
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
by: Wang, Haochen, et al.
Published: (2025)
When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis
by: Zhang, Ruixuan, et al.
Published: (2025)
Web World Models
by: Feng, Jichen, et al.
Published: (2025)
CaughtCheating: Is Your MLLM a Good Cheating Detective? Exploring the Boundary of Visual Perception and Reasoning
by: Li, Ming, et al.
Published: (2025)
VisionGraph: Leveraging Large Multimodal Models for Graph Theory Problems in Visual Context
by: Li, Yunxin, et al.
Published: (2024)
MindCube: Spatial Mental Modeling from Limited Views
by: Wang, Qineng, et al.
Published: (2025)
Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents
by: Ma, Tianyi, et al.
Published: (2025)
Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles
by: Chen, Qi, et al.
Published: (2024)
Reasoning Can Hurt the Inductive Abilities of Large Language Models
by: Jin, Haibo, et al.
Published: (2025)
Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2)
by: Saxon, Michael, et al.
Published: (2024)
Probing and Inducing Combinational Creativity in Vision-Language Models
by: Peng, Yongqian, et al.
Published: (2025)
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
by: Wang, Haochen, et al.
Published: (2025)
OSPC: Detecting Harmful Memes with Large Language Model as a Catalyst
by: Cao, Jingtao, et al.
Published: (2024)
Mavors: Multi-granularity Video Representation for Multimodal Large Language Model
by: Shi, Yang, et al.
Published: (2025)
MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI
by: Yao, Huanjin, et al.
Published: (2025)
Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding
by: Wang, Ziyang, et al.
Published: (2025)
InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training
by: Zhang, Ziyun, et al.
Published: (2026)
TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation
by: Feng, Weixi, et al.
Published: (2024)
MLLM-CL: Continual Learning for Multimodal Large Language Models
by: Zhao, Hongbo, et al.
Published: (2025)
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
by: Yang, Rui, et al.
Published: (2025)
ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities
by: Zhu, Chenming, et al.
Published: (2024)
MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation
by: Yu, Shoubin, et al.
Published: (2025)
Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs
by: Ji, Yikun, et al.
Published: (2025)
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
by: Jia, Mengdi, et al.
Published: (2025)
CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models
by: Tang, Zicong, et al.
Published: (2025)
Automatic Layout Planning for Visually-Rich Documents with Instruction-Following Models
by: Zhu, Wanrong, et al.
Published: (2024)
Cost-effective Instruction Learning for Pathology Vision and Language Analysis
by: Chen, Kaitao, et al.
Published: (2024)
Exploring the Potential of Encoder-free Architectures in 3D LMMs
by: Tang, Yiwen, et al.
Published: (2025)
EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
by: Wang, Ziyang, et al.
Published: (2026)
Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants
by: Qin, Lixiong, et al.
Published: (2025)
EVA: Efficient Reinforcement Learning for End-to-End Video Agent
by: Zhang, Yaolun, et al.
Published: (2026)
Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs
by: Zhao, Haozhe, et al.
Published: (2026)
Semantically-Prompted Language Models Improve Visual Descriptions
by: Ogezi, Michael, et al.
Published: (2023)