:: Library Catalog

Bibliographic Details
Main Authors:	Wang, Chao; Zhang, Luning; Wang, Zheng; Zhou, Yang
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2502.19973

Similar Items

Reasoning Can Hurt the Inductive Abilities of Large Language Models
by: Jin, Haibo, et al.
Published: (2025)

Enhancing Advanced Visual Reasoning Ability of Large Language Models
by: Li, Zhiyuan, et al.
Published: (2024)

Medical Large Vision Language Models with Multi-Image Visual Ability
by: Yang, Xikai, et al.
Published: (2025)

AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?
by: Bao, Han, et al.
Published: (2024)

ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis
by: Zhang, Congzhi, et al.
Published: (2025)

LLM-RG: Referential Grounding in Outdoor Scenarios using Large Language Models
by: Saxena, Pranav, et al.
Published: (2025)

Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models
by: Wang, Yuqing, et al.
Published: (2023)

TPC: Cross-Temporal Prediction Connection for Vision-Language Model Hallucination Reduction
by: Wang, Chao, et al.
Published: (2025)

Exploiting Mixture-of-Experts Redundancy Unlocks Multimodal Generative Abilities
by: Dutt, Raman, et al.
Published: (2025)

Blocks as Probes: Dissecting Categorization Ability of Large Multimodal Models
by: Fu, Bin, et al.
Published: (2024)

LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models
by: Zhang, Ruiyi, et al.
Published: (2024)

Unveiling the Response of Large Vision-Language Models to Visually Absent Tokens
by: Kim, Sohee, et al.
Published: (2025)

LVLM-COUNT: Enhancing the Counting Ability of Large Vision-Language Models
by: Qharabagh, Muhammad Fetrat, et al.
Published: (2024)

HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models
by: Kang, Zhaolu, et al.
Published: (2025)

Beyond Static Vision: Scene Dynamic Field Unlocks Intuitive Physics Understanding in Multi-modal Large Language Models
by: Li, Nanxi, et al.
Published: (2026)

Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans
by: Qiu, Yansheng, et al.
Published: (2025)

Emphasizing Discriminative Features for Dataset Distillation in Complex Scenarios
by: Wang, Kai, et al.
Published: (2024)

Research on Driving Scenario Technology Based on Multimodal Large Lauguage Model Optimization
by: Mengjie, Wang, et al.
Published: (2025)

Apollo: An Exploration of Video Understanding in Large Multimodal Models
by: Zohar, Orr, et al.
Published: (2024)

SeqBench: Benchmarking Sequential Narrative Generation in Text-to-Video Models
by: Tang, Zhengxu, et al.
Published: (2025)

Can GPT tell us why these images are synthesized? Empowering Multimodal Large Language Models for Forensics
by: He, Yiran, et al.
Published: (2025)

From Summary to Action: Enhancing Large Language Models for Complex Tasks with Open World APIs
by: Liu, Yulong, et al.
Published: (2024)

Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models
by: Chen, Shimin, et al.
Published: (2024)

Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels?
by: Li, Xiujun, et al.
Published: (2023)

TRINS: Towards Multimodal Language Models that Can Read
by: Zhang, Ruiyi, et al.
Published: (2024)

Mitigating Entangled Steering in Large Vision-Language Models for Hallucination Reduction
by: Zhang, Yuanhong, et al.
Published: (2026)

Unveiling the Pitfalls of Knowledge Editing for Large Language Models
by: Li, Zhoubo, et al.
Published: (2023)

SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation
by: Zhang, Wenyu, et al.
Published: (2024)

UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios
by: Zhou, Baichuan, et al.
Published: (2024)

Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind
by: Li, Qingmei, et al.
Published: (2025)

Unlocking Vision-Language Models for Video Anomaly Detection via Fine-Grained Prompting
by: Zou, Shu, et al.
Published: (2025)

BLINK: Multimodal Large Language Models Can See but Not Perceive
by: Fu, Xingyu, et al.
Published: (2024)

Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
by: Yang, Cheng, et al.
Published: (2025)

Using Vision Language Models to Detect Students' Academic Emotion through Facial Expressions
by: Wang, Deliang, et al.
Published: (2025)

MAGIC: Meta-Ability Guided Interactive Chain-of-Distillation for Effective-and-Efficient Vision-and-Language Navigation
by: Wang, Liuyi, et al.
Published: (2024)

MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns
by: Zhang, Jiarui, et al.
Published: (2025)

Can Multimodal Large Language Models Understand Pathologic Movements? A Pilot Study on Seizure Semiology
by: Zhang, Lina, et al.
Published: (2026)

LocateBench: Evaluating the Locating Ability of Vision Language Models
by: Chiang, Ting-Rui, et al.
Published: (2024)

Hyperbolic and Evidence-Prioritized Experts for Large Vision-Language Models
by: Zhou, Zijie, et al.
Published: (2026)

UniCode: Learning a Unified Codebook for Multimodal Large Language Models
by: Zheng, Sipeng, et al.
Published: (2024)