| Main Authors: | Wang, Chao, Zhang, Luning, Wang, Zheng, Zhou, Yang |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.19973 |
Similar Items
Reasoning Can Hurt the Inductive Abilities of Large Language Models
by: Jin, Haibo, et al.
Published: (2025)
Enhancing Advanced Visual Reasoning Ability of Large Language Models
by: Li, Zhiyuan, et al.
Published: (2024)
Medical Large Vision Language Models with Multi-Image Visual Ability
by: Yang, Xikai, et al.
Published: (2025)
AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?
by: Bao, Han, et al.
Published: (2024)
ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis
by: Zhang, Congzhi, et al.
Published: (2025)
LLM-RG: Referential Grounding in Outdoor Scenarios using Large Language Models
by: Saxena, Pranav, et al.
Published: (2025)
Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models
by: Wang, Yuqing, et al.
Published: (2023)
TPC: Cross-Temporal Prediction Connection for Vision-Language Model Hallucination Reduction
by: Wang, Chao, et al.
Published: (2025)
Exploiting Mixture-of-Experts Redundancy Unlocks Multimodal Generative Abilities
by: Dutt, Raman, et al.
Published: (2025)
Blocks as Probes: Dissecting Categorization Ability of Large Multimodal Models
by: Fu, Bin, et al.
Published: (2024)
LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models
by: Zhang, Ruiyi, et al.
Published: (2024)
Unveiling the Response of Large Vision-Language Models to Visually Absent Tokens
by: Kim, Sohee, et al.
Published: (2025)
LVLM-COUNT: Enhancing the Counting Ability of Large Vision-Language Models
by: Qharabagh, Muhammad Fetrat, et al.
Published: (2024)
HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models
by: Kang, Zhaolu, et al.
Published: (2025)
Beyond Static Vision: Scene Dynamic Field Unlocks Intuitive Physics Understanding in Multi-modal Large Language Models
by: Li, Nanxi, et al.
Published: (2026)
Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans
by: Qiu, Yansheng, et al.
Published: (2025)
Emphasizing Discriminative Features for Dataset Distillation in Complex Scenarios
by: Wang, Kai, et al.
Published: (2024)
Research on Driving Scenario Technology Based on Multimodal Large Language Model Optimization
by: Mengjie, Wang, et al.
Published: (2025)
Apollo: An Exploration of Video Understanding in Large Multimodal Models
by: Zohar, Orr, et al.
Published: (2024)
SeqBench: Benchmarking Sequential Narrative Generation in Text-to-Video Models
by: Tang, Zhengxu, et al.
Published: (2025)
Can GPT tell us why these images are synthesized? Empowering Multimodal Large Language Models for Forensics
by: He, Yiran, et al.
Published: (2025)
From Summary to Action: Enhancing Large Language Models for Complex Tasks with Open World APIs
by: Liu, Yulong, et al.
Published: (2024)
Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models
by: Chen, Shimin, et al.
Published: (2024)
Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels?
by: Li, Xiujun, et al.
Published: (2023)
TRINS: Towards Multimodal Language Models that Can Read
by: Zhang, Ruiyi, et al.
Published: (2024)
Mitigating Entangled Steering in Large Vision-Language Models for Hallucination Reduction
by: Zhang, Yuanhong, et al.
Published: (2026)
Unveiling the Pitfalls of Knowledge Editing for Large Language Models
by: Li, Zhoubo, et al.
Published: (2023)
SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation
by: Zhang, Wenyu, et al.
Published: (2024)
UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios
by: Zhou, Baichuan, et al.
Published: (2024)
Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind
by: Li, Qingmei, et al.
Published: (2025)
Unlocking Vision-Language Models for Video Anomaly Detection via Fine-Grained Prompting
by: Zou, Shu, et al.
Published: (2025)
BLINK: Multimodal Large Language Models Can See but Not Perceive
by: Fu, Xingyu, et al.
Published: (2024)
Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
by: Yang, Cheng, et al.
Published: (2025)
Using Vision Language Models to Detect Students' Academic Emotion through Facial Expressions
by: Wang, Deliang, et al.
Published: (2025)
MAGIC: Meta-Ability Guided Interactive Chain-of-Distillation for Effective-and-Efficient Vision-and-Language Navigation
by: Wang, Liuyi, et al.
Published: (2024)
MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns
by: Zhang, Jiarui, et al.
Published: (2025)
Can Multimodal Large Language Models Understand Pathologic Movements? A Pilot Study on Seizure Semiology
by: Zhang, Lina, et al.
Published: (2026)
LocateBench: Evaluating the Locating Ability of Vision Language Models
by: Chiang, Ting-Rui, et al.
Published: (2024)
Hyperbolic and Evidence-Prioritized Experts for Large Vision-Language Models
by: Zhou, Zijie, et al.
Published: (2026)
UniCode: Learning a Unified Codebook for Multimodal Large Language Models
by: Zheng, Sipeng, et al.
Published: (2024)