| Main Authors: | Takato, Hiroshi; Tsutsui, Hiroshi; Soda, Komei; Kamigaito, Hidetaka |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2408.01682 |
Similar Items
Toward Automatic Safe Driving Instruction: A Large-Scale Vision Language Model Approach
by: Sakajo, Haruki, et al.
Published: (2025)
Diagnosing Vision Language Models' Perception by Leveraging Human Methods for Color Vision Deficiencies
by: Hayashi, Kazuki, et al.
Published: (2025)
Constructing Multilingual Visual-Text Datasets Revealing Visual Multilingual Ability of Vision Language Models
by: Atuhurra, Jesse, et al.
Published: (2024)
Evaluating Vision Language Model Adaptations for Radiology Report Generation in Low-Resource Languages
by: Salmè, Marco, et al.
Published: (2025)
VLURes: Benchmarking VLM Visual and Linguistic Understanding in Low-Resource Languages
by: Atuhurra, Jesse, et al.
Published: (2025)
Towards Artwork Explanation in Large-scale Vision Language Models
by: Hayashi, Kazuki, et al.
Published: (2024)
IRR: Image Review Ranking Framework for Evaluating Vision-Language Models
by: Hayashi, Kazuki, et al.
Published: (2024)
Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model
by: Geigle, Gregor, et al.
Published: (2025)
Rethinking Visual Dependency in Long-Context Reasoning for Large Vision-Language Models
by: Zhou, Yucheng, et al.
Published: (2024)
TextTIGER: Text-based Intelligent Generation with Entity Prompt Refinement for Text-to-Image Generation
by: Ozaki, Shintaro, et al.
Published: (2025)
Can Impressions of Music be Extracted from Thumbnail Images?
by: Harada, Takashi, et al.
Published: (2025)
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
by: Wei, Yana, et al.
Published: (2025)
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models
by: Xu, Runsen, et al.
Published: (2025)
VELA: An LLM-Hybrid-as-a-Judge Approach for Evaluating Long Image Captions
by: Matsuda, Kazuki, et al.
Published: (2025)
Frame-Voyager: Learning to Query Frames for Video Large Language Models
by: Yu, Sicheng, et al.
Published: (2024)
Open-Vocabulary Mobile Manipulation Based on Double Relaxed Contrastive Learning with Dense Labeling
by: Yashima, Daichi, et al.
Published: (2024)
PROGRESSLM: Towards Progress Reasoning in Vision-Language Models
by: Zhang, Jianshu, et al.
Published: (2026)
Vision-G1: Towards General Vision Language Reasoning with Multi-Domain Data Curation
by: Zha, Yuheng, et al.
Published: (2025)
Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities
by: Zhang, Zheyuan, et al.
Published: (2024)
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning
by: Zhao, Bingchen, et al.
Published: (2024)
X-Driver: Explainable Autonomous Driving with Vision-Language Models
by: Liu, Wei, et al.
Published: (2025)
Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models
by: Stogiannidis, Ilias, et al.
Published: (2025)
MedFrameQA: A Multi-Image Medical VQA Benchmark for Clinical Reasoning
by: Yu, Suhao, et al.
Published: (2025)
ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos
by: Hannan, Tanveer, et al.
Published: (2024)
Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models
by: Liao, Yuan-Hong, et al.
Published: (2024)
No Tokens Wasted: Leveraging Long Context in Biomedical Vision-Language Models
by: Sun, Min Woo, et al.
Published: (2025)
DENEB: A Hallucination-Robust Automatic Evaluation Metric for Image Captioning
by: Matsuda, Kazuki, et al.
Published: (2024)
On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training
by: Wu, Xueqing, et al.
Published: (2026)
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models
by: You, Haoxuan, et al.
Published: (2023)
Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models
by: Le, Quang-Hung, et al.
Published: (2024)
Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models
by: Rajabi, Navid, et al.
Published: (2023)
The Abstraction Gap in Vision-Language Causal Reasoning
by: Hoang, Chinh, et al.
Published: (2026)
BabyVision: Visual Reasoning Beyond Language
by: Chen, Liang, et al.
Published: (2026)
Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap
by: Xu, Yige, et al.
Published: (2026)
Affordance RAG: Hierarchical Multimodal Retrieval with Affordance-Aware Embodied Memory for Mobile Manipulation
by: Korekata, Ryosuke, et al.
Published: (2025)
MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly
by: Wang, Zhaowei, et al.
Published: (2025)
From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens
by: Sheta, Hala, et al.
Published: (2025)
Evaluating Reasoning Faithfulness in Medical Vision-Language Models using Multimodal Perturbations
by: Moll, Johannes, et al.
Published: (2025)
Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
by: Hu, Yushi, et al.
Published: (2023)
ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models
by: Tang, Liyan, et al.
Published: (2025)