Bibliographic Details
Main Authors: Pratama, Dhita Putri; Han, Soyeon Caren; Ding, Yihao
Format: Preprint
Published: 2026
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2602.20878
contents Large Vision-Language Models (LVLMs) achieve strong performance on visual question answering benchmarks, yet often rely on spurious correlations rather than genuine causal reasoning. Existing evaluations primarily assess answer correctness, making it unclear whether failures arise from limited reasoning capability or from misidentifying causally relevant information. We introduce Vision-Language Causal Graphs (VLCGs), a structured, query-conditioned representation that explicitly encodes causally relevant objects, attributes, relations, and scene-grounded assumptions. Building on this representation, we present ViLCaR, a diagnostic benchmark comprising tasks for Causal Attribution, Causal Inference, and Question Answering, along with graph-aligned evaluation metrics that assess relevance identification beyond final answer accuracy. Experiments on state-of-the-art LVLMs show that injecting structured relevance information significantly improves attribution and inference consistency compared to zero-shot and standard in-context learning. These findings suggest that current limitations in LVLM causal reasoning stem primarily from insufficient structural guidance rather than a lack of reasoning capacity.
format Preprint
id arxiv_https___arxiv_org_abs_2602_20878
institution arXiv
publishDate 2026
record_format arxiv
title Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs
topic Artificial Intelligence
url https://arxiv.org/abs/2602.20878