Staff View: Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs :: Library Catalog

Bibliographic Details
Main Authors:	Pratama, Dhita Putri; Han, Soyeon Caren; Ding, Yihao
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2602.20878

_version_	1866914347245109248
author	Pratama, Dhita Putri; Han, Soyeon Caren; Ding, Yihao
author_facet	Pratama, Dhita Putri; Han, Soyeon Caren; Ding, Yihao
contents	Large Vision-Language Models (LVLMs) achieve strong performance on visual question answering benchmarks, yet often rely on spurious correlations rather than genuine causal reasoning. Existing evaluations primarily assess the correctness of the answers, making it unclear whether failures arise from limited reasoning capability or from misidentifying causally relevant information. We introduce Vision-Language Causal Graphs (VLCGs), a structured, query-conditioned representation that explicitly encodes causally relevant objects, attributes, relations, and scene-grounded assumptions. Building on this representation, we present ViLCaR, a diagnostic benchmark comprising tasks for Causal Attribution, Causal Inference, and Question Answering, along with graph-aligned evaluation metrics that assess relevance identification beyond final answer accuracy. Experiments on state-of-the-art LVLMs show that injecting structured relevance information significantly improves attribution and inference consistency compared to zero-shot and standard in-context learning. These findings suggest that current limitations in LVLM causal reasoning stem primarily from insufficient structural guidance rather than a lack of reasoning capacity.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_20878
institution	arXiv
publishDate	2026
record_format	arxiv
title	Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs
topic	Artificial Intelligence
url	https://arxiv.org/abs/2602.20878