Staff View: Investigating Anisotropy in Visual Grounding under Controlled Counterfactual Perturbations :: Library Catalog

Bibliographic Details
Main Authors:	Lombardo, Gabriele; Maiorana, Luigi; Presti, Liliana Lo; La Cascia, Marco
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.09090

_version_	1866913107286163456
author	Lombardo, Gabriele; Maiorana, Luigi; Presti, Liliana Lo; La Cascia, Marco
author_facet	Lombardo, Gabriele; Maiorana, Luigi; Presti, Liliana Lo; La Cascia, Marco
contents	Visual Grounding benchmarks assume that the object described by a referring expression is always present in the image, and grounding models are therefore rarely evaluated under semantically mismatched captions. In such cases, models frequently exhibit approximation behavior, producing a plausible bounding box that satisfies only part of the expression (e.g., preserving the original object while ignoring modified contextual cues). Because mismatched captions represent realistic edge cases, this behavior compromises reliability and raises concerns from an explainability perspective. Identifying its underlying causes is thus essential for improving model faithfulness and interpretability. Adopting a mechanistic interpretability viewpoint, this work examines whether embedding anisotropy contributes to counterfactual failures. A similarity-controlled counterfactual caption generation protocol is introduced to systematically perturb object or contextual components within predefined embedding similarity intervals, enabling a fine-grained analysis of grounding behavior as a function of alignment. Experiments on two Transformer-based models with markedly different embedding geometries (BERT-based TransVG and CLIP-based SwimVG) reveal no meaningful correlation between cosine similarity and approximation. These findings suggest that anisotropy alone does not account for counterfactual errors, and that robustness requires investigating finer-grained geometric properties of the embedding space.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_09090
institution	arXiv
publishDate	2026
record_format	arxiv
title	Investigating Anisotropy in Visual Grounding under Controlled Counterfactual Perturbations
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2605.09090