Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Lepori, Michael A.; Tartaglini, Alexa R.; Vong, Wai Keen; Serre, Thomas; Lake, Brenden M.; Pavlick, Ellie
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2406.15955

_version_	1866910709659467776
author	Lepori, Michael A.; Tartaglini, Alexa R.; Vong, Wai Keen; Serre, Thomas; Lake, Brenden M.; Pavlick, Ellie
author_facet	Lepori, Michael A.; Tartaglini, Alexa R.; Vong, Wai Keen; Serre, Thomas; Lake, Brenden M.; Pavlick, Ellie
contents	Though vision transformers (ViTs) have achieved state-of-the-art performance in a variety of settings, they exhibit surprising failures when performing tasks involving visual relations. This begs the question: how do ViTs attempt to perform tasks that require computing visual relations between objects? Prior efforts to interpret ViTs tend to focus on characterizing relevant low-level visual features. In contrast, we adopt methods from mechanistic interpretability to study the higher-level visual algorithms that ViTs use to perform abstract visual reasoning. We present a case study of a fundamental, yet surprisingly difficult, relational reasoning task: judging whether two visual entities are the same or different. We find that pretrained ViTs fine-tuned on this task often exhibit two qualitatively different stages of processing despite having no obvious inductive biases to do so: 1) a perceptual stage wherein local object features are extracted and stored in a disentangled representation, and 2) a relational stage wherein object representations are compared. In the second stage, we find evidence that ViTs can learn to represent somewhat abstract visual relations, a capability that has long been considered out of reach for artificial neural networks. Finally, we demonstrate that failures at either stage can prevent a model from learning a generalizable solution to our fairly simple tasks. By understanding ViTs in terms of discrete processing stages, one can more precisely diagnose and rectify shortcomings of existing and future models.
format	Preprint
id	arxiv_https___arxiv_org_abs_2406_15955
institution	arXiv
publishDate	2024
record_format	arxiv
title	Beyond the Doors of Perception: Vision Transformers Represent Relations Between Objects
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2406.15955

Similar Items