Staff View: Take A Step Back: Rethinking the Two Stages in Visual Reasoning :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Mingyu; Cai, Jiting; Liu, Mingyu; Xu, Yue; Lu, Cewu; Li, Yong-Lu
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2407.19666

_version_	1866911970985246720
author	Zhang, Mingyu; Cai, Jiting; Liu, Mingyu; Xu, Yue; Lu, Cewu; Li, Yong-Lu
author_facet	Zhang, Mingyu; Cai, Jiting; Liu, Mingyu; Xu, Yue; Lu, Cewu; Li, Yong-Lu
contents	Visual reasoning, as a prominent research area, plays a crucial role in AI by facilitating concept formation and interaction with the world. However, current works are usually carried out separately on small datasets and thus lack generalization ability. Through rigorous evaluation on diverse benchmarks, we demonstrate the shortcomings of existing ad-hoc methods in achieving cross-domain reasoning and their tendency to fit data biases. In this paper, we revisit visual reasoning from a two-stage perspective: (1) symbolization and (2) logical reasoning given symbols or their representations. We find that the reasoning stage generalizes better than the symbolization stage. Thus, it is more efficient to implement symbolization via separate encoders for different data domains while using a shared reasoner. Given our findings, we establish design principles for visual reasoning frameworks that follow separated symbolization and shared reasoning. The proposed two-stage framework achieves impressive generalization ability on various visual reasoning tasks, including puzzles, physical prediction, and visual question answering (VQA), encompassing both 2D and 3D modalities. We believe our insights will pave the way for generalizable visual reasoning.
format	Preprint
id	arxiv_https___arxiv_org_abs_2407_19666
institution	arXiv
publishDate	2024
record_format	arxiv
title	Take A Step Back: Rethinking the Two Stages in Visual Reasoning
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2407.19666

Similar Items