Staff View: All in an Aggregated Image for In-Image Learning :: Library Catalog


Bibliographic Details
Main Authors:	Wang, Lei; Xu, Wanyu; Hu, Zhiqiang; Lan, Yihuai; Dong, Shan; Wang, Hao; Lee, Roy Ka-Wei; Lim, Ee-Peng
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2402.17971

_version_	1866917628495265792
author	Wang, Lei; Xu, Wanyu; Hu, Zhiqiang; Lan, Yihuai; Dong, Shan; Wang, Hao; Lee, Roy Ka-Wei; Lim, Ee-Peng
author_facet	Wang, Lei; Xu, Wanyu; Hu, Zhiqiang; Lan, Yihuai; Dong, Shan; Wang, Hao; Lee, Roy Ka-Wei; Lim, Ee-Peng
contents	This paper introduces a new in-context learning (ICL) mechanism called In-Image Learning (I$^2$L) that combines demonstration examples, visual cues, and chain-of-thought reasoning into an aggregated image to enhance the capabilities of Large Multimodal Models (e.g., GPT-4V) in multimodal reasoning tasks. Unlike previous approaches that rely on converting images to text or incorporating visual input into language models, I$^2$L consolidates all information into an aggregated image and leverages image processing, understanding, and reasoning abilities. This has several advantages: it reduces inaccurate textual descriptions of complex images, provides flexibility in positioning demonstration examples, and avoids multiple input images and lengthy prompts. We also introduce I$^2$L-Hybrid, a method that combines the strengths of I$^2$L with other ICL methods. Specifically, it uses an automatic strategy to select the most suitable method (I$^2$L or another certain ICL method) for a specific task instance. We conduct extensive experiments to assess the effectiveness of I$^2$L and I$^2$L-Hybrid on MathVista, which covers a variety of complex multimodal reasoning tasks. Additionally, we investigate the influence of image resolution, the number of demonstration examples in a single image, and the positions of these demonstrations in the aggregated image on the effectiveness of I$^2$L. Our code is publicly available at https://github.com/AGI-Edgerunners/IIL.
format	Preprint
id	arxiv_https___arxiv_org_abs_2402_17971
institution	arXiv
publishDate	2024
record_format	arxiv
title	All in an Aggregated Image for In-Image Learning
topic	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language
url	https://arxiv.org/abs/2402.17971
