Staff View: Hijacking Context in Large Multi-modal Models :: Library Catalog

Bibliographic Details
Main Author:	Jeong, Joonhyun
Format:	Preprint
Published:	2023
Subjects:	Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2312.07553

_version_	1866917663488344064
author	Jeong, Joonhyun
author_facet	Jeong, Joonhyun
contents	Recently, Large Multi-modal Models (LMMs) have demonstrated their ability to understand the visual contents of images given the instructions regarding the images. Built upon the Large Language Models (LLMs), LMMs also inherit their abilities and characteristics such as in-context learning where a coherent sequence of images and texts is given as the input prompt. However, we identify a new limitation of off-the-shelf LMMs where a small fraction of incoherent images or text descriptions misleads LMMs into generating only biased output about the hijacked context, not the originally intended context. To address this, we propose a pre-filtering method that removes irrelevant contexts via GPT-4V, based on its robustness towards distribution shift within the contexts. We further investigate whether replacing the hijacked visual and textual contexts with the correlated ones via GPT-4V and text-to-image models can help yield coherent responses.
format	Preprint
id	arxiv_https___arxiv_org_abs_2312_07553
institution	arXiv
publishDate	2023
record_format	arxiv
title	Hijacking Context in Large Multi-modal Models
topic	Artificial Intelligence; Computation and Language
url	https://arxiv.org/abs/2312.07553