Bibliographic Details
Main Author: Jeong, Joonhyun
Format: Preprint
Published: 2023
Subjects: Artificial Intelligence; Computation and Language
Online Access: https://arxiv.org/abs/2312.07553
author Jeong, Joonhyun
contents Recently, Large Multi-modal Models (LMMs) have demonstrated their ability to understand the visual contents of images given instructions about those images. Built upon Large Language Models (LLMs), LMMs also inherit their abilities and characteristics, such as in-context learning, where a coherent sequence of images and texts is given as the input prompt. However, we identify a new limitation of off-the-shelf LMMs: a small fraction of incoherent images or text descriptions misleads LMMs into generating only biased output about the hijacked context, not the originally intended context. To address this, we propose a pre-filtering method that removes irrelevant contexts via GPT-4V, based on its robustness to distribution shift within the contexts. We further investigate whether replacing the hijacked visual and textual contexts with correlated ones via GPT-4V and text-to-image models can help yield coherent responses.
format Preprint
id arxiv_https___arxiv_org_abs_2312_07553
institution arXiv
publishDate 2023
record_format arxiv
title Hijacking Context in Large Multi-modal Models
topic Artificial Intelligence
Computation and Language
url https://arxiv.org/abs/2312.07553