Bibliographic Details
Main Authors: Lin, Weifeng, Wei, Xinyu, An, Ruichuan, Gao, Peng, Zou, Bocheng, Luo, Yulin, Huang, Siyuan, Zhang, Shanghang, Li, Hongsheng
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2403.20271
author Lin, Weifeng
Wei, Xinyu
An, Ruichuan
Gao, Peng
Zou, Bocheng
Luo, Yulin
Huang, Siyuan
Zhang, Shanghang
Li, Hongsheng
contents In this paper, we present the Draw-and-Understand framework, exploring how to integrate visual prompting understanding capabilities into Multimodal Large Language Models (MLLMs). Visual prompts allow users to interact through multi-modal instructions, enhancing the models' interactivity and fine-grained image comprehension. In this framework, we propose a general architecture adaptable to different pre-trained MLLMs, enabling it to recognize various types of visual prompts (such as points, bounding boxes, and free-form shapes) alongside language understanding. Additionally, we introduce MDVP-Instruct-Data, a multi-domain dataset featuring 1.2 million image-visual prompt-text triplets, including natural images, document images, scene text images, mobile/web screenshots, and remote sensing images. Building on this dataset, we introduce MDVP-Bench, a challenging benchmark designed to evaluate a model's ability to understand visual prompting instructions. The experimental results demonstrate that our framework can be easily and effectively applied to various MLLMs, such as SPHINX-X and LLaVA. After training with MDVP-Instruct-Data and image-level instruction datasets, our models exhibit impressive multimodal interaction capabilities and pixel-level understanding, while maintaining their image-level visual perception performance.
format Preprint
id arxiv_https___arxiv_org_abs_2403_20271
institution arXiv
publishDate 2024
record_format arxiv
title Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2403.20271