Staff View: Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want :: Library Catalog


Bibliographic Details
Main Authors:	Lin, Weifeng; Wei, Xinyu; An, Ruichuan; Gao, Peng; Zou, Bocheng; Luo, Yulin; Huang, Siyuan; Zhang, Shanghang; Li, Hongsheng
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2403.20271

_version_	1866912240793288704
author	Lin, Weifeng; Wei, Xinyu; An, Ruichuan; Gao, Peng; Zou, Bocheng; Luo, Yulin; Huang, Siyuan; Zhang, Shanghang; Li, Hongsheng
author_facet	Lin, Weifeng; Wei, Xinyu; An, Ruichuan; Gao, Peng; Zou, Bocheng; Luo, Yulin; Huang, Siyuan; Zhang, Shanghang; Li, Hongsheng
contents	In this paper, we present the Draw-and-Understand framework, exploring how to integrate visual prompting understanding capabilities into Multimodal Large Language Models (MLLMs). Visual prompts allow users to interact through multi-modal instructions, enhancing the models' interactivity and fine-grained image comprehension. In this framework, we propose a general architecture adaptable to different pre-trained MLLMs, enabling it to recognize various types of visual prompts (such as points, bounding boxes, and free-form shapes) alongside language understanding. Additionally, we introduce MDVP-Instruct-Data, a multi-domain dataset featuring 1.2 million image-visual prompt-text triplets, including natural images, document images, scene text images, mobile/web screenshots, and remote sensing images. Building on this dataset, we introduce MDVP-Bench, a challenging benchmark designed to evaluate a model's ability to understand visual prompting instructions. The experimental results demonstrate that our framework can be easily and effectively applied to various MLLMs, such as SPHINX-X and LLaVA. After training with MDVP-Instruct-Data and image-level instruction datasets, our models exhibit impressive multimodal interaction capabilities and pixel-level understanding, while maintaining their image-level visual perception performance.
format	Preprint
id	arxiv_https___arxiv_org_abs_2403_20271
institution	arXiv
publishDate	2024
record_format	arxiv
title	Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2403.20271

Similar Items