Bibliographic Details
Main Authors: Hu, Hexiang, Chan, Kelvin C. K., Su, Yu-Chuan, Chen, Wenhu, Li, Yandong, Sohn, Kihyuk, Zhao, Yang, Ben, Xue, Gong, Boqing, Cohen, William, Chang, Ming-Wei, Jia, Xuhui
Format: Preprint
Published: 2024
Subjects:
Online Access: https://arxiv.org/abs/2401.01952
author Hu, Hexiang
Chan, Kelvin C. K.
Su, Yu-Chuan
Chen, Wenhu
Li, Yandong
Sohn, Kihyuk
Zhao, Yang
Ben, Xue
Gong, Boqing
Cohen, William
Chang, Ming-Wei
Jia, Xuhui
contents This paper presents instruct-imagen, a model that tackles heterogeneous image generation tasks and generalizes to unseen tasks. We introduce *multi-modal instruction* for image generation, a task representation that articulates a range of generation intents with precision. It uses natural language to amalgamate disparate modalities (e.g., text, edge, style, and subject), so that a wide range of generation intents can be standardized in a uniform format. We then build instruct-imagen by fine-tuning a pre-trained text-to-image diffusion model with a two-stage framework. First, we adapt the model using retrieval-augmented training to enhance its ability to ground its generation on external multi-modal context. Subsequently, we fine-tune the adapted model on diverse image generation tasks that require vision-language understanding (e.g., subject-driven generation), each paired with a multi-modal instruction encapsulating the task's essence. Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain and demonstrates promising generalization to unseen and more complex tasks.
format Preprint
id arxiv_https___arxiv_org_abs_2401_01952
institution arXiv
publishDate 2024
record_format arxiv
title Instruct-Imagen: Image Generation with Multi-modal Instruction
topic Computer Vision and Pattern Recognition
Artificial Intelligence
Computation and Language
url https://arxiv.org/abs/2401.01952