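| Main Authors: | Hu, Hexiang; Chan, Kelvin C. K.; Su, Yu-Chuan; Chen, Wenhu; Li, Yandong; Sohn, Kihyuk; Zhao, Yang; Ben, Xue; Gong, Boqing; Cohen, William; Chang, Ming-Wei; Jia, Xuhui |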
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language |
| Online Access: | https://arxiv.org/abs/2401.01952 |
| _version_ | 1866914629490311168 |
|---|---|
| author | Hu, Hexiang; Chan, Kelvin C. K.; Su, Yu-Chuan; Chen, Wenhu; Li, Yandong; Sohn, Kihyuk; Zhao, Yang; Ben, Xue; Gong, Boqing; Cohen, William; Chang, Ming-Wei; Jia, Xuhui |
| contents | This paper presents instruct-imagen, a model that tackles heterogeneous image generation tasks and generalizes to unseen tasks. We introduce *multi-modal instruction* for image generation, a task representation that articulates a range of generation intents with precision. It uses natural language to amalgamate disparate modalities (e.g., text, edge, style, and subject), so that abundant generation intents can be standardized in a uniform format.
We then build instruct-imagen by fine-tuning a pre-trained text-to-image diffusion model with a two-stage framework. First, we adapt the model using retrieval-augmented training to enhance the model's ability to ground its generation on external multimodal context. Subsequently, we fine-tune the adapted model on diverse image generation tasks that require vision-language understanding (e.g., subject-driven generation), each paired with a multi-modal instruction encapsulating the task's essence. Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain and demonstrates promising generalization to unseen and more complex tasks. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2401_01952 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| title | Instruct-Imagen: Image Generation with Multi-modal Instruction |
| topic | Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language |
| url | https://arxiv.org/abs/2401.01952 |