Bibliographic Details
Main Authors: Xue, Han; Sun, Qianru; Song, Li; Zhang, Wenjun; Huang, Zhiwu
Format: Preprint
Published: 2024
Online Access: https://arxiv.org/abs/2404.09633
_version_ 1866910686252105728
author Xue, Han
Sun, Qianru
Song, Li
Zhang, Wenjun
Huang, Zhiwu
contents We propose In-Context Translation (ICT), a general learning framework to unify visual recognition (e.g., semantic segmentation), low-level image processing (e.g., denoising), and conditional image generation (e.g., edge-to-image synthesis). Thanks to unification, ICT significantly reduces the inductive bias inherent in designing models for specific tasks, and it maximizes mutual enhancement across similar tasks. However, unifying a large number of tasks is non-trivial because their data formats and training pipelines differ. To this end, ICT introduces two designs. First, it standardizes the input-output data of different tasks into RGB image pairs; for example, semantic segmentation data pairs an RGB image with its segmentation mask rendered in the same RGB format. This turns different tasks into a general translation task between two RGB images. Second, it standardizes the training of different tasks into a general in-context learning procedure, where "in-context" means that the input comprises an example input-output pair of the target task and a query image. The learning objective is to generate the "missing" data paired with the query. The implicit translation process is thus between the query and the generated image. In experiments, ICT unifies ten vision tasks and showcases impressive performance on their respective benchmarks. Notably, ICT performs well across three major categories of computer vision tasks, while its two competitors (Painter and PromptDiffusion) are effective in at most two of these categories. In addition, ICT, trained on only 4 RTX 3090 GPUs, is shown to be more efficient and less costly to train than its competitors.
format Preprint
id arxiv_https___arxiv_org_abs_2404_09633
institution arXiv
publishDate 2024
record_format arxiv
title In-Context Translation: Towards Unifying Image Recognition, Processing, and Generation
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2404.09633