Bibliographic Details
Main Authors: Kim, Minsu, Jung, Jee-weon, Rha, Hyeongseop, Maiti, Soumi, Arora, Siddhant, Chang, Xuankai, Watanabe, Shinji, Ro, Yong Man
Format: Preprint
Published: 2024
Subjects:
Online Access: https://arxiv.org/abs/2402.16021
author Kim, Minsu
Jung, Jee-weon
Rha, Hyeongseop
Maiti, Soumi
Arora, Siddhant
Chang, Xuankai
Watanabe, Shinji
Ro, Yong Man
contents Jointly processing multi-modal information is becoming an essential capability. However, the limited amount of paired multi-modal data and the large computational requirements of multi-modal learning hinder its development. We propose a novel Tri-Modal Translation (TMT) model that translates between arbitrary modalities spanning speech, image, and text. We introduce a novel viewpoint in which we interpret different modalities as different languages and treat multi-modal translation as a well-established machine translation problem. To this end, we tokenize speech and image data into discrete tokens, which provide a unified interface across modalities and significantly decrease the computational cost. In the proposed TMT, a multi-modal encoder-decoder conducts the core translation, whereas modality-specific processing is conducted only within the tokenization and detokenization stages. We evaluate the proposed TMT on all six modality translation tasks. TMT consistently outperforms single-model counterparts, demonstrating that unifying tasks is beneficial not only for practicality but also for performance.
format Preprint
id arxiv_https___arxiv_org_abs_2402_16021
institution arXiv
publishDate 2024
record_format arxiv
title TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages
topic Computation and Language
Artificial Intelligence
Computer Vision and Pattern Recognition
Audio and Speech Processing
url https://arxiv.org/abs/2402.16021
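
Illustrative note: the abstract in this record mentions "all six modality translation tasks" and a "unified interface" of discrete tokens. The sketch below is one possible reading of those two ideas, not the authors' implementation. It lists the six ordered source-to-target pairs among speech, image, and text, and maps per-modality token ids into one shared id space using offsets. The offset scheme, the vocabulary sizes, and all function names are assumptions made for illustration only; the record does not specify them.

```python
from itertools import permutations

MODALITIES = ("speech", "image", "text")

# Hypothetical vocabulary sizes, chosen only for illustration.
VOCAB_SIZES = {"speech": 200, "image": 1024, "text": 5000}


def translation_directions(modalities=MODALITIES):
    """Return every ordered (source, target) pair of distinct modalities."""
    return list(permutations(modalities, 2))


def build_offsets(vocab_sizes):
    """Give each modality a disjoint id range inside one shared vocabulary.

    Assumed scheme: ranges are laid out back to back in MODALITIES order.
    Returns the per-modality start offsets and the total vocabulary size.
    """
    offsets, start = {}, 0
    for name in MODALITIES:
        offsets[name] = start
        start += vocab_sizes[name]
    return offsets, start


def to_shared_ids(modality, local_ids, offsets):
    """Shift modality-local token ids into the shared id space."""
    return [offsets[modality] + i for i in local_ids]


directions = translation_directions()
offsets, total = build_offsets(VOCAB_SIZES)
print(len(directions))                            # 6 ordered pairs
print(to_shared_ids("image", [0, 5], offsets))    # [200, 205]
```

With disjoint id ranges, a single sequence-to-sequence model could in principle consume and emit tokens from any modality, which is the sense in which discrete tokens can act as a common interface.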