Staff View: Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation :: Library Catalog

Bibliographic Details
Main Authors:	Luong, Manh; Nguyen, Khai; Ho, Nhat; Haf, Reza; Phung, Dinh; Qu, Lizhen
Format:	Preprint
Published:	2024
Subjects:	Audio and Speech Processing; Artificial Intelligence; Sound
Online Access:	https://arxiv.org/abs/2405.10084

_version_	1866929345890615296
author	Luong, Manh; Nguyen, Khai; Ho, Nhat; Haf, Reza; Phung, Dinh; Qu, Lizhen
author_facet	Luong, Manh; Nguyen, Khai; Ho, Nhat; Haf, Reza; Phung, Dinh; Qu, Lizhen
contents	The Learning-to-match (LTM) framework proves to be an effective inverse optimal transport approach for learning the underlying ground metric between two sources of data, facilitating subsequent matching. However, the conventional LTM framework faces scalability challenges because it requires the entire dataset each time the parameters of the ground metric are updated. To adapt LTM to the deep learning context, we introduce the mini-batch Learning-to-match (m-LTM) framework for audio-text retrieval problems. This framework leverages mini-batch subsampling and a Mahalanobis-enhanced family of ground metrics. Moreover, to cope with misaligned training data in practice, we propose a variant that uses partial optimal transport to mitigate the harm of misaligned data pairs. We conduct extensive experiments on audio-text matching problems using three datasets: AudioCaps, Clotho, and ESC-50. Results demonstrate that our proposed method learns a rich and expressive joint embedding space and achieves state-of-the-art (SOTA) performance. Beyond this, the proposed m-LTM framework is able to close the modality gap between audio and text embeddings, surpassing both triplet and contrastive loss in the zero-shot sound event detection task on the ESC-50 dataset. Notably, our strategy of employing partial optimal transport with m-LTM demonstrates greater noise tolerance than contrastive loss, especially under varying noise ratios in the AudioCaps training data. Our code is available at https://github.com/v-manhlt3/m-LTM-Audio-Text-Retrieval
format	Preprint
id	arxiv_https___arxiv_org_abs_2405_10084
institution	arXiv
publishDate	2024
record_format	arxiv
title	Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation
topic	Audio and Speech Processing; Artificial Intelligence; Sound
url	https://arxiv.org/abs/2405.10084

Similar Items