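Staff View: CoLLAP: Contrastive Long-form Language-Audio Pretraining with Musical Temporal Structure Augmentation :: Library Catalog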

Bibliographic Details
Main Authors:	Wu, Junda; Li, Warren; Novack, Zachary; Namburi, Amit; Chen, Carol; McAuley, Julian
Format:	Preprint
Published:	2024
Subjects:	Sound; Artificial Intelligence; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2410.02271

_version_	1866910630853738496
author	Wu, Junda; Li, Warren; Novack, Zachary; Namburi, Amit; Chen, Carol; McAuley, Julian
author_facet	Wu, Junda; Li, Warren; Novack, Zachary; Namburi, Amit; Chen, Carol; McAuley, Julian
contents	Modeling temporal characteristics plays a significant role in the representation learning of audio waveforms. We propose Contrastive Long-form Language-Audio Pretraining (CoLLAP) to significantly extend the perception window for both the input audio (up to 5 minutes) and the language descriptions (exceeding 250 words), while enabling contrastive learning across modalities and temporal dynamics. Leveraging recent Music-LLMs to generate long-form music captions for full-length songs, augmented with musical temporal structures, we collect 51.3K audio-text pairs derived from the large-scale AudioSet training dataset, where the average audio length reaches 288 seconds. We propose a novel contrastive learning architecture that fuses language representations with structured audio representations by segmenting each song into clips and extracting their embeddings. With an attention mechanism, we capture multimodal temporal correlations, allowing the model to automatically weigh and enhance the final fusion score for improved contrastive alignment. Finally, we develop two variants of the CoLLAP model with different types of backbone language models. Through comprehensive experiments on multiple long-form music-text retrieval datasets, we demonstrate consistent improvements in retrieval accuracy over baselines. We also show that the pretrained CoLLAP models can be transferred to various music information retrieval tasks with heterogeneous long-form multimodal contexts.
format	Preprint
id	arxiv_https___arxiv_org_abs_2410_02271
institution	arXiv
publishDate	2024
record_format	arxiv
title	CoLLAP: Contrastive Long-form Language-Audio Pretraining with Musical Temporal Structure Augmentation
topic	Sound; Artificial Intelligence; Audio and Speech Processing
url	https://arxiv.org/abs/2410.02271
