Staff View: Pre-training with Synthetic Patterns for Audio :: Library Catalog

Bibliographic Details
Main Authors:	Ishikawa, Yuchi; Komatsu, Tatsuya; Aoki, Yoshimitsu
Format:	Preprint
Published:	2024
Subjects:	Audio and Speech Processing; Artificial Intelligence; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2410.00511

_version_	1866929522481299456
author	Ishikawa, Yuchi; Komatsu, Tatsuya; Aoki, Yoshimitsu
author_facet	Ishikawa, Yuchi; Komatsu, Tatsuya; Aoki, Yoshimitsu
contents	In this paper, we propose to pre-train audio encoders using synthetic patterns instead of real audio data. Our proposed framework consists of two key elements. The first is the Masked Autoencoder (MAE), a self-supervised learning framework that learns by reconstructing data from randomly masked counterparts. MAEs tend to focus on low-level information such as visual patterns and regularities within data. Therefore, it is unimportant what is portrayed in the input, whether it be images, audio mel-spectrograms, or even synthetic patterns. This leads to the second key element, which is synthetic data. Synthetic data, unlike real audio, is free from privacy and licensing infringement issues. By combining MAEs and synthetic patterns, our framework enables the model to learn generalized feature representations without real data, while addressing the issues related to real audio. To evaluate the efficacy of our framework, we conduct extensive experiments across a total of 13 audio tasks and 17 synthetic datasets. The experiments provide insights into which types of synthetic patterns are effective for audio. Our results demonstrate that our framework achieves performance comparable to models pre-trained on AudioSet-2M and partially outperforms image-based pre-training methods.
format	Preprint
id	arxiv_https___arxiv_org_abs_2410_00511
institution	arXiv
publishDate	2024
record_format	arxiv
title	Pre-training with Synthetic Patterns for Audio
topic	Audio and Speech Processing; Artificial Intelligence; Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2410.00511

Similar Items