Staff View: Audio Transformers :: Library Catalog

Bibliographic Details
Main Authors:	Verma, Prateek; Berger, Jonathan
Format:	Preprint
Published:	2021
Subjects:	Sound; Artificial Intelligence; Machine Learning; Multimedia; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2105.00335

_version_	1866908357698256896
author	Verma, Prateek; Berger, Jonathan
author_facet	Verma, Prateek; Berger, Jonathan
contents	Over the past two decades, CNN architectures have produced compelling models of sound perception and cognition, learning hierarchical organizations of features. Analogous to successes in computer vision, audio feature classification can be optimized for a particular task of interest over a wide variety of datasets and labels. In fact, similar architectures designed for image understanding have proven effective for acoustic scene analysis. Here we propose applying Transformer-based architectures without convolutional layers to raw audio signals. On the standard Free Sound 50K dataset, comprising 200 categories, our model outperforms convolutional models to produce state-of-the-art results. This is significant because, unlike in natural language processing and computer vision, we do not perform unsupervised pre-training to outperform convolutional architectures. On the same training set, we show a significant improvement with respect to mean average precision benchmarks. We further improve the performance of Transformer architectures by using techniques such as pooling, inspired by convolutional networks designed in the past few years. In addition, we show how multi-rate signal processing ideas inspired by wavelets can be applied to the Transformer embeddings to improve the results. We also show how our model learns a non-linear, non-constant-bandwidth filter bank, which shows an adaptable time-frequency front-end representation for the task of audio understanding, different from other tasks, e.g. pitch estimation.
format	Preprint
id	arxiv_https___arxiv_org_abs_2105_00335
institution	arXiv
publishDate	2021
record_format	arxiv
title	Audio Transformers
topic	Sound; Artificial Intelligence; Machine Learning; Multimedia; Audio and Speech Processing
url	https://arxiv.org/abs/2105.00335