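Staff View: DEFORMER: Coupling Deformed Localized Patterns with Global Context for Robust End-to-end Speech Recognition :: Library Catalog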

Bibliographic Details
Main Authors:	Xie, Jiamin; Hansen, John H. L.
Format:	Preprint
Published:	2022
Subjects:	Audio and Speech Processing; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2207.01732

_version_	1866915348555497472
author	Xie, Jiamin Hansen, John H. L.
author_facet	Xie, Jiamin Hansen, John H. L.
contents	Convolutional neural networks (CNNs) have greatly improved speech recognition performance by exploiting localized time-frequency patterns. However, the conventional CNN operation assumes that these patterns appear in symmetric, rigid kernels. This motivates the question: what about asymmetric kernels? In this study, we illustrate that adaptive views can discover local features which couple better with attention than fixed views of the input. We replace the depthwise CNNs in the Conformer architecture with a deformable counterpart and dub the result "Deformer". By analyzing our best-performing model, we visualize both the local receptive fields and the global attention maps learned by the Deformer and show increased feature associations at the utterance level. A statistical analysis of the learned kernel offsets provides insight into how the information in features changes with network depth. Finally, with only half of the encoder layers replaced, the Deformer achieves a 5.6% relative WER improvement without an LM and a 6.4% relative WER improvement with an LM over the Conformer baseline on the WSJ eval92 set.
format	Preprint
id	arxiv_https___arxiv_org_abs_2207_01732
institution	arXiv
publishDate	2022
record_format	arxiv
title	DEFORMER: Coupling Deformed Localized Patterns with Global Context for Robust End-to-end Speech Recognition
topic	Audio and Speech Processing Artificial Intelligence
url	https://arxiv.org/abs/2207.01732

Similar Items