Staff View: A Joint Spectro-Temporal Relational Thinking Based Acoustic Modeling Framework :: Library Catalog

Bibliographic Details
Main Authors:	Nan, Zheng; Dang, Ting; Sethu, Vidhyasaharan; Ahmed, Beena
Format:	Preprint
Published:	2024
Subjects:	Audio and Speech Processing; Computation and Language; Machine Learning; Sound
Online Access:	https://arxiv.org/abs/2409.15357

_version_	1866929512211546112
author	Nan, Zheng; Dang, Ting; Sethu, Vidhyasaharan; Ahmed, Beena
author_facet	Nan, Zheng; Dang, Ting; Sethu, Vidhyasaharan; Ahmed, Beena
contents	Relational thinking refers to the inherent ability of humans to form mental impressions about relations between sensory signals and prior knowledge, and subsequently incorporate them into their model of their world. Despite the crucial role relational thinking plays in human understanding of speech, it has yet to be leveraged in any artificial speech recognition systems. Recently, there have been some attempts to correct this oversight, but these have been limited to coarse utterance-level models that operate exclusively in the time domain. In an attempt to narrow the gap between artificial systems and human abilities, this paper presents a novel spectro-temporal relational thinking based acoustic modeling framework. Specifically, it first generates numerous probabilistic graphs to model the relationships among speech segments across both time and frequency domains. The relational information rooted in every pair of nodes within these graphs is then aggregated and embedded into latent representations that can be utilized by downstream tasks. Models built upon this framework outperform state-of-the-art systems with a 7.82% improvement in phoneme recognition tasks over the TIMIT dataset. In-depth analyses further reveal that our proposed relational thinking modeling mainly improves the model's ability to recognize vowels, which are the most likely to be confused by phoneme recognizers.
format	Preprint
id	arxiv_https___arxiv_org_abs_2409_15357
institution	arXiv
publishDate	2024
record_format	arxiv
title	A Joint Spectro-Temporal Relational Thinking Based Acoustic Modeling Framework
topic	Audio and Speech Processing; Computation and Language; Machine Learning; Sound
url	https://arxiv.org/abs/2409.15357
