Staff View: Enhancing Acoustic-to-Articulatory Speech Inversion by Incorporating Nasality

Bibliographic Details
Main Authors:	Tabatabaee, Saba; Boyce, Suzanne; Oren, Liran; Tiede, Mark; Espy-Wilson, Carol
Format:	Preprint
Published:	2025
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2506.09231

_version_	1866911129068896256
author	Tabatabaee, Saba; Boyce, Suzanne; Oren, Liran; Tiede, Mark; Espy-Wilson, Carol
author_facet	Tabatabaee, Saba; Boyce, Suzanne; Oren, Liran; Tiede, Mark; Espy-Wilson, Carol
contents	Speech is produced through the coordination of vocal tract constricting organs: lips, tongue, velum, and glottis. Previous works developed Speech Inversion (SI) systems to recover acoustic-to-articulatory mappings for lip and tongue constrictions, called oral tract variables (TVs), which were later enhanced by including source information (periodic and aperiodic energies, and F0 frequency) as proxies for glottal control. Comparison of the nasometric measures with high-speed nasopharyngoscopy showed that nasalance can serve as ground truth, and that an SI system trained with it reliably recovers velum movement patterns for American English speakers. Here, two SI training approaches are compared: baseline models that estimate oral TVs and nasalance independently, and a synergistic model that combines oral TVs and source features with nasalance. The synergistic model shows relative improvements of 5% in oral TVs estimation and 9% in nasalance estimation compared to the baseline models.
format	Preprint
id	arxiv_https___arxiv_org_abs_2506_09231
institution	arXiv
publishDate	2025
record_format	arxiv
title	Enhancing Acoustic-to-Articulatory Speech Inversion by Incorporating Nasality
topic	Audio and Speech Processing
url	https://arxiv.org/abs/2506.09231

Similar Items