Bibliographic Details
Main Authors: Tabatabaee, Saba; Boyce, Suzanne; Oren, Liran; Tiede, Mark; Espy-Wilson, Carol
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2506.09231
author Tabatabaee, Saba
Boyce, Suzanne
Oren, Liran
Tiede, Mark
Espy-Wilson, Carol
contents Speech is produced through the coordination of the vocal tract's constricting organs: the lips, tongue, velum, and glottis. Previous work developed Speech Inversion (SI) systems that recover the acoustic-to-articulatory mapping for lip and tongue constrictions, called oral tract variables (TVs); these systems were later enhanced by including source information (periodic and aperiodic energies and fundamental frequency, F0) as proxies for glottal control. A comparison of nasometric measures with high-speed nasopharyngoscopy showed that nasalance can serve as ground truth, and that an SI system trained with it reliably recovers velum movement patterns for American English speakers. Here, two SI training approaches are compared: baseline models that estimate oral TVs and nasalance independently, and a synergistic model that combines oral TVs and source features with nasalance. The synergistic model shows relative improvements of 5% in oral TV estimation and 9% in nasalance estimation over the baseline models.
format Preprint
id arxiv_https___arxiv_org_abs_2506_09231
institution arXiv
publishDate 2025
record_format arxiv
title Enhancing Acoustic-to-Articulatory Speech Inversion by Incorporating Nasality
topic Audio and Speech Processing
url https://arxiv.org/abs/2506.09231