Bibliographic Details
Main Authors: Best, Paul; Cuervo, Santiago; Marxer, Ricard
Format: Preprint
Published: 2024
Online Access: https://arxiv.org/abs/2404.01737
contents Macroscopic intelligibility models predict the expected human word-error-rate for a given speech-in-noise stimulus. In contrast, microscopic intelligibility models aim to make fine-grained predictions about listeners' perception, e.g., predicting phonetic or lexical responses. State-of-the-art macroscopic models use transfer learning from large-scale deep learning models for speech processing, whereas such methods have rarely been used for microscopic modeling. In this paper, we study the use of transfer learning from Whisper, a state-of-the-art deep learning model for automatic speech recognition, for microscopic intelligibility prediction at the level of lexical responses. Our method outperforms the considered baselines, even in a zero-shot setup, and yields a relative improvement of up to 66% when fine-tuned to predict listeners' responses. Our results showcase the promise of large-scale deep-learning-based methods for microscopic intelligibility prediction.
id arxiv_https___arxiv_org_abs_2404_01737
institution arXiv
record_format arxiv
title Transfer Learning from Whisper for Microscopic Intelligibility Prediction
topic Audio and Speech Processing
Computation and Language
Sound