Bibliographic Details
Main Authors: C, Kishan K, Tan, Zhenning, Chen, Long, Jin, Minho, Han, Eunjung, Stolcke, Andreas, Lee, Chul
Format: Preprint
Published: 2022
Subjects:
Online Access: https://arxiv.org/abs/2202.12349
author C, Kishan K
Tan, Zhenning
Chen, Long
Jin, Minho
Han, Eunjung
Stolcke, Andreas
Lee, Chul
contents Household speaker identification with few enrollment utterances is an important yet challenging problem, especially when household members share similar voice characteristics and room acoustics. A common embedding space learned from a large number of speakers is not universally applicable for the optimal identification of every speaker in a household. In this work, we first formulate household speaker identification as a few-shot open-set recognition task and then propose a novel embedding adaptation framework to adapt speaker representations from the given universal embedding space to a household-specific embedding space using a set-to-set function, yielding better household speaker identification performance. With our algorithm, Open-set Few-shot Embedding Adaptation with Transformer (openFEAT), we observe that the speaker identification equal error rate (IEER) on simulated households with 2 to 7 hard-to-discriminate speakers is reduced by 23% to 31% relative.
format Preprint
id arxiv_https___arxiv_org_abs_2202_12349
institution arXiv
publishDate 2022
record_format arxiv
title openFEAT: Improving Speaker Identification by Open-set Few-shot Embedding Adaptation with Transformer
topic Audio and Speech Processing
url https://arxiv.org/abs/2202.12349