Staff View: Multi-Axis Speech Similarity via Factor-Partitioned Embeddings :: Library Catalog

Bibliographic Details
Main Authors:	O'Regan, Jim; Edlund, Jens
Format:	Preprint
Published:	2026
Subjects:	Audio and Speech Processing; Information Retrieval
Online Access:	https://arxiv.org/abs/2605.02804

_version_	1866910200089280512
author	O'Regan, Jim; Edlund, Jens
author_facet	O'Regan, Jim; Edlund, Jens
contents	Speech encodes multiple simultaneous attributes -- linguistic content, speaker identity, dialect, gender -- that conventional single-vector embeddings conflate. We present a factor-partitioned embedding framework that maps each utterance into a single vector whose subspaces correspond to distinct axes of variation. A shared acoustic encoder feeds per-axis linear projection heads, each trained via distillation from a specialist teacher or a contrastive objective over shared-label pairs. The resulting embeddings support attribute-conditioned retrieval: similarity is computed as a signed weighted sum over per-axis cosine scores, allowing retrieval that jointly considers what was said and how -- or explicitly suppresses one attribute to surface another. We evaluate on cross-corpus retrieval over corpora sharing the Harvard sentence prompts, demonstrating that signed axis weighting can suppress same-speaker bias and surface semantically matched utterances across recording conditions. Code is available at: https://github.com/jimregan/spoken-sentence-transformers
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_02804
institution	arXiv
publishDate	2026
record_format	arxiv
title	Multi-Axis Speech Similarity via Factor-Partitioned Embeddings
topic	Audio and Speech Processing; Information Retrieval
url	https://arxiv.org/abs/2605.02804
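
Illustrative note on the scoring rule described in the contents field: similarity is a signed weighted sum over per-axis cosine scores. The sketch below is a minimal plain-Python reading of that sentence and is not taken from the authors' repository. The axis names, the slice layout in AXES, the function names, and the example weights are all hypothetical, chosen only to show how a negative weight on one axis could push same-speaker matches down the ranking.

```python
import math

# Hypothetical layout: which slice of the full vector belongs to each axis.
# The real framework's axis names and dimensions are not given in this record.
AXES = {"content": slice(0, 3), "speaker": slice(3, 5)}


def cosine(u, v):
    """Cosine similarity of two equal-length lists; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def axis_scores(x, y, axes=AXES):
    """Cosine score computed separately inside each axis subspace."""
    return {name: cosine(x[s], y[s]) for name, s in axes.items()}


def multi_axis_similarity(x, y, weights, axes=AXES):
    """Signed weighted sum of per-axis cosine scores (missing weights count as 0)."""
    scores = axis_scores(x, y, axes)
    return sum(weights.get(name, 0.0) * score for name, score in scores.items())


def rank(query, candidates, weights, axes=AXES):
    """Return candidate keys ordered from most to least similar to the query."""
    scored = [
        (multi_axis_similarity(query, vec, weights, axes), key)
        for key, vec in candidates.items()
    ]
    return [key for _, key in sorted(scored, reverse=True)]


# Toy vectors: first three dims stand in for content, last two for speaker.
query = [1.0, 0.0, 0.0, 1.0, 0.0]
candidates = {
    "same_speaker_other_text": [0.0, 1.0, 0.0, 1.0, 0.0],
    "other_speaker_same_text": [1.0, 0.0, 0.0, 0.0, 1.0],
}

# A negative speaker weight penalises same-speaker matches.
suppress_speaker = {"content": 1.0, "speaker": -0.5}
print(rank(query, candidates, suppress_speaker))
```

With the toy vectors above, a negative speaker weight ranks the matching sentence from a different speaker first; flipping the speaker weight to a large positive value reverses the order.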

Similar Items