Bibliographic Details
Main Authors: O'Regan, Jim; Edlund, Jens
Format: Preprint
Published: 2026
Subjects: Audio and Speech Processing; Information Retrieval
Online Access: https://arxiv.org/abs/2605.02804
_version_ 1866910200089280512
author O'Regan, Jim
Edlund, Jens
author_facet O'Regan, Jim
Edlund, Jens
contents Speech encodes multiple simultaneous attributes -- linguistic content, speaker identity, dialect, gender -- that conventional single-vector embeddings conflate. We present a factor-partitioned embedding framework that maps each utterance into a single vector whose subspaces correspond to distinct axes of variation. A shared acoustic encoder feeds per-axis linear projection heads, each trained via distillation from a specialist teacher or a contrastive objective over shared-label pairs. The resulting embeddings support attribute-conditioned retrieval: similarity is computed as a signed weighted sum over per-axis cosine scores, allowing retrieval that jointly considers what was said and how -- or explicitly suppresses one attribute to surface another. We evaluate on cross-corpus retrieval over corpora sharing the Harvard sentence prompts, demonstrating that signed axis weighting can suppress same-speaker bias and surface semantically matched utterances across recording conditions. Code is available at: https://github.com/jimregan/spoken-sentence-transformers
format Preprint
id arxiv_https___arxiv_org_abs_2605_02804
institution arXiv
publishDate 2026
record_format arxiv
title Multi-Axis Speech Similarity via Factor-Partitioned Embeddings
topic Audio and Speech Processing
Information Retrieval
url https://arxiv.org/abs/2605.02804
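
The abstract describes similarity as a signed weighted sum over per-axis cosine scores computed on subspaces of a single vector. The following is a minimal sketch of that idea only, not the authors' implementation: the axis names, the slice boundaries in AXES, the toy vectors, and the function names are all hypothetical and chosen for illustration.

```python
import math

# Hypothetical layout: axis name -> (start, end) slice of the full embedding.
# The real framework's axes and dimensions are not given in this record.
AXES = {"content": (0, 4), "speaker": (4, 6)}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def multi_axis_similarity(u, v, weights, axes=AXES):
    """Signed weighted sum of per-axis cosine scores.

    A positive weight rewards agreement on an axis; a negative weight
    penalises it, which is how one attribute can be suppressed.
    """
    total = 0.0
    for name, weight in weights.items():
        start, end = axes[name]
        total += weight * cosine(u[start:end], v[start:end])
    return total


# Toy vectors: first four dimensions are "content", last two are "speaker".
query = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
same_speaker = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0]  # different sentence, same voice
same_content = [1.0, 0.0, 0.0, 0.0, 0.0, 1.0]  # same sentence, different voice

# Reward matching content and penalise matching speaker.
suppress_speaker = {"content": 1.0, "speaker": -1.0}
score_same_content = multi_axis_similarity(query, same_content, suppress_speaker)
score_same_speaker = multi_axis_similarity(query, same_speaker, suppress_speaker)
```

With the negative speaker weight, the candidate that shares the sentence but not the voice scores higher than the one that shares only the voice, which mirrors the same-speaker bias suppression the abstract reports. Setting the speaker weight positive instead would favour utterances that match on both axes.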