Saved in:
| Main Authors: | , |
|---|---|
| Format: | Preprint |
| Published: |
2026
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.02804 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| _version_ | 1866910200089280512 |
|---|---|
| author | O'Regan, Jim Edlund, Jens |
| author_facet | O'Regan, Jim Edlund, Jens |
| contents | Speech encodes multiple simultaneous attributes -- linguistic content, speaker identity, dialect, gender --that conventional single-vector embeddings conflate. We present a factor-partitioned embedding framework that maps each utterance into a single vector whose subspaces correspond to distinct axes of variation. A shared acoustic encoder feeds per-axis linear projection heads, each trained via distillation from a specialist teacher or a contrastive objective over shared-label pairs. The resulting embeddings support attribute-conditioned retrieval: similarity is computed as a signed weighted sum over per-axis cosine scores, allowing retrieval that jointly considers what was said and how -- or explicitly suppresses one attribute to surface another. We evaluate on cross-corpus retrieval over corpora sharing the Harvard sentence prompts, demonstrating that signed axis weighting can suppress same-speaker bias and surface semantically matched utterances across recording conditions.
Code is available at: https://github.com/jimregan/spoken-sentence-transformers |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2605_02804 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| spellingShingle | Multi-Axis Speech Similarity via Factor-Partitioned Embeddings O'Regan, Jim Edlund, Jens Audio and Speech Processing Information Retrieval Speech encodes multiple simultaneous attributes -- linguistic content, speaker identity, dialect, gender --that conventional single-vector embeddings conflate. We present a factor-partitioned embedding framework that maps each utterance into a single vector whose subspaces correspond to distinct axes of variation. A shared acoustic encoder feeds per-axis linear projection heads, each trained via distillation from a specialist teacher or a contrastive objective over shared-label pairs. The resulting embeddings support attribute-conditioned retrieval: similarity is computed as a signed weighted sum over per-axis cosine scores, allowing retrieval that jointly considers what was said and how -- or explicitly suppresses one attribute to surface another. We evaluate on cross-corpus retrieval over corpora sharing the Harvard sentence prompts, demonstrating that signed axis weighting can suppress same-speaker bias and surface semantically matched utterances across recording conditions. Code is available at: https://github.com/jimregan/spoken-sentence-transformers |
| title | Multi-Axis Speech Similarity via Factor-Partitioned Embeddings |
| topic | Audio and Speech Processing Information Retrieval |
| url | https://arxiv.org/abs/2605.02804 |