Staff View: VoiceVector: Multimodal Enrolment Vectors for Speaker Separation :: Library Catalog

Bibliographic Details
Main Authors:	Rahimi, Akam; Afouras, Triantafyllos; Zisserman, Andrew
Format:	Preprint
Published:	2025
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2501.01401

_version_	1866912175550889984
author	Rahimi, Akam; Afouras, Triantafyllos; Zisserman, Andrew
author_facet	Rahimi, Akam; Afouras, Triantafyllos; Zisserman, Andrew
contents	We present a transformer-based architecture for voice separation of a target speaker from multiple other speakers and ambient noise. We achieve this by using two separate neural networks: (A) An enrolment network designed to craft speaker-specific embeddings, exploiting various combinations of audio and visual modalities; and (B) A separation network that accepts both the noisy signal and enrolment vectors as inputs, outputting the clean signal of the target speaker. The novelties are: (i) the enrolment vector can be produced from: audio only, audio-visual data (using lip movements) or visual data alone (using lip movements from silent video); and (ii) the flexibility in conditioning the separation on multiple positive and negative enrolment vectors. We compare with previous methods and obtain superior performance.
format	Preprint
id	arxiv_https___arxiv_org_abs_2501_01401
institution	arXiv
publishDate	2025
record_format	arxiv
title	VoiceVector: Multimodal Enrolment Vectors for Speaker Separation
topic	Audio and Speech Processing
url	https://arxiv.org/abs/2501.01401

Similar Items