| Main Authors: | Rahimi, Akam; Afouras, Triantafyllos; Zisserman, Andrew |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Audio and Speech Processing |
| Online Access: | https://arxiv.org/abs/2501.01401 |
| _version_ | 1866912175550889984 |
|---|---|
| author | Rahimi, Akam; Afouras, Triantafyllos; Zisserman, Andrew |
| contents | We present a transformer-based architecture for separating the voice of a target speaker from multiple other speakers and ambient noise. We achieve this using two separate neural networks: (A) an enrolment network that produces speaker-specific embeddings from various combinations of audio and visual modalities; and (B) a separation network that takes both the noisy signal and the enrolment vectors as inputs and outputs the clean signal of the target speaker. The novelties are: (i) the enrolment vector can be produced from audio only, from audio-visual data (using lip movements), or from visual data alone (using lip movements from silent video); and (ii) the separation can be conditioned flexibly on multiple positive and negative enrolment vectors. We compare with previous methods and obtain superior performance. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2501_01401 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | VoiceVector: Multimodal Enrolment Vectors for Speaker Separation |
| topic | Audio and Speech Processing |
| url | https://arxiv.org/abs/2501.01401 |