Staff View: SNAP: Speaker Nulling for Artifact Projection in Speech Deepfake Detection :: Library Catalog

Bibliographic Details
Main Authors:	Jung, Kyudan; Kim, Jihwan; Lee, Minwoo; Kim, Soyoon; Kim, Jeonghoon; Choo, Jaegul; Park, Cheonbok
Format:	Preprint
Published:	2026
Subjects:	Sound; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2603.20686

_version_	1866917356238798848
author	Jung, Kyudan; Kim, Jihwan; Lee, Minwoo; Kim, Soyoon; Kim, Jeonghoon; Choo, Jaegul; Park, Cheonbok
author_facet	Jung, Kyudan; Kim, Jihwan; Lee, Minwoo; Kim, Soyoon; Kim, Jeonghoon; Choo, Jaegul; Park, Cheonbok
contents	Recent advancements in text-to-speech technologies enable generating high-fidelity synthetic speech nearly indistinguishable from real human voices. While recent studies show the efficacy of self-supervised learning-based speech encoders for deepfake detection, these models struggle to generalize across unseen speakers. Our quantitative analysis suggests these encoder representations are substantially influenced by speaker information, causing detectors to exploit speaker-specific correlations rather than artifact-related cues. We call this phenomenon speaker entanglement. To mitigate this reliance, we introduce SNAP, a speaker-nulling framework. We estimate a speaker subspace and apply orthogonal projection to suppress speaker-dependent components, isolating synthesis artifacts within the residual features. By reducing speaker entanglement, SNAP encourages detectors to focus on artifact-related patterns, leading to state-of-the-art performance.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_20686
institution	arXiv
publishDate	2026
record_format	arxiv
title	SNAP: Speaker Nulling for Artifact Projection in Speech Deepfake Detection
topic	Sound; Artificial Intelligence
url	https://arxiv.org/abs/2603.20686
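
Illustrative note: the abstract in the contents field describes estimating a speaker subspace and applying an orthogonal projection so that only the residual features remain. The sketch below shows that general idea on toy vectors. It is not the authors' implementation: the record does not say how SNAP estimates the subspace, so this example assumes, purely for illustration, that the subspace is spanned by centered per-speaker mean embeddings. The function names speaker_basis and null_speaker and the toy data are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def speaker_basis(embeddings, speaker_ids, tol=1e-10):
    """Orthonormal basis for an assumed speaker subspace.

    Assumption (not from the record): the subspace is the span of the
    per-speaker mean embeddings after subtracting the mean of those means.
    """
    dim = len(embeddings[0])
    sums, counts = {}, {}
    for x, s in zip(embeddings, speaker_ids):
        acc = sums.setdefault(s, [0.0] * dim)
        for i, value in enumerate(x):
            acc[i] += value
        counts[s] = counts.get(s, 0) + 1
    means = [[v / counts[s] for v in sums[s]] for s in sums]
    center = [sum(m[i] for m in means) / len(means) for i in range(dim)]

    basis = []
    for m in means:
        # Gram-Schmidt: strip what the current basis already explains.
        r = [a - c for a, c in zip(m, center)]
        for b in basis:
            coeff = dot(r, b)
            r = [ri - coeff * bi for ri, bi in zip(r, b)]
        norm = math.sqrt(dot(r, r))
        if norm > tol:  # skip directions that are linearly dependent
            basis.append([ri / norm for ri in r])
    return basis


def null_speaker(x, basis):
    """Return the residual (I - B B^T) x, orthogonal to every basis vector."""
    r = list(x)
    for b in basis:
        coeff = dot(r, b)
        r = [ri - coeff * bi for ri, bi in zip(r, b)]
    return r


# Toy data: axis 0 carries speaker identity, axis 2 carries an "artifact" cue.
embeddings = [
    [1.0, 0.0, 0.0], [1.0, 0.0, 1.0],    # speaker A
    [-1.0, 0.0, 0.0], [-1.0, 0.0, 1.0],  # speaker B
]
speaker_ids = ["A", "A", "B", "B"]

basis = speaker_basis(embeddings, speaker_ids)
residual = null_speaker([1.0, 0.0, 1.0], basis)
```

In this toy setup the speaker axis is removed from the residual while the artifact axis is left untouched, which is the behavior the abstract attributes to speaker nulling. A downstream detector would then be trained on such residual features.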
