Bibliographic Details
Main Authors: Jung, Kyudan; Kim, Jihwan; Lee, Minwoo; Kim, Soyoon; Kim, Jeonghoon; Choo, Jaegul; Park, Cheonbok
Format: Preprint
Published: 2026
Subjects: Sound; Artificial Intelligence
Online Access: https://arxiv.org/abs/2603.20686
author Jung, Kyudan
Kim, Jihwan
Lee, Minwoo
Kim, Soyoon
Kim, Jeonghoon
Choo, Jaegul
Park, Cheonbok
contents Recent advancements in text-to-speech technologies enable the generation of high-fidelity synthetic speech that is nearly indistinguishable from real human voices. While recent studies show the efficacy of self-supervised learning-based speech encoders for deepfake detection, these models struggle to generalize to unseen speakers. Our quantitative analysis suggests that these encoder representations are substantially influenced by speaker information, causing detectors to exploit speaker-specific correlations rather than artifact-related cues. We call this phenomenon speaker entanglement. To mitigate this reliance, we introduce SNAP, a speaker-nulling framework. We estimate a speaker subspace and apply orthogonal projection to suppress speaker-dependent components, isolating synthesis artifacts within the residual features. By reducing speaker entanglement, SNAP encourages detectors to focus on artifact-related patterns, leading to state-of-the-art performance.
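The speaker-nulling step described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not say how the speaker subspace is estimated, so the estimator here (SVD of centered per-speaker mean embeddings) is an assumption, as are the function names `estimate_speaker_subspace` and `null_speaker`, the subspace rank `k`, and the synthetic data. It is not the authors' implementation; it only shows the generic operation of projecting features onto the orthogonal complement of an estimated subspace.

```python
import numpy as np

def estimate_speaker_subspace(features, speaker_ids, k):
    """Return a (d, k) orthonormal basis for the top-k directions of
    between-speaker variation.

    Assumed estimator (not taken from the paper): SVD of the centered
    per-speaker mean embeddings.
    """
    speakers = np.unique(speaker_ids)
    means = np.stack([features[speaker_ids == s].mean(axis=0) for s in speakers])
    centered = means - means.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal right singular vectors, sorted by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

def null_speaker(features, basis):
    """Orthogonal projection onto the complement of span(basis): x - U U^T x."""
    return features - (features @ basis) @ basis.T

# Hypothetical toy data: 8 speakers, 20 utterances each, 16-dim embeddings.
# Speaker identity lives in a 2-dim subspace; everything else is small noise.
rng = np.random.default_rng(0)
d, n_speakers, per_speaker, k = 16, 8, 20, 2
true_basis, _ = np.linalg.qr(rng.normal(size=(d, k)))
speaker_ids = np.repeat(np.arange(n_speakers), per_speaker)
offsets = 5.0 * rng.normal(size=(n_speakers, k)) @ true_basis.T
features = offsets[speaker_ids] + 0.1 * rng.normal(size=(speaker_ids.size, d))

basis = estimate_speaker_subspace(features, speaker_ids, k)
residual = null_speaker(features, basis)
```

After the projection, `residual` has no component along any column of `basis`, so a downstream detector trained on it cannot use those directions. How well this removes actual speaker information depends on how the subspace is estimated and on the choice of `k`, neither of which this record specifies.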
format Preprint
id arxiv_https___arxiv_org_abs_2603_20686
institution arXiv
publishDate 2026
record_format arxiv
title SNAP: Speaker Nulling for Artifact Projection in Speech Deepfake Detection
topic Sound
Artificial Intelligence
url https://arxiv.org/abs/2603.20686