Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Wenli; Shi, Xianglong; Zhao, Sirui; Chen, Xinqi; Cheng, Guo; Xu, Yifan; Xu, Tong; Liao, Yong
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2604.08405

_version_	1866915930175438848
author	Zhang, Wenli; Shi, Xianglong; Zhao, Sirui; Chen, Xinqi; Cheng, Guo; Xu, Yifan; Xu, Tong; Liao, Yong
author_facet	Zhang, Wenli; Shi, Xianglong; Zhao, Sirui; Chen, Xinqi; Cheng, Guo; Xu, Yifan; Xu, Tong; Liao, Yong
contents	Diffusion-based audio-driven talking-head generation enables realistic portrait animation, but also introduces risks of misuse, such as fraud and misinformation. Existing protection methods are largely limited to a single modality, and neither image-only nor audio-only attacks can effectively suppress speech-driven facial dynamics. To address this gap, we propose SyncBreaker, a stage-aware multimodal protection framework that jointly perturbs portrait and audio inputs under modality-specific perceptual constraints. Our key contributions are twofold. First, for the image stream, we introduce nullifying supervision with Multi-Interval Sampling (MIS) across diffusion stages to steer the generation toward the static reference portrait by aggregating guidance from multiple denoising intervals. Second, for the audio stream, we propose Cross-Attention Fooling (CAF), which suppresses interval-specific audio-conditioned cross-attention responses. Both streams are optimized independently and combined at inference time to enable flexible deployment. We evaluate SyncBreaker in a white-box proactive protection setting. Extensive experiments demonstrate that SyncBreaker more effectively degrades lip synchronization and facial dynamics than strong single-modality baselines, while preserving input perceptual quality and remaining robust under purification. Code: https://github.com/kitty384/SyncBreaker.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_08405
institution	arXiv
publishDate	2026
record_format	arxiv
title	SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2604.08405

Similar Items