Affichage MARC: :: Library Catalog

Enregistré dans:

Détails bibliographiques
Auteurs principaux:	Batra, Arnesh, Sharma, Dev, Thukral, Krish, Bhatia, Ruhani, Batra, Naman, Gautam, Aditya
Format:	Preprint
Publié:	2025
Sujets:	Sound Artificial Intelligence Computation and Language
Accès en ligne:	https://arxiv.org/abs/2512.00621
Tags:	Ajouter un tag Pas de tags, Soyez le premier à ajouter un tag!

_version_	1866911294701961216
author	Batra, Arnesh Sharma, Dev Thukral, Krish Bhatia, Ruhani Batra, Naman Gautam, Aditya
author_facet	Batra, Arnesh Sharma, Dev Thukral, Krish Bhatia, Ruhani Batra, Naman Gautam, Aditya
contents	The rapid evolution of end-to-end AI music generation poses an escalating threat to artistic authenticity and copyright, demanding detection methods that can keep pace. While foundational, existing models like SpecTTTra falter when faced with the diverse and rapidly advancing ecosystem of new generators, exhibiting significant performance drops on out-of-distribution (OOD) content. This generalization failure highlights a critical gap: the need for more challenging benchmarks and more robust detection architectures. To address this, we first introduce Melody or Machine (MoM), a new large-scale benchmark of over 130,000 songs (6,665 hours). MoM is the most diverse dataset to date, built with a mix of open and closed-source models and a curated OOD test set designed specifically to foster the development of truly generalizable detectors. Alongside this benchmark, we introduce CLAM, a novel dual-stream detection architecture. We hypothesize that subtle, machine-induced inconsistencies between vocal and instrumental elements, often imperceptible in a mixed signal, offer a powerful tell-tale sign of synthesis. CLAM is designed to test this hypothesis by employing two distinct pre-trained audio encoders (MERT and Wave2Vec2) to create parallel representations of the audio. These representations are fused by a learnable cross-aggregation module that models their inter-dependencies. The model is trained with a dual-loss objective: a standard binary cross-entropy loss for classification, complemented by a contrastive triplet loss which trains the model to distinguish between coherent and artificially mismatched stream pairings, enhancing its sensitivity to synthetic artifacts without presuming a simple feature alignment. CLAM establishes a new state-of-the-art in synthetic music forensics. It achieves an F1 score of 0.925 on our challenging MoM benchmark.
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_00621
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	Melody or Machine: Detecting Synthetic Music with Dual-Stream Contrastive Learning Batra, Arnesh Sharma, Dev Thukral, Krish Bhatia, Ruhani Batra, Naman Gautam, Aditya Sound Artificial Intelligence Computation and Language The rapid evolution of end-to-end AI music generation poses an escalating threat to artistic authenticity and copyright, demanding detection methods that can keep pace. While foundational, existing models like SpecTTTra falter when faced with the diverse and rapidly advancing ecosystem of new generators, exhibiting significant performance drops on out-of-distribution (OOD) content. This generalization failure highlights a critical gap: the need for more challenging benchmarks and more robust detection architectures. To address this, we first introduce Melody or Machine (MoM), a new large-scale benchmark of over 130,000 songs (6,665 hours). MoM is the most diverse dataset to date, built with a mix of open and closed-source models and a curated OOD test set designed specifically to foster the development of truly generalizable detectors. Alongside this benchmark, we introduce CLAM, a novel dual-stream detection architecture. We hypothesize that subtle, machine-induced inconsistencies between vocal and instrumental elements, often imperceptible in a mixed signal, offer a powerful tell-tale sign of synthesis. CLAM is designed to test this hypothesis by employing two distinct pre-trained audio encoders (MERT and Wave2Vec2) to create parallel representations of the audio. These representations are fused by a learnable cross-aggregation module that models their inter-dependencies. The model is trained with a dual-loss objective: a standard binary cross-entropy loss for classification, complemented by a contrastive triplet loss which trains the model to distinguish between coherent and artificially mismatched stream pairings, enhancing its sensitivity to synthetic artifacts without presuming a simple feature alignment. CLAM establishes a new state-of-the-art in synthetic music forensics. It achieves an F1 score of 0.925 on our challenging MoM benchmark.
title	Melody or Machine: Detecting Synthetic Music with Dual-Stream Contrastive Learning
topic	Sound Artificial Intelligence Computation and Language
url	https://arxiv.org/abs/2512.00621

Documents similaires