Bibliographic Details
Main Authors: Gonzalez-Jimenez, Alvaro; Gröger, Fabian; Wermelinger, Linda; Bürli, Andrin; Kastanis, Iason; Lionetti, Simone; Pouly, Marc
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2509.26291
Table of Contents:
  • Data quality issues such as off-topic samples, near duplicates, and label errors often limit the performance of audio-based systems. This paper addresses these issues by adapting SelfClean, a representation-to-rank data auditing framework, from the image domain to the audio domain. The approach uses self-supervised audio representations to identify common data quality issues and produces ranked review lists that surface each issue type within a single, unified process. The method is benchmarked on ESC-50, GTZAN, and a proprietary industrial dataset, using both synthetic and naturally occurring corruptions. The results demonstrate that this framework achieves state-of-the-art ranking performance, often outperforming issue-specific baselines and enabling significant annotation savings by efficiently guiding human review.
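
The representation-to-rank idea in the summary can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes each audio clip has already been mapped to an embedding vector by some self-supervised encoder, and it uses two illustrative scores that are assumptions of this example: cosine distance between pairs for near duplicates, and mean distance to the k nearest neighbours for off-topic samples. The function names and the toy vectors are hypothetical, and label-error ranking is not covered.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_near_duplicates(embeddings):
    """Return all index pairs, most similar pair first (assumed score)."""
    pairs = combinations(range(len(embeddings)), 2)
    return sorted(
        pairs,
        key=lambda p: cosine_distance(embeddings[p[0]], embeddings[p[1]]),
    )


def rank_off_topic(embeddings, k=2):
    """Return sample indices, most isolated first (assumed score).

    Isolation is the mean cosine distance to the k nearest neighbours.
    Requires at least two samples.
    """
    def isolation(i):
        dists = sorted(
            cosine_distance(embeddings[i], embeddings[j])
            for j in range(len(embeddings))
            if j != i
        )
        nearest = dists[:k]
        return sum(nearest) / len(nearest)

    return sorted(range(len(embeddings)), key=isolation, reverse=True)


# Toy 2-D "embeddings", made up for illustration only.
toy = [
    [1.0, 0.0],
    [0.99, 0.05],  # nearly the same direction as sample 0
    [0.9, 0.3],
    [-1.0, 0.2],   # points away from every other sample
]
duplicate_ranking = rank_near_duplicates(toy)
off_topic_ranking = rank_off_topic(toy, k=2)
```

A reviewer would inspect each list from the top and stop once the flagged items no longer look problematic, which is where the annotation savings mentioned in the summary would come from.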