Staff View: AuditoryHuM: Auditory Scene Label Generation and Clustering using Human-MLLM Collaboration :: Library Catalog

Bibliographic Details
Main Authors:	Zhong, Henry; Buchholz, Jörg M.; Maclaren, Julian; Carlile, Simon; Lyon, Richard F.
Format:	Preprint
Published:	2026
Subjects:	Sound
Online Access:	https://arxiv.org/abs/2602.19409

_version_	1866910029927415808
author	Zhong, Henry; Buchholz, Jörg M.; Maclaren, Julian; Carlile, Simon; Lyon, Richard F.
author_facet	Zhong, Henry; Buchholz, Jörg M.; Maclaren, Julian; Carlile, Simon; Lyon, Richard F.
contents	Manual annotation of audio datasets is labour-intensive, and it is challenging to balance label granularity with acoustic separability. We introduce AuditoryHuM, a novel framework for the unsupervised discovery and clustering of auditory scene labels using a collaborative Human-Multimodal Large Language Model (MLLM) approach. By leveraging MLLMs (Gemma and Qwen), the framework generates contextually relevant labels for audio data. To ensure label quality and mitigate hallucinations, we employ zero-shot learning techniques (Human-CLAP) to quantify the alignment between generated text labels and raw audio content. A strategically targeted human-in-the-loop intervention is then used to refine the least aligned pairs. The discovered labels are grouped into thematically cohesive clusters using an adjusted silhouette score that incorporates a penalty parameter to balance cluster cohesion and thematic granularity. Evaluated across three diverse auditory scene datasets (ADVANCE, AHEAD-DS, and TAU 2019), AuditoryHuM provides a scalable, low-cost solution for creating standardised taxonomies. This solution facilitates the training of lightweight scene recognition models deployable to edge devices, such as hearing aids and smart home assistants. Project page and code: https://github.com/Australian-Future-Hearing-Initiative
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_19409
institution	arXiv
publishDate	2026
record_format	arxiv
title	AuditoryHuM: Auditory Scene Label Generation and Clustering using Human-MLLM Collaboration
topic	Sound
url	https://arxiv.org/abs/2602.19409
