Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Wang, Tsai-Ning, Chen, Lin-Lin, Zeghidour, Neil, Saeed, Aaqib
Format:	Preprint
Published:	2025
Subjects:	Sound Artificial Intelligence
Online Access:	https://arxiv.org/abs/2512.04847
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866915941395202048
author	Wang, Tsai-Ning Chen, Lin-Lin Zeghidour, Neil Saeed, Aaqib
author_facet	Wang, Tsai-Ning Chen, Lin-Lin Zeghidour, Neil Saeed, Aaqib
contents	Pre-trained audio models excel at detecting acoustic patterns in auscultation sounds but often fail to grasp their clinical significance, limiting their use and performance in diagnostic tasks. To bridge this gap, we introduce AcuLa (Audio-Clinical Understanding via Language Alignment), a lightweight post-training framework that instills semantic understanding into any audio encoder by aligning it with a medical language model, which acts as a "semantic teacher." To enable alignment at scale, we construct a large-scale dataset by leveraging off-the-shelf large language models to translate the rich, structured metadata accompanying existing audio recordings into coherent clinical reports. Our alignment strategy combines a representation-level contrastive objective with a self-supervised modeling, ensuring that the model learns clinical semantics while preserving fine-grained temporal cues. AcuLa achieves state-of-the-art results across 18 diverse cardio-respiratory tasks from 10 different datasets, improving the mean AUROC on classification benchmarks from 0.68 to 0.79 and, on the most challenging COVID-19 cough detection task, boosting the AUROC from 0.55 to 0.89. Our work demonstrates that this audio-language alignment transforms purely acoustic models into clinically-aware diagnostic tools, establishing a novel paradigm for enhancing physiological understanding in audio-based health monitoring.
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_04847
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding Wang, Tsai-Ning Chen, Lin-Lin Zeghidour, Neil Saeed, Aaqib Sound Artificial Intelligence Pre-trained audio models excel at detecting acoustic patterns in auscultation sounds but often fail to grasp their clinical significance, limiting their use and performance in diagnostic tasks. To bridge this gap, we introduce AcuLa (Audio-Clinical Understanding via Language Alignment), a lightweight post-training framework that instills semantic understanding into any audio encoder by aligning it with a medical language model, which acts as a "semantic teacher." To enable alignment at scale, we construct a large-scale dataset by leveraging off-the-shelf large language models to translate the rich, structured metadata accompanying existing audio recordings into coherent clinical reports. Our alignment strategy combines a representation-level contrastive objective with a self-supervised modeling, ensuring that the model learns clinical semantics while preserving fine-grained temporal cues. AcuLa achieves state-of-the-art results across 18 diverse cardio-respiratory tasks from 10 different datasets, improving the mean AUROC on classification benchmarks from 0.68 to 0.79 and, on the most challenging COVID-19 cough detection task, boosting the AUROC from 0.55 to 0.89. Our work demonstrates that this audio-language alignment transforms purely acoustic models into clinically-aware diagnostic tools, establishing a novel paradigm for enhancing physiological understanding in audio-based health monitoring.
title	Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding
topic	Sound Artificial Intelligence
url	https://arxiv.org/abs/2512.04847

Similar Items