MARC View :: Library Catalog

Bibliographic Details
Main Authors:	Li, Xiquan; Xu, Xuenan; Ma, Ziyang; Chen, Wenxi; He, Haolin; Kong, Qiuqiang; Chen, Xie
Format:	Preprint
Published:	2026
Subjects:	Sound
Online Access:	https://arxiv.org/abs/2604.01155

_version_	1866915906284683264
author	Li, Xiquan; Xu, Xuenan; Ma, Ziyang; Chen, Wenxi; He, Haolin; Kong, Qiuqiang; Chen, Xie
author_facet	Li, Xiquan; Xu, Xuenan; Ma, Ziyang; Chen, Wenxi; He, Haolin; Kong, Qiuqiang; Chen, Xie
contents	Contrastively pretrained audio-language models (e.g., CLAP) excel at clip-level understanding but struggle with frame-level tasks. Existing extensions fail to exploit the varying granularity of real-world audio-text data, where massive clip-level textual descriptions coexist with limited frame-level annotations. This paper proposes Fine-grained Language-Audio Pretraining (FineLAP), a novel training paradigm that advances both clip- and frame-level alignment in CLAP with heterogeneous data. FineLAP introduces a dual-stream sigmoid loss with a cluster-based sampling strategy to jointly learn from clip- and frame-level supervision. To capture both global semantics and local details, FineLAP uses a decoupled audio projector on top of a self-supervised encoder. To alleviate the scarcity of temporally annotated data, we present FineLAP-100k, a large-scale synthetic SED dataset constructed through a scalable curation pipeline. Extensive experiments demonstrate that FineLAP achieves SOTA performance across multiple audio understanding tasks, including retrieval, classification, sound event detection, and text-to-audio grounding. Ablation studies further show that coarse- and fine-grained alignment are mutually beneficial, providing insights for building better audio-language models (ALMs).
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_01155
institution	arXiv
publishDate	2026
record_format	arxiv
title	FineLAP: Taming Heterogeneous Supervision for Fine-grained Language-Audio Pretraining
topic	Sound
url	https://arxiv.org/abs/2604.01155
