Staff View: LAMB: LLM-based Audio Captioning with Modality Gap Bridging via Cauchy-Schwarz Divergence :: Library Catalog

Bibliographic Details
Main Authors:	Lee, Hyeongkeun; Choi, Jongmin; Nam, KiHyun; Chung, Joon Son
Format:	Preprint
Published:	2026
Subjects:	Sound; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2601.04658

_version_	1866914394625015808
author	Lee, Hyeongkeun; Choi, Jongmin; Nam, KiHyun; Chung, Joon Son
author_facet	Lee, Hyeongkeun; Choi, Jongmin; Nam, KiHyun; Chung, Joon Son
contents	Automated Audio Captioning aims to describe the semantic content of input audio. Recent works have employed large language models (LLMs) as a text decoder to leverage their reasoning capabilities. However, prior approaches that project audio features into the LLM embedding space without considering cross-modal alignment fail to fully utilize these capabilities. To address this, we propose LAMB, an LLM-based audio captioning framework that bridges the modality gap between audio embeddings and the LLM text embedding space. LAMB incorporates a Cross-Modal Aligner that minimizes Cauchy-Schwarz divergence while maximizing mutual information, yielding tighter alignment between audio and text at both global and token levels. We further design a Two-Stream Adapter that extracts semantically enriched audio embeddings, thereby delivering richer information to the Cross-Modal Aligner. Finally, leveraging the aligned audio embeddings, a proposed Token Guide directly computes scores within the LLM text embedding space to steer the output logits of generated captions. Experimental results confirm that our framework strengthens the reasoning capabilities of the LLM decoder, achieving state-of-the-art performance on AudioCaps.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_04658
institution	arXiv
publishDate	2026
record_format	arxiv
title	LAMB: LLM-based Audio Captioning with Modality Gap Bridging via Cauchy-Schwarz Divergence
topic	Sound; Artificial Intelligence
url	https://arxiv.org/abs/2601.04658
