Bibliographic Details
Main Authors: Lee, Hyeongkeun, Choi, Jongmin, Nam, KiHyun, Chung, Joon Son
Format: Preprint
Published: 2026
Subjects: Sound, Artificial Intelligence
Online Access: https://arxiv.org/abs/2601.04658
author Lee, Hyeongkeun
Choi, Jongmin
Nam, KiHyun
Chung, Joon Son
contents Automated Audio Captioning aims to describe the semantic content of input audio. Recent works have employed large language models (LLMs) as a text decoder to leverage their reasoning capabilities. However, prior approaches that project audio features into the LLM embedding space without considering cross-modal alignment fail to fully utilize these capabilities. To address this, we propose LAMB, an LLM-based audio captioning framework that bridges the modality gap between audio embeddings and the LLM text embedding space. LAMB incorporates a Cross-Modal Aligner that minimizes Cauchy-Schwarz divergence while maximizing mutual information, yielding tighter alignment between audio and text at both global and token levels. We further design a Two-Stream Adapter that extracts semantically enriched audio embeddings, thereby delivering richer information to the Cross-Modal Aligner. Finally, leveraging the aligned audio embeddings, a proposed Token Guide directly computes scores within the LLM text embedding space to steer the output logits of generated captions. Experimental results confirm that our framework strengthens the reasoning capabilities of the LLM decoder, achieving state-of-the-art performance on AudioCaps.
format Preprint
id arxiv_https___arxiv_org_abs_2601_04658
institution arXiv
publishDate 2026
record_format arxiv
title LAMB: LLM-based Audio Captioning with Modality Gap Bridging via Cauchy-Schwarz Divergence
topic Sound
Artificial Intelligence
url https://arxiv.org/abs/2601.04658
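
The abstract in this record names Cauchy-Schwarz divergence as the quantity the Cross-Modal Aligner minimizes, but the record itself gives no formula. As a rough illustration only, the sketch below computes a standard kernel-based sample estimate of Cauchy-Schwarz divergence between two sets of embedding vectors. The function names, the Gaussian kernel, the bandwidth `sigma`, and the toy vectors are all assumptions made for this example; none of them come from the paper, and this is not the authors' implementation.

```python
import math

def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mean_kernel(xs, ys, sigma=1.0):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def cs_divergence(xs, ys, sigma=1.0):
    """Kernel estimate of Cauchy-Schwarz divergence between two sample sets.

    D_CS = log(mean k(x, x')) + log(mean k(y, y')) - 2 * log(mean k(x, y)).
    The value is non-negative and equals zero when both sample sets coincide.
    """
    self_x = mean_kernel(xs, xs, sigma)
    self_y = mean_kernel(ys, ys, sigma)
    cross = mean_kernel(xs, ys, sigma)
    return math.log(self_x) + math.log(self_y) - 2.0 * math.log(cross)

# Hypothetical toy "audio" and "text" embeddings in two dimensions.
audio = [[0.0, 0.0], [1.0, 0.0]]
text_near = [[0.0, 0.0], [1.0, 0.0]]   # same points as the audio set
text_far = [[3.0, 3.0], [4.0, 3.0]]    # shifted away from the audio set

near_score = cs_divergence(audio, text_near)
far_score = cs_divergence(audio, text_far)
```

In this toy setup, `near_score` is zero up to floating-point error because the two sets are identical, while `far_score` is strictly positive. Driving such a score toward zero is one way to read "bridging the modality gap" between two embedding distributions.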