Bibliographic Details
Main Authors: Shaulov, Ariel, Shaharabany, Tal, Shaar, Eitan, Chechik, Gal, Wolf, Lior
Format: Preprint
Published: 2025
Subjects: Computation and Language, Artificial Intelligence, Sound, Audio and Speech Processing
Online Access: https://arxiv.org/abs/2501.03183
author Shaulov, Ariel
Shaharabany, Tal
Shaar, Eitan
Chechik, Gal
Wolf, Lior
contents Most current captioning systems use language models trained on data from specific settings, such as image-based captioning via Amazon Mechanical Turk, limiting their ability to generalize to other modality distributions and contexts. This limitation hinders performance in tasks like audio or video captioning, where different semantic cues are needed. Addressing this challenge is crucial for creating more adaptable and versatile captioning frameworks applicable across diverse real-world contexts. In this work, we introduce a method to adapt captioning networks to the semantics of alternative settings, such as capturing audibility in audio captioning, where it is crucial to describe sounds and their sources. Our framework consists of two main components: (i) a frozen captioning system incorporating a language model (LM), and (ii) a text classifier that guides the captioning system. The classifier is trained on a dataset automatically generated by GPT-4, using tailored prompts specifically designed to enhance key aspects of the generated captions. Importantly, the framework operates solely during inference, eliminating the need for further training of the underlying captioning model. We evaluate the framework on various models and modalities, with a focus on audio captioning, and report promising results. Notably, when combined with an existing zero-shot audio captioning system, our framework improves its quality and sets a new state of the art for the task.
format Preprint
id arxiv_https___arxiv_org_abs_2501_03183
institution arXiv
publishDate 2025
record_format arxiv
title Classifier-Guided Captioning Across Modalities
topic Computation and Language
Artificial Intelligence
Sound
Audio and Speech Processing
url https://arxiv.org/abs/2501.03183
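
The abstract describes a frozen captioning LM whose output is steered at inference time by a text classifier. The sketch below illustrates that general idea only; it is not the authors' implementation, and the abstract does not say how the two scores are combined. Everything here is a hypothetical stand-in: `TOY_LM` replaces the frozen captioner, `audibility_prob` replaces the trained classifier, and the additive log-score with a `weight` parameter is an assumed combination rule.

```python
import math

# Hypothetical stand-in for a frozen captioning LM: next-token probabilities
# conditioned only on the previous token. A real captioner would also
# condition on the audio or image input. These weights are never updated.
TOY_LM = {
    "<s>": {"a": 1.0},
    "a": {"dog": 1.0},
    "dog": {"sits": 0.6, "barks": 0.4},
    "sits": {"quietly": 0.7, "loudly": 0.3},
    "barks": {"loudly": 0.8, "quietly": 0.2},
    "quietly": {"</s>": 1.0},
    "loudly": {"</s>": 1.0},
}

# Hypothetical stand-in for the guiding text classifier: a keyword count
# squashed through a sigmoid, read as P(caption describes something audible).
AUDIBLE_WORDS = {"barks", "loudly"}


def lm_next(prefix):
    """Return the toy LM's next-token distribution for a token prefix."""
    last = prefix[-1] if prefix else "<s>"
    return TOY_LM[last]


def audibility_prob(tokens):
    """Toy classifier score in (0, 1); higher means more sound-related words."""
    n = sum(t in AUDIBLE_WORDS for t in tokens)
    return 1.0 / (1.0 + math.exp(-(2.0 * n - 1.0)))


def guided_decode(weight, max_len=10):
    """Greedy decoding with an assumed additive guidance rule:
    score(token) = log p_LM(token | prefix) + weight * log p_clf(prefix + token).
    With weight = 0 this reduces to plain greedy decoding of the frozen LM.
    """
    tokens = []
    for _ in range(max_len):
        candidates = lm_next(tokens)
        best = max(
            candidates,
            key=lambda tok: math.log(candidates[tok])
            + weight * math.log(audibility_prob(tokens + [tok])),
        )
        if best == "</s>":
            break
        tokens.append(best)
    return tokens


if __name__ == "__main__":
    print(" ".join(guided_decode(weight=0.0)))  # a dog sits quietly
    print(" ".join(guided_decode(weight=1.0)))  # a dog barks loudly
```

With `weight=0.0` the toy LM follows its own most likely path; raising the weight lets the classifier shift the choice toward sound-describing words without changing any LM parameter, which mirrors the inference-only property stated in the abstract.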