Staff View: SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts :: Library Catalog

Bibliographic Details
Main Authors:	Nguyen, Khanh Binh; Park, Chae Jung
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2603.22732

_version_	1866917359311126528
author	Nguyen, Khanh Binh; Park, Chae Jung
author_facet	Nguyen, Khanh Binh; Park, Chae Jung
contents	Large-scale pre-trained image-text models exhibit robust multimodal representations, yet applying the Contrastive Language-Image Pre-training (CLIP) model to audio-visual localization remains challenging. Replacing the classification token ([CLS]) with an audio-embedded token ([V_A]) struggles to capture semantic cues, and the prompt "a photo of a [V_A]" fails to establish meaningful connections between audio embeddings and context tokens. To address these issues, we propose Sound-aware Prompt Learning (SOUPLE), which replaces fixed prompts with learnable context tokens. These tokens incorporate visual features to generate conditional context for a mask decoder, effectively bridging semantic correspondence between audio and visual inputs. Experiments on VGGSound, SoundNet, and AVSBench demonstrate that SOUPLE improves localization and segmentation performance.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_22732
institution	arXiv
publishDate	2026
record_format	arxiv
title	SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2603.22732

Similar Items