| Main Authors: | Nguyen, Khanh Binh; Park, Chae Jung |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2603.22732 |
| _version_ | 1866917359311126528 |
|---|---|
| author | Nguyen, Khanh Binh; Park, Chae Jung |
| author_facet | Nguyen, Khanh Binh; Park, Chae Jung |
| contents | Large-scale pre-trained image-text models exhibit robust multimodal representations, yet applying the Contrastive Language-Image Pre-training (CLIP) model to audio-visual localization remains challenging. An approach that simply replaces the classification token ([CLS]) with an audio-embedded token ([V_A]) struggles to capture semantic cues, and the fixed prompt "a photo of a [V_A]" fails to establish meaningful connections between audio embeddings and context tokens. To address these issues, we propose Sound-aware Prompt Learning (SOUPLE), which replaces fixed prompts with learnable context tokens. These tokens incorporate visual features to generate conditional context for a mask decoder, strengthening the semantic correspondence between audio and visual inputs. Experiments on VGGSound, SoundNet, and AVSBench demonstrate that SOUPLE improves localization and segmentation performance. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2603_22732 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2603.22732 |