Bibliographic Details
Main Authors: Li, Zheng; Song, Yibing; Cheng, Ming-Ming; Li, Xiang; Yang, Jian
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2412.09442
_version_ 1866915399379976192
author Li, Zheng
Song, Yibing
Cheng, Ming-Ming
Li, Xiang
Yang, Jian
contents Textual-based prompt learning methods primarily employ multiple learnable soft prompts and hard class tokens in a cascading manner as text inputs, aiming to align image and text (category) spaces for downstream tasks. However, current training is restricted to aligning images with predefined known categories and cannot be associated with unknown categories. In this work, we propose utilizing universal attributes as a bridge to enhance the alignment between images and unknown categories. Specifically, we introduce an Attribute-anchored Textual Prompt learning method for vision-language models, named ATPrompt. This approach expands the learning space of soft prompts from the original one-dimensional category level into the multi-dimensional attribute level by incorporating multiple attribute tokens into the learnable soft prompts. Through this modification, we transform the text prompt from a category-centric form to an attribute-category hybrid form. Additionally, we introduce a straightforward differentiable attribute search method to identify representative and suitable attributes for downstream tasks. As an easy-to-use plug-in technique, ATPrompt can seamlessly replace the existing basic prompt format in textual-based methods, providing general improvements at a negligible computational cost. Extensive experiments across 11 datasets validate the effectiveness of our method. Code is publicly available at https://github.com/zhengli97/ATPrompt.
format Preprint
id arxiv_https___arxiv_org_abs_2412_09442
institution arXiv
publishDate 2024
record_format arxiv
title Advancing Textual Prompt Learning with Anchored Attributes
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2412.09442
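
The shift from a category-centric prompt to an attribute-category hybrid prompt, as described in the abstract, can be illustrated with a small token-layout sketch. This is not the authors' implementation (that lives in the linked repository). The function names, the `<soft_i>` placeholder labels, the number of soft tokens per slot, and the exact ordering of soft tokens around each attribute are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import List


def category_prompt(class_name: str, n_soft: int = 4) -> List[str]:
    """Category-centric form: learnable soft tokens, then the hard class token.

    The "<soft_i>" strings are hypothetical placeholders standing in for
    learnable embedding vectors; they are not real vocabulary tokens.
    """
    return [f"<soft_{i}>" for i in range(n_soft)] + [class_name]


def attribute_category_prompt(
    class_name: str,
    attributes: List[str],
    n_soft_per_slot: int = 2,
) -> List[str]:
    """Attribute-category hybrid form (assumed layout).

    Each fixed attribute token is preceded by its own group of soft tokens,
    and a final group of soft tokens precedes the class token.
    """
    tokens: List[str] = []
    counter = 0
    # One slot per attribute, plus one final slot for the class token.
    for hard_token in list(attributes) + [class_name]:
        for _ in range(n_soft_per_slot):
            tokens.append(f"<soft_{counter}>")
            counter += 1
        tokens.append(hard_token)
    return tokens


if __name__ == "__main__":
    print(category_prompt("cat"))
    print(attribute_category_prompt("cat", ["color", "shape"]))
```

In this sketch only the soft tokens would be trained; the attribute and class tokens stay fixed. Which attributes to use (here the made-up pair "color" and "shape") is what the abstract's differentiable attribute search is meant to decide per downstream task.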