Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Zhou, Qihang, Gao, Binbin, Pang, Guansong, Wang, Xin, Chen, Jiming, He, Shibo
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2510.21171
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866918359851859968
author	Zhou, Qihang Gao, Binbin Pang, Guansong Wang, Xin Chen, Jiming He, Shibo
author_facet	Zhou, Qihang Gao, Binbin Pang, Guansong Wang, Xin Chen, Jiming He, Shibo
contents	Adapting CLIP for anomaly detection on unseen objects has shown strong potential in a zero-shot manner. However, existing methods typically rely on a single textual space to align with visual semantics across diverse objects and domains. The indiscriminate alignment hinders the model from accurately capturing varied anomaly semantics. We propose TokenCLIP, a token-wise adaptation framework that enables dynamic alignment between visual and learnable textual spaces for fine-grained anomaly learning. Rather than mapping all visual tokens to a single, token-agnostic textual space, TokenCLIP aligns each token with a customized textual subspace that represents its visual characteristics. Explicitly assigning a unique learnable textual space to each token is computationally intractable and prone to insufficient optimization. We instead expand the token-agnostic textual space into a set of orthogonal subspaces, and then dynamically assign each token to a subspace combination guided by semantic affinity, which jointly supports customized and efficient token-wise adaptation. To this end, we formulate dynamic alignment as an optimal transport problem, where all visual tokens in an image are transported to textual subspaces based on semantic similarity. The transport constraints of OT ensure sufficient optimization across subspaces and encourage them to focus on different semantics. Solving the problem yields a transport plan that adaptively assigns each token to semantically relevant subspaces. A top-k masking is then applied to sparsify the plan and specialize subspaces for distinct visual regions. Extensive experiments demonstrate the superiority of TokenCLIP.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_21171
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	TokenCLIP: Token-wise Prompt Learning for Zero-shot Anomaly Detection Zhou, Qihang Gao, Binbin Pang, Guansong Wang, Xin Chen, Jiming He, Shibo Computer Vision and Pattern Recognition Adapting CLIP for anomaly detection on unseen objects has shown strong potential in a zero-shot manner. However, existing methods typically rely on a single textual space to align with visual semantics across diverse objects and domains. The indiscriminate alignment hinders the model from accurately capturing varied anomaly semantics. We propose TokenCLIP, a token-wise adaptation framework that enables dynamic alignment between visual and learnable textual spaces for fine-grained anomaly learning. Rather than mapping all visual tokens to a single, token-agnostic textual space, TokenCLIP aligns each token with a customized textual subspace that represents its visual characteristics. Explicitly assigning a unique learnable textual space to each token is computationally intractable and prone to insufficient optimization. We instead expand the token-agnostic textual space into a set of orthogonal subspaces, and then dynamically assign each token to a subspace combination guided by semantic affinity, which jointly supports customized and efficient token-wise adaptation. To this end, we formulate dynamic alignment as an optimal transport problem, where all visual tokens in an image are transported to textual subspaces based on semantic similarity. The transport constraints of OT ensure sufficient optimization across subspaces and encourage them to focus on different semantics. Solving the problem yields a transport plan that adaptively assigns each token to semantically relevant subspaces. A top-k masking is then applied to sparsify the plan and specialize subspaces for distinct visual regions. Extensive experiments demonstrate the superiority of TokenCLIP.
title	TokenCLIP: Token-wise Prompt Learning for Zero-shot Anomaly Detection
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2510.21171

Similar Items