Staff View: CMAP: Cross-Modal Adaptive Prompting for Multi-Domain Task-Incremental Learning :: Library Catalog

Bibliographic Details
Main Author:	Mandalika, Sriram
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Computation and Language; Emerging Technologies
Online Access:	https://arxiv.org/abs/2605.25708

_version_	1866918521344098304
author	Mandalika, Sriram
author_facet	Mandalika, Sriram
contents	Multi-domain task-incremental learning requires a model to sequentially acquire knowledge across visually diverse domains without forgetting prior tasks, and without access to task identity at inference. Parameter-efficient methods built on frozen vision-language models have made strong progress, yet all existing approaches rely exclusively on visual features for task routing, confidence estimation, and encoder adaptation, leaving CLIP's cross-modal text embedding space entirely unexploited. We address this gap through three contributions. Text-space task routing replaces visual Gaussian matching with cosine similarity to frozen CLIP text prototypes, giving order-independent routing robust to data scarcity at zero parameter cost. Multi-prototype visual-textual confidence replaces single-Gaussian class modeling with K-means visual prototypes and cross-modal alignment scores under task-calibrated thresholds. Symmetric cross-modal gating extends per-layer Gumbel gates to the text encoder conditioned on batch image features, preserving cross-modal alignment on out-of-distribution inputs. On the MTIL benchmark spanning 11 datasets and 1201 classes, our method achieves 74.2% Transfer, 80.5% Average, and 88.7% Last under Order-I, surpassing the prior state of the art by 5.0, 3.7, and 3.0 percentage points with only 2.5M trainable parameters and no external data.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_25708
institution	arXiv
publishDate	2026
record_format	arxiv
title	CMAP: Cross-Modal Adaptive Prompting for Multi-Domain Task-Incremental Learning
topic	Computer Vision and Pattern Recognition; Computation and Language; Emerging Technologies
url	https://arxiv.org/abs/2605.25708

Similar Items