Staff View: Teach CLIP to Develop a Number Sense for Ordinal Regression :: Library Catalog

Bibliographic Details
Main Authors:	Du, Yao; Zhai, Qiang; Dai, Weihang; Li, Xiaomeng
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2408.03574

_version_	1866917742984036352
author	Du, Yao; Zhai, Qiang; Dai, Weihang; Li, Xiaomeng
author_facet	Du, Yao; Zhai, Qiang; Dai, Weihang; Li, Xiaomeng
contents	Ordinal regression is a fundamental problem in computer vision, typically addressed with customised models trained for specific tasks. While pre-trained vision-language models (VLMs) have exhibited impressive performance on various vision tasks, their potential for ordinal regression has received less exploration. In this study, we first investigate CLIP's potential for ordinal regression, expecting that the model could generalise to different ordinal regression tasks and scenarios. Unfortunately, vanilla CLIP fails on this task, since current VLMs have a well-documented limitation in encapsulating compositional concepts such as number sense. We propose a simple yet effective method called NumCLIP to improve the quantitative understanding of VLMs. We decompose the exact image-to-number-specific-text matching problem into coarse classification and fine prediction stages. We discretise the numerical range into bins and phrase each bin with a common language concept to better leverage the available pre-trained alignment in CLIP. To account for the inherently continuous nature of ordinal regression, we propose a novel fine-grained cross-modal ranking-based regularisation loss specifically designed to keep both semantic and ordinal alignment in CLIP's feature space. Experimental results on three general ordinal regression tasks demonstrate the effectiveness of NumCLIP, with accuracy improvements of 10% and 3.83% on the historical image dating and image aesthetics assessment tasks, respectively. Code is publicly available at https://github.com/xmed-lab/NumCLIP.
format	Preprint
id	arxiv_https___arxiv_org_abs_2408_03574
institution	arXiv
publishDate	2024
record_format	arxiv
title	Teach CLIP to Develop a Number Sense for Ordinal Regression
topic	Computer Vision and Pattern Recognition; Computation and Language; Machine Learning
url	https://arxiv.org/abs/2408.03574

Similar Items