Bibliographic Details
Main Authors: Guo, Jiaxin; Chen, C. L. Philip; Li, Shuzhen; Zhang, Tong
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2502.00305
author Guo, Jiaxin
Chen, C. L. Philip
Li, Shuzhen
Zhang, Tong
contents Cold-start active learning (CSAL) selects valuable instances from an unlabeled dataset for manual annotation, providing high-quality data at a low annotation cost for label-scarce text classification. However, existing CSAL methods overlook weak classes and hard representative examples, which results in biased learning. To address these issues, this paper proposes a novel dual-diversity enhancing and uncertainty-aware (DEUCE) framework for CSAL. DEUCE first uses a pretrained language model (PLM) to efficiently extract textual representations, class predictions, and predictive uncertainty. It then constructs a Dual-Neighbor Graph (DNG) that combines information on both textual diversity and class diversity, ensuring a balanced data distribution. Finally, it propagates uncertainty information via density-based clustering to select hard representative instances. By combining dual diversity with informativeness, DEUCE selects class-balanced and hard representative data. Experiments on six NLP datasets demonstrate the superiority and efficiency of DEUCE.
format Preprint
id arxiv_https___arxiv_org_abs_2502_00305
institution arXiv
publishDate 2025
record_format arxiv
title DEUCE: Dual-diversity Enhancement and Uncertainty-awareness for Cold-start Active Learning
topic Computation and Language
Artificial Intelligence
Information Retrieval
I.2.6; I.2.7; I.5.1; H.3.1; H.3.3
url https://arxiv.org/abs/2502.00305