Bibliographic Details
Main Authors: Huang, Hongfei, Liang, Tingting, Sun, Xixi, Jin, Zikang, Yin, Yuyu
Format: Preprint
Published: 2024
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2407.06579
author Huang, Hongfei
Liang, Tingting
Sun, Xixi
Jin, Zikang
Yin, Yuyu
contents Existing research on learning with noisy labels predominantly focuses on synthetic label noise. Although synthetic noise possesses well-defined structural properties, it often fails to accurately replicate real-world noise patterns. In recent years, there has been a concerted effort to construct generalizable and controllable instance-dependent noise datasets for image classification, significantly advancing the development of noise-robust learning in this area. However, studies on noisy label learning for text classification remain scarce. To better understand label noise in real-world text classification settings, we constructed the benchmark dataset NoisyAG-News through manual annotation. Initially, we analyzed the annotated data to gather observations about real-world noise. We qualitatively and quantitatively demonstrated that real-world noisy labels adhere to instance-dependent patterns. Subsequently, we conducted comprehensive learning experiments on NoisyAG-News and its corresponding synthetic noise datasets using pre-trained language models and noise-handling techniques. Our findings reveal that while pre-trained models are resilient to synthetic noise, they struggle against instance-dependent noise, with samples of varying confusion levels showing inconsistent performance during training and testing. These real-world noise patterns pose new, significant challenges, prompting a reevaluation of noisy label handling methods. We hope that NoisyAG-News will facilitate the development and evaluation of future solutions for learning with noisy labels.
format Preprint
id arxiv_https___arxiv_org_abs_2407_06579
institution arXiv
publishDate 2024
record_format arxiv
title NoisyAG-News: A Benchmark for Addressing Instance-Dependent Noise in Text Classification
topic Computation and Language
url https://arxiv.org/abs/2407.06579