Bibliographic Details
Main Authors: Zhou, Hang; Mueller, Jonas; Kumar, Mayank; Wang, Jane-Ling; Lei, Jing
Format: Preprint
Published: 2023
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2305.16583
author Zhou, Hang
Mueller, Jonas
Kumar, Mayank
Wang, Jane-Ling
Lei, Jing
contents Noise plagues many numerical datasets: recorded values may fail to match the true underlying values because of faulty sensors, data entry or processing mistakes, or imperfect human estimates. We consider general regression settings with covariates and a potentially corrupted response whose observed values may contain errors. By accounting for various uncertainties, we introduce veracity scores that distinguish genuine errors from natural data fluctuations, conditioned on the available covariate information in the dataset. We propose a simple yet efficient filtering procedure for eliminating potential errors and establish theoretical guarantees for our method. We also contribute a new error detection benchmark involving 5 regression datasets with real-world numerical errors (for which the true values are also known). In this benchmark and in additional simulation studies, our method identifies incorrect values with better precision/recall than other approaches.
format Preprint
id arxiv_https___arxiv_org_abs_2305_16583
institution arXiv
publishDate 2023
record_format arxiv
title Detecting Errors in a Numerical Response via any Regression Model
topic Machine Learning
url https://arxiv.org/abs/2305.16583