Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Camacho, José; Wasielewska, Katarzyna; Espinosa, Pablo; Fuentes-García, Marta
Format:	Preprint
Published:	2023
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2305.19770

_version_	1866909416191688704
author	Camacho, José; Wasielewska, Katarzyna; Espinosa, Pablo; Fuentes-García, Marta
author_facet	Camacho, José; Wasielewska, Katarzyna; Espinosa, Pablo; Fuentes-García, Marta
contents	Autonomous or self-driving networks are expected to provide a solution to the myriad of extremely demanding new applications with minimal human supervision. For this purpose, the community relies on the development of new Machine Learning (ML) models and techniques. However, ML can only be as good as the data it is fitted with, and data quality is an elusive concept that is difficult to assess. In this paper, we show that relatively minor modifications to a benchmark dataset (UGR'16, a flow-based real-traffic dataset for anomaly detection) have significantly more impact on model performance than the specific ML technique considered. We also show that the measured model performance is uncertain, as a result of labelling inaccuracies. Our findings illustrate that the widely adopted approach of comparing a set of models in terms of performance results (e.g., in terms of accuracy or ROC curves) may lead to incorrect conclusions when done without a proper understanding of dataset biases and sensitivity. We contribute a methodology to interpret a model response that can be useful for this understanding.
format	Preprint
id	arxiv_https___arxiv_org_abs_2305_19770
institution	arXiv
publishDate	2023
record_format	arxiv
spellingShingle	Quality In / Quality Out: Data quality more relevant than model choice in anomaly detection with the UGR'16 Camacho, José Wasielewska, Katarzyna Espinosa, Pablo Fuentes-García, Marta Machine Learning Autonomous or self-driving networks are expected to provide a solution to the myriad of extremely demanding new applications with minimal human supervision. For this purpose, the community relies on the development of new Machine Learning (ML) models and techniques. However, ML can only be as good as the data it is fitted with, and data quality is an elusive concept that is difficult to assess. In this paper, we show that relatively minor modifications to a benchmark dataset (UGR'16, a flow-based real-traffic dataset for anomaly detection) have significantly more impact on model performance than the specific ML technique considered. We also show that the measured model performance is uncertain, as a result of labelling inaccuracies. Our findings illustrate that the widely adopted approach of comparing a set of models in terms of performance results (e.g., in terms of accuracy or ROC curves) may lead to incorrect conclusions when done without a proper understanding of dataset biases and sensitivity. We contribute a methodology to interpret a model response that can be useful for this understanding.
title	Quality In / Quality Out: Data quality more relevant than model choice in anomaly detection with the UGR'16
topic	Machine Learning
url	https://arxiv.org/abs/2305.19770