Staff View: Exploration and Evaluation of Bias in Cyberbullying Detection with Machine Learning :: Library Catalog

Bibliographic Details
Main Authors:	Root, Andrew; Jakubowski, Liam; Vanamala, Mounika
Format:	Preprint
Published:	2024
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2412.00609

_version_	1866917852921987072
author	Root, Andrew; Jakubowski, Liam; Vanamala, Mounika
author_facet	Root, Andrew; Jakubowski, Liam; Vanamala, Mounika
contents	It is well known that the usefulness of a machine learning model depends on its ability to generalize to unseen data. This study uses three popular cyberbullying datasets to explore how the data, the way it is collected, and the way it is labeled affect the resulting machine learning models. The bias introduced by differing definitions of cyberbullying and by data collection is discussed in detail. Emphasis is placed on the impact of dataset expansion methods, which use current data points to fetch and label new ones. Furthermore, explicit testing is performed to evaluate the ability of a model to generalize to unseen datasets through cross-dataset evaluation. As hypothesized, the models show a significant drop in Macro F1 score, with an average drop of 0.222. As such, this study highlights the importance of dataset curation and cross-dataset testing for creating models with real-world applicability. The experiments and other code can be found at https://github.com/rootdrew27/cyberbullying-ml.
format	Preprint
id	arxiv_https___arxiv_org_abs_2412_00609
institution	arXiv
publishDate	2024
record_format	arxiv
title	Exploration and Evaluation of Bias in Cyberbullying Detection with Machine Learning
topic	Machine Learning
url	https://arxiv.org/abs/2412.00609
