Bibliographic Details
Main Authors: Root, Andrew, Jakubowski, Liam, Vanamala, Mounika
Format: Preprint
Published: 2024
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2412.00609
_version_ 1866917852921987072
author Root, Andrew
Jakubowski, Liam
Vanamala, Mounika
author_facet Root, Andrew
Jakubowski, Liam
Vanamala, Mounika
contents It is well known that the usefulness of a machine learning model depends on its ability to generalize to unseen data. This study uses three popular cyberbullying datasets to explore how the data, the way it is collected, and the way it is labeled affect the resulting machine learning models. The bias introduced by differing definitions of cyberbullying and by data collection is discussed in detail. Emphasis is placed on the impact of dataset expansion methods, which use current data points to fetch and label new ones. Furthermore, cross-dataset evaluation is performed to test explicitly how well a model generalizes to unseen datasets. As hypothesized, the models show a significant drop in Macro F1 score, averaging 0.222. This study thus highlights the importance of dataset curation and cross-dataset testing for creating models with real-world applicability. The experiments and other code can be found at https://github.com/rootdrew27/cyberbullying-ml.
format Preprint
id arxiv_https___arxiv_org_abs_2412_00609
institution arXiv
publishDate 2024
record_format arxiv
title Exploration and Evaluation of Bias in Cyberbullying Detection with Machine Learning
topic Machine Learning
url https://arxiv.org/abs/2412.00609
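
The cross-dataset evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' code (that is in the linked repository). It assumes the drop is measured as in-dataset Macro F1 minus cross-dataset Macro F1, averaged over every ordered train/test pair; the function names, dataset names "A" and "B", labels, and scores below are all hypothetical and are not taken from the study.

```python
from itertools import permutations


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances contributes 0.0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def mean_cross_dataset_drop(scores):
    """Average of (in-dataset F1 - cross-dataset F1) over ordered pairs.

    scores[(train, test)] is the Macro F1 of a model trained on `train`
    and evaluated on `test`; scores[(d, d)] is the in-dataset baseline.
    This definition of "drop" is an assumption made for illustration.
    """
    names = sorted({train for train, _ in scores})
    drops = [scores[(a, a)] - scores[(a, b)] for a, b in permutations(names, 2)]
    return sum(drops) / len(drops)


# Toy labels and scores for illustration only (not results from the study).
toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 1, 0, 1]
toy_scores = {
    ("A", "A"): 0.8, ("A", "B"): 0.6,
    ("B", "B"): 0.9, ("B", "A"): 0.5,
}

if __name__ == "__main__":
    print(f"Macro F1 on toy labels: {macro_f1(toy_true, toy_pred):.3f}")
    print(f"Mean cross-dataset drop: {mean_cross_dataset_drop(toy_scores):.3f}")
```

In practice the score table would be filled by training one model per dataset and calling `macro_f1` on each dataset's held-out test split, so that the diagonal entries give the in-dataset baselines.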