Bibliographic Details
Main Authors: Chou, Hsuan-Yu, Naveed, Wajiha, Zhou, Shuyan, Yang, Xiaowei
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2602.05189
_version_ 1866908814174846976
author Chou, Hsuan-Yu
Naveed, Wajiha
Zhou, Shuyan
Yang, Xiaowei
author_facet Chou, Hsuan-Yu
Naveed, Wajiha
Zhou, Shuyan
Yang, Xiaowei
contents As internet access expands, so does exposure to harmful content, increasing the need for effective moderation. Research has demonstrated that large language models (LLMs) can be used effectively for social media moderation tasks, including harmful content detection. While proprietary LLMs have been shown to outperform traditional machine learning models in zero-shot settings, the out-of-the-box capability of open-weight LLMs remains an open question. Motivated by recent developments in reasoning LLMs, we evaluate seven state-of-the-art models: four proprietary and three open-weight. Testing with real-world posts on Bluesky, moderation decisions by Bluesky Moderation Service, and annotations by two authors, we find a considerable degree of overlap between the sensitivity (81% to 97%) and specificity (91% to 100%) of the open-weight LLMs and those of the proprietary ones (72% to 98% and 93% to 99%, respectively). Additionally, our analysis reveals that specificity exceeds sensitivity for rudeness detection, but the opposite holds for intolerance and threats. Lastly, we identify inter-rater agreement across human moderators and the LLMs, highlighting considerations for deploying LLMs in both platform-scale and personalized moderation contexts. These findings show that open-weight LLMs can support privacy-preserving moderation on consumer-grade hardware and suggest new directions for designing moderation systems that balance community values with individual user preferences.
format Preprint
id arxiv_https___arxiv_org_abs_2602_05189
institution arXiv
publishDate 2026
record_format arxiv
spellingShingle Are Open-Weight LLMs Ready for Social Media Moderation? A Comparative Study on Bluesky
Chou, Hsuan-Yu
Naveed, Wajiha
Zhou, Shuyan
Yang, Xiaowei
Computation and Language
Human-Computer Interaction
Machine Learning
Social and Information Networks
title Are Open-Weight LLMs Ready for Social Media Moderation? A Comparative Study on Bluesky
topic Computation and Language
Human-Computer Interaction
Machine Learning
Social and Information Networks
url https://arxiv.org/abs/2602.05189
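
The abstract reports results as sensitivity and specificity. As a reading aid, the following is a minimal sketch of how those two metrics are conventionally computed from binary moderation labels. The function name, the variable names, and the toy labels are hypothetical illustrations; they are not taken from the paper and do not reproduce its data or evaluation code.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[bool], y_pred: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels.

    y_true: reference labels (True = harmful), e.g. human annotations.
    y_pred: model decisions for the same posts, in the same order.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # harmful, flagged
    fn = sum(1 for t, p in pairs if t and not p)      # harmful, missed
    tn = sum(1 for t, p in pairs if not t and not p)  # benign, passed
    fp = sum(1 for t, p in pairs if not t and p)      # benign, flagged

    # Sensitivity: share of harmful posts that were flagged.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: share of benign posts that were left alone.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy example with made-up labels (not data from the study).
human_labels = [True, True, True, False, False, False, False, True]
model_labels = [True, True, False, False, False, True, False, True]
sens, spec = sensitivity_specificity(human_labels, model_labels)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In this framing, higher sensitivity means fewer harmful posts slip through, while higher specificity means fewer benign posts are flagged by mistake.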