Staff View: Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025 :: Library Catalog

Bibliographic Details
Main Authors:	Kunilovskaya, Maria; Bhatia, Gagan; Albertelli, Lisa Sophie; Chen, Yanran; Greisinger, Christian; Kiefer, Lotta; Leiter, Christoph; Roy, Subhadeep; Achamaleh, Tewodros; Manzoor, Muhammad Arslan; Pohl, Sebastian; Hou, Yufang; Eger, Steffen
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2606.02255

_version_	1866917555579387904
author	Kunilovskaya, Maria; Bhatia, Gagan; Albertelli, Lisa Sophie; Chen, Yanran; Greisinger, Christian; Kiefer, Lotta; Leiter, Christoph; Roy, Subhadeep; Achamaleh, Tewodros; Manzoor, Muhammad Arslan; Pohl, Sebastian; Hou, Yufang; Eger, Steffen
author_facet	Kunilovskaya, Maria; Bhatia, Gagan; Albertelli, Lisa Sophie; Chen, Yanran; Greisinger, Christian; Kiefer, Lotta; Leiter, Christoph; Roy, Subhadeep; Achamaleh, Tewodros; Manzoor, Muhammad Arslan; Pohl, Sebastian; Hou, Yufang; Eger, Steffen
contents	Human annotation is the empirical foundation of much NLP research, from dataset construction to model evaluation, but papers often leave unclear who produced the annotations and how the annotation process was controlled. We provide the first large-scale, task-level audit of human annotation reporting across major NLP venues, asking which annotation details are documented, which are missing, and how reporting varies across time, topic, venue, and intended use of human judgment. We introduce a unified taxonomy of annotation-reporting practices and validate an LLM-assisted extraction pipeline against Annotated-gold, a human-adjudicated gold standard of 41 papers and 72 annotation tasks, where the best model reaches human-comparable agreement with adjudicated labels, with Krippendorff's alpha of 0.606 versus 0.585 for human-human agreement. Using this pipeline, we construct Annotated-llm, a dataset covering ACL-venue papers from 2018-2025, with 2,667 extracted annotation tasks from 1,603 papers, and find that papers frequently report operational details such as recruitment strategies, annotator expertise, and annotation volume, but often omit details needed to assess annotation validity, including training, language proficiency, compensation, socio-demographics, adjudication, and agreement values, especially in model-evaluation studies. Our results show that annotation reporting in NLP has improved over time but remains uneven, and they establish a scalable framework and bare-minimum reporting recommendations for making human annotation more reliable, reproducible, and interpretable.
format	Preprint
id	arxiv_https___arxiv_org_abs_2606_02255
institution	arXiv
publishDate	2026
record_format	arxiv
title	Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2606.02255