:: Library Catalog

Bibliographic Details
Main Author:	Messing, Solomon
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2604.11581

Similar Items

Concept-Guided Chain-of-Thought Prompting for Pairwise Comparison Scoring of Texts with Large Language Models
by: Wu, Patrick Y., et al.
Published: (2023)

Enhancing LLM-Based Data Annotation with Error Decomposition
by: Xu, Zhen, et al.
Published: (2026)

Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation
by: Fu, Lingyue, et al.
Published: (2025)

MQM-APE: Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators
by: Lu, Qingyu, et al.
Published: (2024)

ReFACT: A Benchmark for Scientific Confabulation Detection with Positional Error Annotations
by: Wang, Yindong, et al.
Published: (2025)

Comparing LLM Text Annotation Skills: A Study on Human Rights Violations in Social Media Data
by: Nemkova, Poli Apollinaire, et al.
Published: (2025)

JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors
by: Jin, Jiho, et al.
Published: (2026)

Error Span Annotation: A Balanced Approach for Human Evaluation of Machine Translation
by: Kocmi, Tom, et al.
Published: (2024)

REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities
by: Pugachev, Alexander, et al.
Published: (2025)

Remedy-R: Generative Reasoning for Machine Translation Evaluation without Error Annotations
by: Tan, Shaomu, et al.
Published: (2025)

Evaluating LLMs at Detecting Errors in LLM Responses
by: Kamoi, Ryo, et al.
Published: (2024)

Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese
by: Zaghouani, Wajdi, et al.
Published: (2026)

Benchmark^2: Systematic Evaluation of LLM Benchmarks
by: Qian, Qi, et al.
Published: (2026)

The Judge Who Never Admits: Hidden Shortcuts in LLM-based Evaluation
by: Marioriyad, Arash, et al.
Published: (2026)

Quantifying the Impact of Translation Errors on Multilingual LLM Evaluation
by: Thellmann, Klaudia-Doris, et al.
Published: (2026)

Evaluation of Geographical Distortions in Language Models
by: Decoupes, Rémy, et al.
Published: (2024)

Refining and Reusing Annotation Guidelines for LLM Annotation
by: Kim, Kon Woo, et al.
Published: (2026)

GaRAGe: A Benchmark with Grounding Annotations for RAG Evaluation
by: Sorodoc, Ionut-Teodor, et al.
Published: (2025)

Position: LLM Unlearning Benchmarks are Weak Measures of Progress
by: Thaker, Pratiksha, et al.
Published: (2024)

MedErrBench: A Fine-Grained Multilingual Benchmark for Medical Error Detection and Correction with Clinical Expert Annotations
by: Ma, Congbo, et al.
Published: (2026)

ErAConD : Error Annotated Conversational Dialog Dataset for Grammatical Error Correction
by: Yuan, Xun, et al.
Published: (2021)

Donkii: Can Annotation Error Detection Methods Find Errors in Instruction-Tuning Datasets?
by: Weber-Genzel, Leon, et al.
Published: (2023)

Evaluating the Impact of LLM-Assisted Annotation in a Perspectivized Setting: the Case of FrameNet Annotation
by: Belcavello, Frederico, et al.
Published: (2025)

Measuring and Mitigating Persona Distortions from AI Writing Assistance
by: Röttger, Paul, et al.
Published: (2026)

Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench
by: Perlitz, Yotam, et al.
Published: (2024)

An Annotated Dataset of Errors in Premodern Greek and Baselines for Detecting Them
by: Brooks, Creston, et al.
Published: (2024)

Marking: Visual Grading with Highlighting Errors and Annotating Missing Bits
by: Sonkar, Shashank, et al.
Published: (2024)

LLM-as-an-Annotator: Training Lightweight Models with LLM-Annotated Examples for Aspect Sentiment Tuple Prediction
by: Hellwig, Nils Constantin, et al.
Published: (2026)

Towards Reproducible LLM Evaluation: Quantifying Uncertainty in LLM Benchmark Scores
by: Blackwell, Robert E., et al.
Published: (2024)

LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis
by: Heo, Inbum, et al.
Published: (2026)

Measuring Faithfulness and Abstention: An Automated Pipeline for Evaluating LLM-Generated 3-ply Case-Based Legal Arguments
by: Zhang, Li, et al.
Published: (2025)

To Err Is Human; To Annotate, SILICON? Toward Robust Reproducibility in LLM Annotation
by: Cheng, Xiang, et al.
Published: (2024)

Evaluating Knowledge Generation and Self-Refinement Strategies for LLM-based Column Type Annotation
by: Korini, Keti, et al.
Published: (2025)

If in a Crowdsourced Data Annotation Pipeline, a GPT-4
by: He, Zeyu, et al.
Published: (2024)

Benchmark Transparency: Measuring the Impact of Data on Evaluation
by: Kovatchev, Venelin, et al.
Published: (2024)

TalkTag: Fine-Grained Morphosyntactic Error Annotation for Transcribed Speech
by: Venturini, Shamira, et al.
Published: (2026)

Refining Word-Based Grammatical Error Annotation for L2 Korean
by: Park, Jungyeul, et al.
Published: (2026)

Annotation Errors and NER: A Study with OntoNotes 5.0
by: Bernier-Colborne, Gabriel, et al.
Published: (2024)

Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks
by: Yin, Jiaqi, et al.
Published: (2025)

Evaluating Novelty in AI-Generated Research Plans Using Multi-Workflow LLM Pipelines
by: Saraogi, Devesh, et al.
Published: (2025)