| Main Author: | Messing, Solomon |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2604.11581 |
Similar Items
Concept-Guided Chain-of-Thought Prompting for Pairwise Comparison Scoring of Texts with Large Language Models
by: Wu, Patrick Y., et al.
Published: (2023)
Enhancing LLM-Based Data Annotation with Error Decomposition
by: Xu, Zhen, et al.
Published: (2026)
Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation
by: Fu, Lingyue, et al.
Published: (2025)
MQM-APE: Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators
by: Lu, Qingyu, et al.
Published: (2024)
ReFACT: A Benchmark for Scientific Confabulation Detection with Positional Error Annotations
by: Wang, Yindong, et al.
Published: (2025)
Comparing LLM Text Annotation Skills: A Study on Human Rights Violations in Social Media Data
by: Nemkova, Poli Apollinaire, et al.
Published: (2025)
JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors
by: Jin, Jiho, et al.
Published: (2026)
Error Span Annotation: A Balanced Approach for Human Evaluation of Machine Translation
by: Kocmi, Tom, et al.
Published: (2024)
REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities
by: Pugachev, Alexander, et al.
Published: (2025)
Remedy-R: Generative Reasoning for Machine Translation Evaluation without Error Annotations
by: Tan, Shaomu, et al.
Published: (2025)
Evaluating LLMs at Detecting Errors in LLM Responses
by: Kamoi, Ryo, et al.
Published: (2024)
Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese
by: Zaghouani, Wajdi, et al.
Published: (2026)
Benchmark^2: Systematic Evaluation of LLM Benchmarks
by: Qian, Qi, et al.
Published: (2026)
The Judge Who Never Admits: Hidden Shortcuts in LLM-based Evaluation
by: Marioriyad, Arash, et al.
Published: (2026)
Quantifying the Impact of Translation Errors on Multilingual LLM Evaluation
by: Thellmann, Klaudia-Doris, et al.
Published: (2026)
Evaluation of Geographical Distortions in Language Models
by: Decoupes, Rémy, et al.
Published: (2024)
Refining and Reusing Annotation Guidelines for LLM Annotation
by: Kim, Kon Woo, et al.
Published: (2026)
GaRAGe: A Benchmark with Grounding Annotations for RAG Evaluation
by: Sorodoc, Ionut-Teodor, et al.
Published: (2025)
Position: LLM Unlearning Benchmarks are Weak Measures of Progress
by: Thaker, Pratiksha, et al.
Published: (2024)
MedErrBench: A Fine-Grained Multilingual Benchmark for Medical Error Detection and Correction with Clinical Expert Annotations
by: Ma, Congbo, et al.
Published: (2026)
ErAConD : Error Annotated Conversational Dialog Dataset for Grammatical Error Correction
by: Yuan, Xun, et al.
Published: (2021)
Donkii: Can Annotation Error Detection Methods Find Errors in Instruction-Tuning Datasets?
by: Weber-Genzel, Leon, et al.
Published: (2023)
Evaluating the Impact of LLM-Assisted Annotation in a Perspectivized Setting: the Case of FrameNet Annotation
by: Belcavello, Frederico, et al.
Published: (2025)
Measuring and Mitigating Persona Distortions from AI Writing Assistance
by: Röttger, Paul, et al.
Published: (2026)
Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench
by: Perlitz, Yotam, et al.
Published: (2024)
An Annotated Dataset of Errors in Premodern Greek and Baselines for Detecting Them
by: Brooks, Creston, et al.
Published: (2024)
Marking: Visual Grading with Highlighting Errors and Annotating Missing Bits
by: Sonkar, Shashank, et al.
Published: (2024)
LLM-as-an-Annotator: Training Lightweight Models with LLM-Annotated Examples for Aspect Sentiment Tuple Prediction
by: Hellwig, Nils Constantin, et al.
Published: (2026)
Towards Reproducible LLM Evaluation: Quantifying Uncertainty in LLM Benchmark Scores
by: Blackwell, Robert E., et al.
Published: (2024)
LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis
by: Heo, Inbum, et al.
Published: (2026)
Measuring Faithfulness and Abstention: An Automated Pipeline for Evaluating LLM-Generated 3-ply Case-Based Legal Arguments
by: Zhang, Li, et al.
Published: (2025)
To Err Is Human; To Annotate, SILICON? Toward Robust Reproducibility in LLM Annotation
by: Cheng, Xiang, et al.
Published: (2024)
Evaluating Knowledge Generation and Self-Refinement Strategies for LLM-based Column Type Annotation
by: Korini, Keti, et al.
Published: (2025)
If in a Crowdsourced Data Annotation Pipeline, a GPT-4
by: He, Zeyu, et al.
Published: (2024)
Benchmark Transparency: Measuring the Impact of Data on Evaluation
by: Kovatchev, Venelin, et al.
Published: (2024)
TalkTag: Fine-Grained Morphosyntactic Error Annotation for Transcribed Speech
by: Venturini, Shamira, et al.
Published: (2026)
Refining Word-Based Grammatical Error Annotation for L2 Korean
by: Park, Jungyeul, et al.
Published: (2026)
Annotation Errors and NER: A Study with OntoNotes 5.0
by: Bernier-Colborne, Gabriel, et al.
Published: (2024)
Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks
by: Yin, Jiaqi, et al.
Published: (2025)
Evaluating Novelty in AI-Generated Research Plans Using Multi-Workflow LLM Pipelines
by: Saraogi, Devesh, et al.
Published: (2025)