
Bibliographic Details
Main Authors:	Ruiz, Tomas; Agustoslu, Tanalp; Schwemmer, Carsten
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2603.19744

Similar Items

Sign Language Sense Disambiguation
by: Grimm, Jana, et al.
Published: (2024)

BoN Appetit Team at LeWiDi-2025: Best-of-N Test-time Scaling Can Not Stomach Annotation Disagreements (Yet)
by: Ruiz, Tomas, et al.
Published: (2025)

From Ground Truth to Measurement: A Statistical Framework for Human Labeling
by: Chew, Robert, et al.
Published: (2026)

MLLM-as-a-Judge for Image Safety without Human Labeling
by: Wang, Zhenting, et al.
Published: (2024)

FreePRM: Training Process Reward Models Without Ground Truth Process Labels
by: Sun, Lin, et al.
Published: (2025)

Aligning MLLM Benchmark With Human Preferences via Structural Equation Modeling
by: Xiong, Shengwu, et al.
Published: (2025)

Information Extraction from Heterogeneous Documents without Ground Truth Labels using Synthetic Label Generation and Knowledge Distillation
by: Bhattacharyya, Aniket, et al.
Published: (2024)

When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels
by: Gautam, Sushant, et al.
Published: (2026)

Training and Evaluating with Human Label Variation: An Empirical Study
by: Kurniawan, Kemal, et al.
Published: (2025)

Information Density Principle for MLLM Benchmarks
by: Li, Chunyi, et al.
Published: (2025)

Navigating Rifts in Human-LLM Grounding: Study and Benchmark
by: Shaikh, Omar, et al.
Published: (2025)

Improve MLLM Benchmark Efficiency through Interview
by: Wen, Farong, et al.
Published: (2025)

Human Label Variation in Implicit Discourse Relation Recognition
by: Yung, Frances, et al.
Published: (2026)

On the Interplay between Human Label Variation and Model Fairness
by: Kurniawan, Kemal, et al.
Published: (2025)

Fine-grained Fallacy Detection with Human Label Variation
by: Ramponi, Alan, et al.
Published: (2025)

Rethinking Visual Neglect: Steering via Context-Preference for MLLM Hallucination Mitigation
by: Wu, Jingwen, et al.
Published: (2026)

Human-Aligned MLLM Judges for Fine-Grained Image Editing Evaluation: A Benchmark, Framework, and Analysis
by: Liu, Runzhou, et al.
Published: (2026)

Rethinking Scientific Summarization Evaluation: Grounding Explainable Metrics on Facet-aware Benchmark
by: Chen, Xiuying, et al.
Published: (2024)

Interpreting Predictive Probabilities: Model Confidence or Human Label Variation?
by: Baan, Joris, et al.
Published: (2024)

Decoupling the Effect of Chain-of-Thought Reasoning: A Human Label Variation Perspective
by: Chen, Beiduo, et al.
Published: (2026)

Revisiting Active Learning under (Human) Label Variation
by: Gruber, Cornelia, et al.
Published: (2025)

No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding
by: Krumdick, Michael, et al.
Published: (2025)

The Promises and Pitfalls of LLM Annotations in Dataset Labeling: a Case Study on Media Bias Detection
by: Horych, Tomas, et al.
Published: (2024)

Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth
by: Gur-Arieh, Yoav, et al.
Published: (2026)

VariErr NLI: Separating Annotation Error from Human Label Variation
by: Weber-Genzel, Leon, et al.
Published: (2024)

The Ecological Fallacy in Annotation: Modelling Human Label Variation goes beyond Sociodemographics
by: Orlikowski, Matthias, et al.
Published: (2023)

Threading the Needle: Reweaving Chain-of-Thought Reasoning to Explain Human Label Variation
by: Chen, Beiduo, et al.
Published: (2025)

Different Tastes of Entities: Investigating Human Label Variation in Named Entity Annotations
by: Peng, Siyao, et al.
Published: (2024)

UbuntuGuard: A Culturally-Grounded Policy Benchmark for Equitable AI Safety in African Languages
by: Abdullahi, Tassallah, et al.
Published: (2026)

GEM: Empowering MLLM for Grounded ECG Understanding with Time Series and Images
by: Lan, Xiang, et al.
Published: (2025)

Ground Truth Generation for Multilingual Historical NLP using LLMs
by: Gladstone, Clovis, et al.
Published: (2025)

MLLM-CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs
by: Kil, Jihyung, et al.
Published: (2024)

Agree, Disagree, Explain: Decomposing Human Label Variation in NLI through the Lens of Explanations
by: Hong, Pingjun, et al.
Published: (2025)

Beyond Single Ground Truth: Reference Monism as Epistemic Injustice in ASR Evaluation
by: Choi, Anna Seo Gyeong, et al.
Published: (2026)

MLLM-CTBench: A Benchmark for Continual Instruction Tuning with Reasoning Process Diagnosis
by: Guo, Haiyun, et al.
Published: (2025)

Grounded Visual Factualization: Factual Anchor-Based Finetuning for Enhancing MLLM Factual Consistency
by: Morbiato, Filippo, et al.
Published: (2025)

Ranking Large Language Models without Ground Truth
by: Dhurandhar, Amit, et al.
Published: (2024)

VIVA: A Benchmark for Vision-Grounded Decision-Making with Human Values
by: Hu, Zhe, et al.
Published: (2024)

The Consensus Trap: Dissecting Subjectivity and the "Ground Truth" Illusion in Data Annotation
by: Munir, Sheza, et al.
Published: (2026)

Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation
by: Zhang, Xinran
Published: (2026)