| Main Authors: | Ruiz, Tomas, Agustoslu, Tanalp, Schwemmer, Carsten |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.19744 |
Similar Items
Sign Language Sense Disambiguation
by: Grimm, Jana, et al.
Published: (2024)
BoN Appetit Team at LeWiDi-2025: Best-of-N Test-time Scaling Can Not Stomach Annotation Disagreements (Yet)
by: Ruiz, Tomas, et al.
Published: (2025)
From Ground Truth to Measurement: A Statistical Framework for Human Labeling
by: Chew, Robert, et al.
Published: (2026)
MLLM-as-a-Judge for Image Safety without Human Labeling
by: Wang, Zhenting, et al.
Published: (2024)
FreePRM: Training Process Reward Models Without Ground Truth Process Labels
by: Sun, Lin, et al.
Published: (2025)
Aligning MLLM Benchmark With Human Preferences via Structural Equation Modeling
by: Xiong, Shengwu, et al.
Published: (2025)
Information Extraction from Heterogeneous Documents without Ground Truth Labels using Synthetic Label Generation and Knowledge Distillation
by: Bhattacharyya, Aniket, et al.
Published: (2024)
When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels
by: Gautam, Sushant, et al.
Published: (2026)
Training and Evaluating with Human Label Variation: An Empirical Study
by: Kurniawan, Kemal, et al.
Published: (2025)
Information Density Principle for MLLM Benchmarks
by: Li, Chunyi, et al.
Published: (2025)
Navigating Rifts in Human-LLM Grounding: Study and Benchmark
by: Shaikh, Omar, et al.
Published: (2025)
Improve MLLM Benchmark Efficiency through Interview
by: Wen, Farong, et al.
Published: (2025)
Human Label Variation in Implicit Discourse Relation Recognition
by: Yung, Frances, et al.
Published: (2026)
On the Interplay between Human Label Variation and Model Fairness
by: Kurniawan, Kemal, et al.
Published: (2025)
Fine-grained Fallacy Detection with Human Label Variation
by: Ramponi, Alan, et al.
Published: (2025)
Rethinking Visual Neglect: Steering via Context-Preference for MLLM Hallucination Mitigation
by: Wu, Jingwen, et al.
Published: (2026)
Human-Aligned MLLM Judges for Fine-Grained Image Editing Evaluation: A Benchmark, Framework, and Analysis
by: Liu, Runzhou, et al.
Published: (2026)
Rethinking Scientific Summarization Evaluation: Grounding Explainable Metrics on Facet-aware Benchmark
by: Chen, Xiuying, et al.
Published: (2024)
Interpreting Predictive Probabilities: Model Confidence or Human Label Variation?
by: Baan, Joris, et al.
Published: (2024)
Decoupling the Effect of Chain-of-Thought Reasoning: A Human Label Variation Perspective
by: Chen, Beiduo, et al.
Published: (2026)
Revisiting Active Learning under (Human) Label Variation
by: Gruber, Cornelia, et al.
Published: (2025)
No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding
by: Krumdick, Michael, et al.
Published: (2025)
The Promises and Pitfalls of LLM Annotations in Dataset Labeling: a Case Study on Media Bias Detection
by: Horych, Tomas, et al.
Published: (2024)
Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth
by: Gur-Arieh, Yoav, et al.
Published: (2026)
VariErr NLI: Separating Annotation Error from Human Label Variation
by: Weber-Genzel, Leon, et al.
Published: (2024)
The Ecological Fallacy in Annotation: Modelling Human Label Variation goes beyond Sociodemographics
by: Orlikowski, Matthias, et al.
Published: (2023)
Threading the Needle: Reweaving Chain-of-Thought Reasoning to Explain Human Label Variation
by: Chen, Beiduo, et al.
Published: (2025)
Different Tastes of Entities: Investigating Human Label Variation in Named Entity Annotations
by: Peng, Siyao, et al.
Published: (2024)
UbuntuGuard: A Culturally-Grounded Policy Benchmark for Equitable AI Safety in African Languages
by: Abdullahi, Tassallah, et al.
Published: (2026)
GEM: Empowering MLLM for Grounded ECG Understanding with Time Series and Images
by: Lan, Xiang, et al.
Published: (2025)
Ground Truth Generation for Multilingual Historical NLP using LLMs
by: Gladstone, Clovis, et al.
Published: (2025)
MLLM-CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs
by: Kil, Jihyung, et al.
Published: (2024)
Agree, Disagree, Explain: Decomposing Human Label Variation in NLI through the Lens of Explanations
by: Hong, Pingjun, et al.
Published: (2025)
Beyond Single Ground Truth: Reference Monism as Epistemic Injustice in ASR Evaluation
by: Choi, Anna Seo Gyeong, et al.
Published: (2026)
MLLM-CTBench: A Benchmark for Continual Instruction Tuning with Reasoning Process Diagnosis
by: Guo, Haiyun, et al.
Published: (2025)
Grounded Visual Factualization: Factual Anchor-Based Finetuning for Enhancing MLLM Factual Consistency
by: Morbiato, Filippo, et al.
Published: (2025)
Ranking Large Language Models without Ground Truth
by: Dhurandhar, Amit, et al.
Published: (2024)
VIVA: A Benchmark for Vision-Grounded Decision-Making with Human Values
by: Hu, Zhe, et al.
Published: (2024)
The Consensus Trap: Dissecting Subjectivity and the "Ground Truth" Illusion in Data Annotation
by: Munir, Sheza, et al.
Published: (2026)
Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation
by: Zhang, Xinran
Published: (2026)