| Main Authors: | Eigler, Lukáš, Libovický, Jindřich, Hurych, David |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.09403 |
Similar Items
On the Credibility of Evaluating LLMs using Survey Questions
by: Libovický, Jindřich
Published: (2026)
Lexically Grounded Subword Segmentation
by: Libovický, Jindřich, et al.
Published: (2024)
Evaluating Morphological Plausibility of Subword Tokenization via Statistical Alignment with Morpho-Syntactic Features
by: Stephen, Abishek, et al.
Published: (2026)
Investigating the Effect of Parallel Data in the Cross-Lingual Transfer for Vision-Language Encoders
by: Manea, Andrei-Alexandru, et al.
Published: (2025)
Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography
by: Vico, Gianluca, et al.
Published: (2026)
Different Time, Different Language: Revisiting the Bias Against Non-Native Speakers in GPT Detectors
by: Ali, Adnan Al, et al.
Published: (2026)
How Gender Interacts with Political Values: A Case Study on Czech BERT Models
by: Ali, Adnan Al, et al.
Published: (2024)
Multilingual Vision-Language Models, A Survey
by: Manea, Andrei-Alexandru, et al.
Published: (2025)
CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset
by: Libovický, Jindřich, et al.
Published: (2025)
Understanding Cross-Lingual Alignment -- A Survey
by: Hämmerl, Katharina, et al.
Published: (2024)
Beyond Literal Token Overlap: Token Alignability for Multilinguality
by: Hämmerl, Katharina, et al.
Published: (2025)
Teaching LLMs at Charles University: Assignments and Activities
by: Helcl, Jindřich, et al.
Published: (2024)
Enhancing Conceptual Understanding in Multimodal Contrastive Learning through Hard Negative Samples
by: Rösch, Philipp J., et al.
Published: (2024)
Conditional Unigram Tokenization with Parallel Data
by: Vico, Gianluca, et al.
Published: (2025)
Evaluating Metrics for Safety with LLM-as-Judges
by: Clegg, Kester, et al.
Published: (2025)
Evaluation Metrics for Text Data Augmentation in NLP
by: Amadeus, Marcellus, et al.
Published: (2024)
Charles Translator: A Machine Translation System between Ukrainian and Czech
by: Popel, Martin, et al.
Published: (2024)
Multilingual Text-to-Image Generation Magnifies Gender Stereotypes and Prompt Engineering May Not Help You
by: Friedrich, Felix, et al.
Published: (2024)
An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks
by: Zhou, Xin, et al.
Published: (2025)
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
by: Wei, Hui, et al.
Published: (2024)
Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why
by: Rao, Delip, et al.
Published: (2026)
RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator
by: Tang, Zhenwei, et al.
Published: (2026)
NagaNLP: Bootstrapping NLP for Low-Resource Nagamese Creole with Human-in-the-Loop Synthetic Data
by: Maiti, Agniva, et al.
Published: (2025)
Beyond LLM-as-a-Judge: Deterministic Metrics for Multilingual Generative Text Evaluation
by: Alam, Firoj, et al.
Published: (2026)
HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment
by: Yang, Langqi, et al.
Published: (2025)
MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models
by: Son, Guijin, et al.
Published: (2024)
When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation
by: Sun, Bian, et al.
Published: (2026)
We Need to Talk About Classification Evaluation Metrics in NLP
by: Vickers, Peter, et al.
Published: (2024)
Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA
by: Belmadani, Ikram, et al.
Published: (2026)
Evaluating Scoring Bias in LLM-as-a-Judge
by: Li, Qingquan, et al.
Published: (2025)
SEAL: Can Saturated Benchmarks Be Revived by LLM-as-a-Meta-Judge?
by: Chen, Jiamin, et al.
Published: (2026)
LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks
by: Bavaresco, Anna, et al.
Published: (2024)
CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation
by: Zhu, Ziyi, et al.
Published: (2026)
LLM-as-a-Judge for Privacy Evaluation? Exploring the Alignment of Human and LLM Perceptions of Privacy in Textual Data
by: Meisenbacher, Stephen, et al.
Published: (2025)
Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy
by: Deviyani, Athiya, et al.
Published: (2025)
Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
by: Wu, Tianhao, et al.
Published: (2024)
Counting on Consensus: Selecting the Right Inter-annotator Agreement Metric for NLP Annotation and Evaluation
by: James, Joseph
Published: (2026)
Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators
by: Zhou, Yilun, et al.
Published: (2025)
Balanced Accuracy: The Right Metric for Evaluating LLM Judges -- Explained through Youden's J statistic
by: Collot, Stephane, et al.
Published: (2025)
Criterion Validity of LLM-as-Judge for Business Outcomes in Conversational Commerce
by: Chen, Liang, et al.
Published: (2026)