Library Catalog

Bibliographic Details
Main Authors:	Eigler, Lukáš; Libovický, Jindřich; Hurych, David
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2603.09403

Similar Items

On the Credibility of Evaluating LLMs using Survey Questions
by: Libovický, Jindřich
Published: (2026)

Lexically Grounded Subword Segmentation
by: Libovický, Jindřich, et al.
Published: (2024)

Evaluating Morphological Plausibility of Subword Tokenization via Statistical Alignment with Morpho-Syntactic Features
by: Stephen, Abishek, et al.
Published: (2026)

Investigating the Effect of Parallel Data in the Cross-Lingual Transfer for Vision-Language Encoders
by: Manea, Andrei-Alexandru, et al.
Published: (2025)

Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography
by: Vico, Gianluca, et al.
Published: (2026)

Different Time, Different Language: Revisiting the Bias Against Non-Native Speakers in GPT Detectors
by: Ali, Adnan Al, et al.
Published: (2026)

How Gender Interacts with Political Values: A Case Study on Czech BERT Models
by: Ali, Adnan Al, et al.
Published: (2024)

Multilingual Vision-Language Models, A Survey
by: Manea, Andrei-Alexandru, et al.
Published: (2025)

CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset
by: Libovický, Jindřich, et al.
Published: (2025)

Understanding Cross-Lingual Alignment -- A Survey
by: Hämmerl, Katharina, et al.
Published: (2024)

Beyond Literal Token Overlap: Token Alignability for Multilinguality
by: Hämmerl, Katharina, et al.
Published: (2025)

Teaching LLMs at Charles University: Assignments and Activities
by: Helcl, Jindřich, et al.
Published: (2024)

Enhancing Conceptual Understanding in Multimodal Contrastive Learning through Hard Negative Samples
by: Rösch, Philipp J., et al.
Published: (2024)

Conditional Unigram Tokenization with Parallel Data
by: Vico, Gianluca, et al.
Published: (2025)

Evaluating Metrics for Safety with LLM-as-Judges
by: Clegg, Kester, et al.
Published: (2025)

Evaluation Metrics for Text Data Augmentation in NLP
by: Amadeus, Marcellus, et al.
Published: (2024)

Charles Translator: A Machine Translation System between Ukrainian and Czech
by: Popel, Martin, et al.
Published: (2024)

Multilingual Text-to-Image Generation Magnifies Gender Stereotypes and Prompt Engineering May Not Help You
by: Friedrich, Felix, et al.
Published: (2024)

An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks
by: Zhou, Xin, et al.
Published: (2025)

Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
by: Wei, Hui, et al.
Published: (2024)

Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why
by: Rao, Delip, et al.
Published: (2026)

RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator
by: Tang, Zhenwei, et al.
Published: (2026)

NagaNLP: Bootstrapping NLP for Low-Resource Nagamese Creole with Human-in-the-Loop Synthetic Data
by: Maiti, Agniva, et al.
Published: (2025)

Beyond LLM-as-a-Judge: Deterministic Metrics for Multilingual Generative Text Evaluation
by: Alam, Firoj, et al.
Published: (2026)

HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment
by: Yang, Langqi, et al.
Published: (2025)

MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models
by: Son, Guijin, et al.
Published: (2024)

When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation
by: Sun, Bian, et al.
Published: (2026)

We Need to Talk About Classification Evaluation Metrics in NLP
by: Vickers, Peter, et al.
Published: (2024)

Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA
by: Belmadani, Ikram, et al.
Published: (2026)

Evaluating Scoring Bias in LLM-as-a-Judge
by: Li, Qingquan, et al.
Published: (2025)

SEAL: Can Saturated Benchmarks Be Revived by LLM-as-a-Meta-Judge?
by: Chen, Jiamin, et al.
Published: (2026)

LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks
by: Bavaresco, Anna, et al.
Published: (2024)

CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation
by: Zhu, Ziyi, et al.
Published: (2026)

LLM-as-a-Judge for Privacy Evaluation? Exploring the Alignment of Human and LLM Perceptions of Privacy in Textual Data
by: Meisenbacher, Stephen, et al.
Published: (2025)

Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy
by: Deviyani, Athiya, et al.
Published: (2025)

Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
by: Wu, Tianhao, et al.
Published: (2024)

Counting on Consensus: Selecting the Right Inter-annotator Agreement Metric for NLP Annotation and Evaluation
by: James, Joseph
Published: (2026)

Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators
by: Zhou, Yilun, et al.
Published: (2025)

Balanced Accuracy: The Right Metric for Evaluating LLM Judges -- Explained through Youden's J statistic
by: Collot, Stephane, et al.
Published: (2025)

Criterion Validity of LLM-as-Judge for Business Outcomes in Conversational Commerce
by: Chen, Liang, et al.
Published: (2026)