:: Library Catalog

Bibliographic Details
Main Authors:	Arzt, Varvara; Hanbury, Allan
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2411.05224

Similar Items

Relation Extraction or Pattern Matching? Unravelling the Generalisation Limits of Language Models for Biographical RE
by: Arzt, Varvara, et al.
Published: (2025)

Cloud-Based Benchmarking of Medical Image Analysis
by: Hanbury, Allan

Exploring the Latest LLMs for Leaderboard Extraction
by: Kabongo, Salomon, et al.
Published: (2024)

LEGOBench: Scientific Leaderboard Generation Benchmark
by: Singh, Shruti, et al.
Published: (2024)

Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models
by: Chen, Wenting, et al.
Published: (2025)

Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards
by: Tamber, Manveer Singh, et al.
Published: (2025)

GeoBenchX: Benchmarking LLMs in Agent Solving Multistep Geospatial Tasks
by: Krechetova, Varvara, et al.
Published: (2025)

Prompt-to-Leaderboard
by: Frick, Evan, et al.
Published: (2025)

AustroTox: A Dataset for Target-Based Austrian German Offensive Language Detection
by: Pachinger, Pia, et al.
Published: (2024)

Libra-Leaderboard: Towards Responsible AI through a Balanced Leaderboard of Safety and Capability
by: Li, Haonan, et al.
Published: (2024)

Beyond the Leaderboard: Understanding Performance Disparities in Large Language Models via Model Diffing
by: Boughorbel, Sabri, et al.
Published: (2025)

The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality
by: Cheng, Aileen, et al.
Published: (2025)

The Leaderboard Illusion
by: Singh, Shivalika, et al.
Published: (2025)

The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input
by: Jacovi, Alon, et al.
Published: (2025)

La Leaderboard: A Large Language Model Leaderboard for Spanish Varieties and Languages of Spain and Latin America
by: Grandury, María, et al.
Published: (2025)

Open Universal Arabic ASR Leaderboard
by: Wang, Yingzhi, et al.
Published: (2024)

League: Leaderboard Generation on Demand
by: Wu, Jian, et al.
Published: (2025)

When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards
by: Alzahrani, Norah, et al.
Published: (2024)

Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation
by: Srivastav, Vaibhav, et al.
Published: (2025)

User-centric Subjective Leaderboard by Customizable Reward Modeling
by: Jia, Qi, et al.
Published: (2025)

Improving LLM Leaderboards with Psychometrical Methodology
by: Federiakin, Denis
Published: (2025)

Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM
by: Xia, Chunqiu Steven, et al.
Published: (2024)

Open Ko-LLM Leaderboard: Evaluating Large Language Models in Korean with Ko-H5 Benchmark
by: Park, Chanjun, et al.
Published: (2024)

Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena
by: Myrzakhan, Aidar, et al.
Published: (2024)

AgentAtlas: Beyond Outcome Leaderboards for LLM Agents
by: Mazaheri, Parsa, et al.
Published: (2026)

Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research
by: Rinaldi, Matteo, et al.
Published: (2026)

Instruction Finetuning for Leaderboard Generation from Empirical AI Research
by: Kabongo, Salomon, et al.
Published: (2024)

Evaluating Large Language Models with Grid-Based Game Competitions: An Extensible LLM Benchmark and Leaderboard
by: Topsakal, Oguzhan, et al.
Published: (2024)

LexRel: Benchmarking Legal Relation Extraction for Chinese Civil Cases
by: Cai, Yida, et al.
Published: (2025)

Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction
by: Zhong, Tianyun, et al.
Published: (2025)

MULTI: Multimodal Understanding Leaderboard with Text and Images
by: Zhu, Zichen, et al.
Published: (2024)

The LLM Effect on IR Benchmarks: A Meta-Analysis of Effectiveness, Baselines, and Contamination
by: Staudinger, Moritz, et al.
Published: (2026)

The Trust Paradox: How CS Researchers Engage LLM Leaderboards
by: Sadeghi, Pouya, et al.
Published: (2026)

A Position Paper on the Automatic Generation of Machine Learning Leaderboards
by: Timmer, Roelien C, et al.
Published: (2025)

The Hallucinations Leaderboard -- An Open Effort to Measure Hallucinations in Large Language Models
by: Hong, Giwon, et al.
Published: (2024)

Effective Context Selection in LLM-based Leaderboard Generation: An Empirical Study
by: Kabongo, Salomon, et al.
Published: (2024)

Reliable, Reproducible, and Really Fast Leaderboards with Evalica
by: Ustalov, Dmitry
Published: (2024)

Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation
by: Kapoor, Sayash, et al.
Published: (2025)

Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards
by: Şahinuç, Furkan, et al.
Published: (2024)

CLARIN-PT-LDB: An Open LLM Leaderboard for Portuguese to assess Language, Culture and Civility
by: Silva, João, et al.
Published: (2026)