| Main Authors: | Arzt, Varvara; Hanbury, Allan |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2411.05224 |
Similar Items
Relation Extraction or Pattern Matching? Unravelling the Generalisation Limits of Language Models for Biographical RE
by: Arzt, Varvara, et al.
Published: (2025)
Cloud-Based Benchmarking of Medical Image Analysis
by: Allan Hanbury
Exploring the Latest LLMs for Leaderboard Extraction
by: Kabongo, Salomon, et al.
Published: (2024)
LEGOBench: Scientific Leaderboard Generation Benchmark
by: Singh, Shruti, et al.
Published: (2024)
Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models
by: Chen, Wenting, et al.
Published: (2025)
Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards
by: Tamber, Manveer Singh, et al.
Published: (2025)
GeoBenchX: Benchmarking LLMs in Agent Solving Multistep Geospatial Tasks
by: Krechetova, Varvara, et al.
Published: (2025)
Prompt-to-Leaderboard
by: Frick, Evan, et al.
Published: (2025)
AustroTox: A Dataset for Target-Based Austrian German Offensive Language Detection
by: Pachinger, Pia, et al.
Published: (2024)
Libra-Leaderboard: Towards Responsible AI through a Balanced Leaderboard of Safety and Capability
by: Li, Haonan, et al.
Published: (2024)
Beyond the Leaderboard: Understanding Performance Disparities in Large Language Models via Model Diffing
by: Boughorbel, Sabri, et al.
Published: (2025)
The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality
by: Cheng, Aileen, et al.
Published: (2025)
The Leaderboard Illusion
by: Singh, Shivalika, et al.
Published: (2025)
The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input
by: Jacovi, Alon, et al.
Published: (2025)
La Leaderboard: A Large Language Model Leaderboard for Spanish Varieties and Languages of Spain and Latin America
by: Grandury, María, et al.
Published: (2025)
Open Universal Arabic ASR Leaderboard
by: Wang, Yingzhi, et al.
Published: (2024)
League: Leaderboard Generation on Demand
by: Wu, Jian, et al.
Published: (2025)
When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards
by: Alzahrani, Norah, et al.
Published: (2024)
Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation
by: Srivastav, Vaibhav, et al.
Published: (2025)
User-centric Subjective Leaderboard by Customizable Reward Modeling
by: Jia, Qi, et al.
Published: (2025)
Improving LLM Leaderboards with Psychometrical Methodology
by: Federiakin, Denis
Published: (2025)
Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM
by: Xia, Chunqiu Steven, et al.
Published: (2024)
Open Ko-LLM Leaderboard: Evaluating Large Language Models in Korean with Ko-H5 Benchmark
by: Park, Chanjun, et al.
Published: (2024)
Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena
by: Myrzakhan, Aidar, et al.
Published: (2024)
AgentAtlas: Beyond Outcome Leaderboards for LLM Agents
by: Mazaheri, Parsa, et al.
Published: (2026)
Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research
by: Rinaldi, Matteo, et al.
Published: (2026)
Instruction Finetuning for Leaderboard Generation from Empirical AI Research
by: Kabongo, Salomon, et al.
Published: (2024)
Evaluating Large Language Models with Grid-Based Game Competitions: An Extensible LLM Benchmark and Leaderboard
by: Topsakal, Oguzhan, et al.
Published: (2024)
LexRel: Benchmarking Legal Relation Extraction for Chinese Civil Cases
by: Cai, Yida, et al.
Published: (2025)
Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction
by: Zhong, Tianyun, et al.
Published: (2025)
MULTI: Multimodal Understanding Leaderboard with Text and Images
by: Zhu, Zichen, et al.
Published: (2024)
The LLM Effect on IR Benchmarks: A Meta-Analysis of Effectiveness, Baselines, and Contamination
by: Staudinger, Moritz, et al.
Published: (2026)
The Trust Paradox: How CS Researchers Engage LLM Leaderboards
by: Sadeghi, Pouya, et al.
Published: (2026)
A Position Paper on the Automatic Generation of Machine Learning Leaderboards
by: Timmer, Roelien C, et al.
Published: (2025)
The Hallucinations Leaderboard -- An Open Effort to Measure Hallucinations in Large Language Models
by: Hong, Giwon, et al.
Published: (2024)
Effective Context Selection in LLM-based Leaderboard Generation: An Empirical Study
by: Kabongo, Salomon, et al.
Published: (2024)
Reliable, Reproducible, and Really Fast Leaderboards with Evalica
by: Ustalov, Dmitry
Published: (2024)
Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation
by: Kapoor, Sayash, et al.
Published: (2025)
Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards
by: Şahinuç, Furkan, et al.
Published: (2024)
CLARIN-PT-LDB: An Open LLM Leaderboard for Portuguese to assess Language, Culture and Civility
by: Silva, João, et al.
Published: (2026)