| Main Authors: | Pröhl, Thorsten, Putzier, Erik, Zarnekow, Rüdiger |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2406.11670 |
Similar Items
NC-Bench: An LLM Benchmark for Evaluating Conversational Competence
by: Moore, Robert J., et al.
Published: (2026)
A Novel Psychometrics-Based Approach to Developing Professional Competency Benchmark for Large Language Models
by: Kardanova, Elena, et al.
Published: (2024)
Bench4KE: Benchmarking Automated Competency Question Generation
by: Lippolis, Anna Sofia, et al.
Published: (2025)
DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios
by: Wu, Junchao, et al.
Published: (2024)
AD-LLM: Benchmarking Large Language Models for Anomaly Detection
by: Yang, Tiankai, et al.
Published: (2024)
Measuring Competency, Not Performance: Item-Aware Evaluation Across Medical Benchmarks
by: Luo, Zhimeng, et al.
Published: (2025)
CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks
by: Lin, Peiqin, et al.
Published: (2026)
Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark
by: Li, Zheqing, et al.
Published: (2025)
Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment
by: Atasoy, I. F., et al.
Published: (2026)
Stereotype Detection in LLMs: A Multiclass, Explainable, and Benchmark-Driven Approach
by: Wu, Zekun, et al.
Published: (2024)
Uncovering Competency Gaps in Large Language Models and Their Benchmarks
by: Bohacek, Maty, et al.
Published: (2025)
Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks
by: Kim, Dongjun, et al.
Published: (2025)
LLM for Comparative Narrative Analysis
by: Kampen, Leo, et al.
Published: (2025)
Is Your Paper Being Reviewed by an LLM? Benchmarking AI Text Detection in Peer Review
by: Yu, Sungduk, et al.
Published: (2025)
Confidence is Not Competence
by: Sanyal, Debdeep, et al.
Published: (2025)
LiveFact: A Dynamic, Time-Aware Benchmark for LLM-Driven Fake News Detection
by: Xu, Cheng, et al.
Published: (2026)
SaudiCulture: A Benchmark for Evaluating Large Language Models Cultural Competence within Saudi Arabia
by: Ayash, Lama, et al.
Published: (2025)
CGU-ILALab at FoodBench-QA 2026: Comparing Traditional and LLM-based Approaches for Recipe Nutrient Estimation
by: Chen, Wei-Chun, et al.
Published: (2026)
HalluLens: LLM Hallucination Benchmark
by: Bang, Yejin, et al.
Published: (2025)
Who Benchmarks the Benchmarks? A Case Study of LLM Evaluation in Icelandic
by: Ingimundarson, Finnur Ágúst, et al.
Published: (2026)
Benchmarking Advanced Text Anonymisation Methods: A Comparative Study on Novel and Traditional Approaches
by: Asimopoulos, Dimitris, et al.
Published: (2024)
Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection
by: Hakimi, Ahmad Dawar, et al.
Published: (2026)
Multi-Lingual Cyber Threat Detection in Tweets/X Using ML, DL, and LLM: A Comparative Analysis
by: Murad, Saydul Akbar, et al.
Published: (2025)
How to Detect and Defeat Molecular Mirage: A Metric-Driven Benchmark for Hallucination in LLM-based Molecular Comprehension
by: Li, Hao, et al.
Published: (2025)
Can AI Freelancers Compete? Benchmarking Earnings, Reliability, and Task Success at Scale
by: Noever, David, et al.
Published: (2025)
Spanish and LLM Benchmarks: is MMLU Lost in Translation?
by: Plaza, Irene, et al.
Published: (2024)
Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards
by: Tamber, Manveer Singh, et al.
Published: (2025)
Benchmark of stylistic variation in LLM-generated texts
by: Milička, Jiří, et al.
Published: (2025)
Benchmarking and Improving LLM Robustness for Personalized Generation
by: Okite, Chimaobi, et al.
Published: (2025)
From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression
by: Cunegatti, Elia, et al.
Published: (2026)
Benchmark Test-Time Scaling of General LLM Agents
by: Li, Xiaochuan, et al.
Published: (2026)
LLM-Powered Benchmark Factory: Reliable, Generic, and Efficient
by: Yuan, Peiwen, et al.
Published: (2025)
Robust Planning with Compound LLM Architectures: An LLM-Modulo Approach
by: Gundawar, Atharva, et al.
Published: (2024)
Two Pathways to Truthfulness: On the Intrinsic Encoding of LLM Hallucinations
by: Luo, Wen, et al.
Published: (2026)
Two-dimensional early exit optimisation of LLM inference
by: Hůla, Jan, et al.
Published: (2026)
Comparing Hallucination Detection Metrics for Multilingual Generation
by: Kang, Haoqiang, et al.
Published: (2024)
Who Sees the Risk? Stakeholder Conflicts and Explanatory Policies in LLM-based Risk Assessment
by: Yadav, Srishti, et al.
Published: (2025)
Comparing Approaches to Automatic Summarization in Less-Resourced Languages
by: Palen-Michel, Chester, et al.
Published: (2025)
keepitsimple at SemEval-2025 Task 3: LLM-Uncertainty based Approach for Multilingual Hallucination Span Detection
by: Vemula, Saketh Reddy, et al.
Published: (2025)
The Base-Rate Effect on LLM Benchmark Performance: Disambiguating Test-Taking Strategies from Benchmark Performance
by: Moore, Kyle, et al.
Published: (2024)