:: Library Catalog

Bibliographic Details
Main Authors:	Pröhl, Thorsten; Putzier, Erik; Zarnekow, Rüdiger
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2406.11670

Similar Items

NC-Bench: An LLM Benchmark for Evaluating Conversational Competence
by: Moore, Robert J., et al.
Published: (2026)

A Novel Psychometrics-Based Approach to Developing Professional Competency Benchmark for Large Language Models
by: Kardanova, Elena, et al.
Published: (2024)

Bench4KE: Benchmarking Automated Competency Question Generation
by: Lippolis, Anna Sofia, et al.
Published: (2025)

DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios
by: Wu, Junchao, et al.
Published: (2024)

AD-LLM: Benchmarking Large Language Models for Anomaly Detection
by: Yang, Tiankai, et al.
Published: (2024)

Measuring Competency, Not Performance: Item-Aware Evaluation Across Medical Benchmarks
by: Luo, Zhimeng, et al.
Published: (2025)

CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks
by: Lin, Peiqin, et al.
Published: (2026)

Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark
by: Li, Zheqing, et al.
Published: (2025)

Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment
by: Atasoy, I. F., et al.
Published: (2026)

Stereotype Detection in LLMs: A Multiclass, Explainable, and Benchmark-Driven Approach
by: Wu, Zekun, et al.
Published: (2024)

Uncovering Competency Gaps in Large Language Models and Their Benchmarks
by: Bohacek, Maty, et al.
Published: (2025)

Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks
by: Kim, Dongjun, et al.
Published: (2025)

LLM for Comparative Narrative Analysis
by: Kampen, Leo, et al.
Published: (2025)

Is Your Paper Being Reviewed by an LLM? Benchmarking AI Text Detection in Peer Review
by: Yu, Sungduk, et al.
Published: (2025)

Confidence is Not Competence
by: Sanyal, Debdeep, et al.
Published: (2025)

LiveFact: A Dynamic, Time-Aware Benchmark for LLM-Driven Fake News Detection
by: Xu, Cheng, et al.
Published: (2026)

SaudiCulture: A Benchmark for Evaluating Large Language Models Cultural Competence within Saudi Arabia
by: Ayash, Lama, et al.
Published: (2025)

CGU-ILALab at FoodBench-QA 2026: Comparing Traditional and LLM-based Approaches for Recipe Nutrient Estimation
by: Chen, Wei-Chun, et al.
Published: (2026)

HalluLens: LLM Hallucination Benchmark
by: Bang, Yejin, et al.
Published: (2025)

Who Benchmarks the Benchmarks? A Case Study of LLM Evaluation in Icelandic
by: Ingimundarson, Finnur Ágúst, et al.
Published: (2026)

Benchmarking Advanced Text Anonymisation Methods: A Comparative Study on Novel and Traditional Approaches
by: Asimopoulos, Dimitris, et al.
Published: (2024)

Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection
by: Hakimi, Ahmad Dawar, et al.
Published: (2026)

Multi-Lingual Cyber Threat Detection in Tweets/X Using ML, DL, and LLM: A Comparative Analysis
by: Murad, Saydul Akbar, et al.
Published: (2025)

How to Detect and Defeat Molecular Mirage: A Metric-Driven Benchmark for Hallucination in LLM-based Molecular Comprehension
by: Li, Hao, et al.
Published: (2025)

Can AI Freelancers Compete? Benchmarking Earnings, Reliability, and Task Success at Scale
by: Noever, David, et al.
Published: (2025)

Spanish and LLM Benchmarks: is MMLU Lost in Translation?
by: Plaza, Irene, et al.
Published: (2024)

Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards
by: Tamber, Manveer Singh, et al.
Published: (2025)

Benchmark of stylistic variation in LLM-generated texts
by: Milička, Jiří, et al.
Published: (2025)

Benchmarking and Improving LLM Robustness for Personalized Generation
by: Okite, Chimaobi, et al.
Published: (2025)

From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression
by: Cunegatti, Elia, et al.
Published: (2026)

Benchmark Test-Time Scaling of General LLM Agents
by: Li, Xiaochuan, et al.
Published: (2026)

LLM-Powered Benchmark Factory: Reliable, Generic, and Efficient
by: Yuan, Peiwen, et al.
Published: (2025)

Robust Planning with Compound LLM Architectures: An LLM-Modulo Approach
by: Gundawar, Atharva, et al.
Published: (2024)

Two Pathways to Truthfulness: On the Intrinsic Encoding of LLM Hallucinations
by: Luo, Wen, et al.
Published: (2026)

Two-dimensional early exit optimisation of LLM inference
by: Hůla, Jan, et al.
Published: (2026)

Comparing Hallucination Detection Metrics for Multilingual Generation
by: Kang, Haoqiang, et al.
Published: (2024)

Who Sees the Risk? Stakeholder Conflicts and Explanatory Policies in LLM-based Risk Assessment
by: Yadav, Srishti, et al.
Published: (2025)

Comparing Approaches to Automatic Summarization in Less-Resourced Languages
by: Palen-Michel, Chester, et al.
Published: (2025)

keepitsimple at SemEval-2025 Task 3: LLM-Uncertainty based Approach for Multilingual Hallucination Span Detection
by: Vemula, Saketh Reddy, et al.
Published: (2025)

The Base-Rate Effect on LLM Benchmark Performance: Disambiguating Test-Taking Strategies from Benchmark Performance
by: Moore, Kyle, et al.
Published: (2024)