:: Library Catalog

Bibliographic Details
Main Authors:	Kovatchev, Venelin; Lease, Matthew
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2404.00748

Similar Items

Finding Pareto Trade-offs in Fair and Accurate Detection of Toxic Speech
by: Gupta, Soumyajit, et al.
Published: (2022)

Capturing Classic Authorial Style in Long-Form Story Generation with GRPO Fine-Tuning
by: Liu, Jinlong, et al.
Published: (2025)

Measuring Competency, Not Performance: Item-Aware Evaluation Across Medical Benchmarks
by: Luo, Zhimeng, et al.
Published: (2025)

Transparent Screening for LLM Inference and Training Impacts
by: Pachot, Arnault, et al.
Published: (2026)

Codenames as a Benchmark for Large Language Models
by: Stephenson, Matthew, et al.
Published: (2024)

Transparent Reference-free Automated Evaluation of Open-Ended User Survey Responses
by: An, Subin, et al.
Published: (2025)

Wrapper Boxes: Faithful Attribution of Model Predictions to Training Data
by: Su, Yiheng, et al.
Published: (2023)

WXImpactBench: A Disruptive Weather Impact Understanding Benchmark for Evaluating Large Language Models
by: Yu, Yongan, et al.
Published: (2025)

HoH: A Dynamic Benchmark for Evaluating the Impact of Outdated Information on Retrieval-Augmented Generation
by: Ouyang, Jie, et al.
Published: (2025)

Measuring the Impact of Lexical Training Data Coverage on Hallucination Detection in Large Language Models
by: Zhang, Shuo, et al.
Published: (2025)

Measuring Data Science Automation: A Survey of Evaluation Tools for AI Assistants and Agents
by: Testini, Irene, et al.
Published: (2025)

Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions
by: Murugadoss, Bhuvanashree, et al.
Published: (2024)

Who Benchmarks the Benchmarks? A Case Study of LLM Evaluation in Icelandic
by: Ingimundarson, Finnur Ágúst, et al.
Published: (2026)

Benchmarking Data Science Agents
by: Zhang, Yuge, et al.
Published: (2024)

Transparent and Coherent Procedural Mistake Detection
by: Storks, Shane, et al.
Published: (2024)

NarraBench: A Comprehensive Framework for Narrative Benchmarking
by: Hamilton, Sil, et al.
Published: (2025)

Understanding the Role of LLMs in Multimodal Evaluation Benchmarks
by: Jiang, Botian, et al.
Published: (2024)

DHP Benchmark: Are LLMs Good NLG Evaluators?
by: Wang, Yicheng, et al.
Published: (2024)

Generating Benchmarks for Factuality Evaluation of Language Models
by: Muhlgay, Dor, et al.
Published: (2023)

On Robustness and Reliability of Benchmark-Based Evaluation of LLMs
by: Lunardi, Riccardo, et al.
Published: (2025)

The Qiyas Benchmark: Measuring ChatGPT Mathematical and Language Understanding in Arabic
by: Al-Khalifa, Shahad, et al.
Published: (2024)

Measuring and Benchmarking Large Language Models' Capabilities to Generate Persuasive Language
by: Pauli, Amalie Brogaard, et al.
Published: (2024)

Measuring what Matters: Construct Validity in Large Language Model Benchmarks
by: Bean, Andrew M., et al.
Published: (2025)

RUVA: Personalized Transparent On-Device Graph Reasoning
by: Conte, Gabriele, et al.
Published: (2026)

Overestimation in LLM Evaluation: A Controlled Large-Scale Study on Data Contamination's Impact on Machine Translation
by: Kocyigit, Muhammed Yusuf, et al.
Published: (2025)

The Self-Execution Benchmark: Measuring LLMs' Attempts to Overcome Their Lack of Self-Execution
by: Ezra, Elon, et al.
Published: (2025)

QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation
by: Fu, Weiping, et al.
Published: (2024)

CEval: A Benchmark for Evaluating Counterfactual Text Generation
by: Nguyen, Van Bach, et al.
Published: (2024)

Generating Leakage-Free Benchmarks for Robust RAG Evaluation
by: Liu, Jiayi, et al.
Published: (2026)

NC-Bench: An LLM Benchmark for Evaluating Conversational Competence
by: Moore, Robert J., et al.
Published: (2026)

RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation
by: Xu, Xinnuo, et al.
Published: (2025)

Evaluating the Performance of Large Language Models on GAOKAO Benchmark
by: Zhang, Xiaotian, et al.
Published: (2023)

Kinship Data Benchmark for Multi-hop Reasoning
by: Sun, Tianda, et al.
Published: (2026)

DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle
by: Lei, Fangyu, et al.
Published: (2025)

SPARQL Query Generation with LLMs: Measuring the Impact of Training Data Memorization and Knowledge Injection
by: Gashkov, Aleksandr, et al.
Published: (2025)

An LLM Maturity Model for Reliable and Transparent Text-to-Query
by: Yu, Lei, et al.
Published: (2024)

Every Answer Matters: Evaluating Commonsense with Probabilistic Measures
by: Cheng, Qi, et al.
Published: (2024)

AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents
by: Gioacchini, Luca, et al.
Published: (2024)

Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents
by: Deng, Shihan, et al.
Published: (2024)

PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLMs
by: Yadav, Ankit, et al.
Published: (2024)