Staff View: A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis :: Library Catalog

Bibliographic Details
Main Authors:	Imajo, Kentaro; Hirano, Masanori; Suzuki, Shuji; Mikami, Hiroaki
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2502.09316

_version_	1866912231270121472
author	Imajo, Kentaro; Hirano, Masanori; Suzuki, Shuji; Mikami, Hiroaki
author_facet	Imajo, Kentaro; Hirano, Masanori; Suzuki, Shuji; Mikami, Hiroaki
contents	Evaluating the open-ended text generation of large language models (LLMs) is challenging because of the lack of a clear ground truth and the high cost of human or LLM-based assessments. We propose a novel benchmark that evaluates LLMs using n-gram statistics and rules, without relying on human judgement or LLM-as-a-judge approaches. Using 50 question and reference answer sets, we introduce three new metrics based on n-grams and rules: Fluency, Truthfulness, and Helpfulness. Our benchmark strongly correlates with GPT-4o-based evaluations while requiring significantly fewer computational resources, demonstrating its effectiveness as a scalable alternative for assessing LLMs' open-ended generation capabilities.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_09316
institution	arXiv
publishDate	2025
record_format	arxiv
title	A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis
topic	Computation and Language
url	https://arxiv.org/abs/2502.09316

Similar Items