| Main Authors: | Imajo, Kentaro; Hirano, Masanori; Suzuki, Shuji; Mikami, Hiroaki |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Computation and Language |
| Online Access: | https://arxiv.org/abs/2502.09316 |
| _version_ | 1866912231270121472 |
|---|---|
| author | Imajo, Kentaro; Hirano, Masanori; Suzuki, Shuji; Mikami, Hiroaki |
| author_facet | Imajo, Kentaro; Hirano, Masanori; Suzuki, Shuji; Mikami, Hiroaki |
| contents | Evaluating the open-ended text generation of large language models (LLMs) is challenging because of the lack of a clear ground truth and the high cost of human or LLM-based assessments. We propose a novel benchmark that evaluates LLMs using n-gram statistics and rules, without relying on human judgement or LLM-as-a-judge approaches. Using 50 question and reference answer sets, we introduce three new metrics based on n-grams and rules: Fluency, Truthfulness, and Helpfulness. Our benchmark strongly correlates with GPT-4o-based evaluations while requiring significantly fewer computational resources, demonstrating its effectiveness as a scalable alternative for assessing LLMs' open-ended generation capabilities. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2502_09316 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis |
| topic | Computation and Language |
| url | https://arxiv.org/abs/2502.09316 |