Staff View: GrowLoop: Self-Evolving Conversation Evaluation Seeded by Human :: Library Catalog

Bibliographic Details
Main Authors:	Lin, Yihang; Gao, Yunze; Lin, Zeyang; Li, Dongbo; Peng, Kun; Song, Chenglong; Liu, Yue
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Artificial Intelligence; Sound
Online Access:	https://arxiv.org/abs/2605.28882

_version_	1866913168604790784
author	Lin, Yihang; Gao, Yunze; Lin, Zeyang; Li, Dongbo; Peng, Kun; Song, Chenglong; Liu, Yue
author_facet	Lin, Yihang; Gao, Yunze; Lin, Zeyang; Li, Dongbo; Peng, Kun; Song, Chenglong; Liu, Yue
contents	With the rapid advancement of large language models, evaluating human-likeness in open-ended conversation has become increasingly important. However, human-likeness is a form of tacit knowledge that humans perceive intuitively, yet the underlying criteria resist explicit formulation. Human judgments vary widely, with strong agreement on some cases and legitimate disagreement on others. Meanwhile, the criteria behind human judgments remain implicit, leaving no clear basis for constructing cases. Further, what counts as human-like is not static, but evolving with model capability and human expectations. Despite progress in evaluation methods such as expert-authored benchmarks, Reward Models, and self-evolving benchmarks, none addresses all three challenges simultaneously. Therefore, we propose GrowLoop, a self-evolving conversation evaluation system that continuously adapts as models advance and scenarios shift. With minimal human seed annotations as the first mover, LLM agents iteratively extract and refine evaluation rubrics through Heuristic Learning. Human-AI agreement is required where annotators converge, while only plausibility is expected where they diverge. Moreover, the Rubric-Case co-evolution mechanism enables continuous evolution, expanded through new seeds when the evaluation target moves. Applied to human-likeness evaluation in open-ended conversation, the generated rubrics not only substantially outperform existing methods in alignment with human judgments, but also uncover issues that annotators overlook. The resulting benchmark effectively discriminates models across capability tiers and reveals where they fall short, while generalizing to new scenarios and adapting as models advance. Our work shifts the benchmarking paradigm from manual updates or difficulty scaling to comprehensive, continuous self-evolution.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_28882
institution	arXiv
publishDate	2026
record_format	arxiv
title	GrowLoop: Self-Evolving Conversation Evaluation Seeded by Human
topic	Computation and Language; Artificial Intelligence; Sound
url	https://arxiv.org/abs/2605.28882
