Staff View: Decision-Oriented Text Evaluation :: Library Catalog

Bibliographic Details
Main Authors:	Huang, Yu-Shiang; Wang, Chuan-Ju; Chen, Chung-Chi
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2507.01923

_version_	1866913923497721856
author	Huang, Yu-Shiang; Wang, Chuan-Ju; Chen, Chung-Chi
author_facet	Huang, Yu-Shiang; Wang, Chuan-Ju; Chen, Chung-Chi
contents	Natural language generation (NLG) is increasingly deployed in high-stakes domains, yet common intrinsic evaluation methods, such as n-gram overlap or sentence plausibility, weakly correlate with actual decision-making efficacy. We propose a decision-oriented framework for evaluating generated text by directly measuring its influence on human and large language model (LLM) decision outcomes. Using market digest texts--including objective morning summaries and subjective closing-bell analyses--as test cases, we assess decision quality based on the financial performance of trades executed by human investors and autonomous LLM agents informed exclusively by these texts. Our findings reveal that neither humans nor LLM agents consistently surpass random performance when relying solely on summaries. However, richer analytical commentaries enable collaborative human-LLM teams to outperform individual human or agent baselines significantly. Our approach underscores the importance of evaluating generated text by its ability to facilitate synergistic decision-making between humans and LLMs, highlighting critical limitations of traditional intrinsic metrics.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_01923
institution	arXiv
publishDate	2025
record_format	arxiv
title	Decision-Oriented Text Evaluation
topic	Computation and Language
url	https://arxiv.org/abs/2507.01923
