Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Healey, Jennifer; Byrum, Laurie; Akhtar, Md Nadeem; Bhargava, Surabhi; Sinha, Moumita
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2505.03053

_version_	1866910929200873472
author	Healey, Jennifer; Byrum, Laurie; Akhtar, Md Nadeem; Bhargava, Surabhi; Sinha, Moumita
author_facet	Healey, Jennifer; Byrum, Laurie; Akhtar, Md Nadeem; Bhargava, Surabhi; Sinha, Moumita
contents	LLM evaluation is challenging even in the case of base models. In real world deployments, evaluation is further complicated by the interplay of task specific prompts and experiential context. At scale, bias evaluation is often based on short context, fixed choice benchmarks that can be rapidly evaluated; however, these can lose validity when the LLMs' deployed context differs. Large scale human evaluation is often seen as too intractable and costly. Here we present our journey towards developing a semi-automated bias evaluation framework for free text responses that has human insights at its core. We discuss how we developed an operational definition of bias that helped us automate our pipeline and a methodology for classifying bias beyond multiple choice. We additionally comment on how human evaluation helped us uncover problematic templates in a bias benchmark.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_03053
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	Developing A Framework to Support Human Evaluation of Bias in Generated Free Response Text Healey, Jennifer Byrum, Laurie Akhtar, Md Nadeem Bhargava, Surabhi Sinha, Moumita Computation and Language Artificial Intelligence LLM evaluation is challenging even in the case of base models. In real world deployments, evaluation is further complicated by the interplay of task specific prompts and experiential context. At scale, bias evaluation is often based on short context, fixed choice benchmarks that can be rapidly evaluated; however, these can lose validity when the LLMs' deployed context differs. Large scale human evaluation is often seen as too intractable and costly. Here we present our journey towards developing a semi-automated bias evaluation framework for free text responses that has human insights at its core. We discuss how we developed an operational definition of bias that helped us automate our pipeline and a methodology for classifying bias beyond multiple choice. We additionally comment on how human evaluation helped us uncover problematic templates in a bias benchmark.
title	Developing A Framework to Support Human Evaluation of Bias in Generated Free Response Text
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2505.03053
