Staff View: Beyond the Surface: Measuring Self-Preference in LLM Judgments :: Library Catalog

Bibliographic Details
Main Authors:	Chen, Zhi-Yuan; Wang, Hao; Zhang, Xinyu; Hu, Enrui; Lin, Yankai
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2506.02592

_version_	1866913872176218112
author	Chen, Zhi-Yuan; Wang, Hao; Zhang, Xinyu; Hu, Enrui; Lin, Yankai
author_facet	Chen, Zhi-Yuan; Wang, Hao; Zhang, Xinyu; Hu, Enrui; Lin, Yankai
contents	Recent studies show that large language models (LLMs) exhibit self-preference bias when serving as judges, meaning they tend to favor their own responses over those generated by other models. Existing methods typically measure this bias by calculating the difference between the scores a judge model assigns to its own responses and those it assigns to responses from other models. However, this approach conflates self-preference bias with response quality, as higher-quality responses from the judge model may also lead to positive score differences, even in the absence of bias. To address this issue, we introduce gold judgments as proxies for the actual quality of responses and propose the DBG score, which measures self-preference bias as the difference between the scores assigned by the judge model to its own responses and the corresponding gold judgments. Since gold judgments reflect true response quality, the DBG score mitigates the confounding effect of response quality on bias measurement. Using the DBG score, we conduct comprehensive experiments to assess self-preference bias across LLMs of varying versions, sizes, and reasoning abilities. Additionally, we investigate two factors that influence and help alleviate self-preference bias: response text style and the post-training data of judge models. Finally, we explore potential underlying mechanisms of self-preference bias from an attention-based perspective. Our code and data are available at https://github.com/zhiyuanc2001/self-preference.
format	Preprint
id	arxiv_https___arxiv_org_abs_2506_02592
institution	arXiv
publishDate	2025
record_format	arxiv
title	Beyond the Surface: Measuring Self-Preference in LLM Judgments
topic	Computation and Language
url	https://arxiv.org/abs/2506.02592
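
The contents field above contrasts two ways of measuring self-preference bias: the gap between the scores a judge gives its own responses and those it gives other models' responses, and the DBG score, which compares the judge's scores for its own responses against gold judgments. The following is a minimal Python sketch of that contrast, not the authors' implementation. It assumes pointwise numeric scores on a shared scale and a simple mean difference; the function names `naive_gap` and `dbg_score` and the sample numbers are hypothetical, and the paper's exact formulation may differ.

```python
from statistics import mean


def naive_gap(own_scores, other_scores):
    """Mean judge score for its own responses minus mean score for others.

    This mixes bias with quality: it is positive whenever the judge's
    own responses are genuinely better, even if the judge is unbiased.
    """
    return mean(own_scores) - mean(other_scores)


def dbg_score(judge_scores_own, gold_scores_own):
    """Mean difference between judge scores and gold judgments.

    Both lists refer to the same responses (the judge model's own
    outputs), so response quality is held fixed. Assumed form: a plain
    average of per-response differences.
    """
    if len(judge_scores_own) != len(gold_scores_own):
        raise ValueError("score lists must be the same length")
    diffs = [j - g for j, g in zip(judge_scores_own, gold_scores_own)]
    return mean(diffs)


# Hypothetical data: the judge's own responses are strong, and the judge
# scores them exactly as the gold judgments do.
own = [9, 8, 9]
gold = [9, 8, 9]
others = [6, 5, 7]

print(naive_gap(own, others))  # positive, although the judge is unbiased here
print(dbg_score(own, gold))    # zero, since judge and gold agree
```

In this toy case the naive gap is positive purely because the judge's own responses are better, while the gold-referenced difference is zero, which is the confound the abstract describes.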

Similar Items