Staff View: GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents :: Library Catalog

Bibliographic Details
Main Authors:	Diao, Lingxiao; Xu, Xinyue; Sun, Wanxuan; Yang, Cheng; Zhang, Zhuosheng
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2505.11368

_version_	1866912435173064704
author	Diao, Lingxiao; Xu, Xinyue; Sun, Wanxuan; Yang, Cheng; Zhang, Zhuosheng
author_facet	Diao, Lingxiao; Xu, Xinyue; Sun, Wanxuan; Yang, Cheng; Zhang, Zhuosheng
contents	Large language models (LLMs) have been widely deployed as autonomous agents capable of following user instructions and making decisions in real-world applications. Previous studies have made notable progress in benchmarking the instruction following capabilities of LLMs in general domains, with a primary focus on their inherent commonsense knowledge. Recently, LLMs have been increasingly deployed as domain-oriented agents, which rely on domain-oriented guidelines that may conflict with their commonsense knowledge. These guidelines exhibit two key characteristics: they consist of a wide range of domain-oriented rules and are subject to frequent updates. Despite these challenges, the absence of comprehensive benchmarks for evaluating the domain-oriented guideline following capabilities of LLMs presents a significant obstacle to their effective assessment and further development. In this paper, we introduce GuideBench, a comprehensive benchmark designed to evaluate guideline following performance of LLMs. GuideBench evaluates LLMs on three critical aspects: (i) adherence to diverse rules, (ii) robustness to rule updates, and (iii) alignment with human preferences. Experimental results on a range of LLMs indicate substantial opportunities for improving their ability to follow domain-oriented guidelines.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_11368
institution	arXiv
publishDate	2025
record_format	arxiv
title	GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents
topic	Computation and Language
url	https://arxiv.org/abs/2505.11368

Similar Items