Bibliographic Details
Main Authors: Diao, Lingxiao, Xu, Xinyue, Sun, Wanxuan, Yang, Cheng, Zhang, Zhuosheng
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2505.11368
author Diao, Lingxiao
Xu, Xinyue
Sun, Wanxuan
Yang, Cheng
Zhang, Zhuosheng
contents Large language models (LLMs) have been widely deployed as autonomous agents capable of following user instructions and making decisions in real-world applications. Previous studies have made notable progress in benchmarking the instruction-following capabilities of LLMs in general domains, with a primary focus on their inherent commonsense knowledge. Recently, LLMs have been increasingly deployed as domain-oriented agents, which rely on domain-oriented guidelines that may conflict with their commonsense knowledge. These guidelines exhibit two key characteristics: they consist of a wide range of domain-oriented rules and are subject to frequent updates. Despite these challenges, the absence of comprehensive benchmarks for evaluating the domain-oriented guideline-following capabilities of LLMs presents a significant obstacle to their effective assessment and further development. In this paper, we introduce GuideBench, a comprehensive benchmark designed to evaluate the guideline-following performance of LLMs. GuideBench evaluates LLMs on three critical aspects: (i) adherence to diverse rules, (ii) robustness to rule updates, and (iii) alignment with human preferences. Experimental results on a range of LLMs indicate substantial opportunities for improving their ability to follow domain-oriented guidelines.
format Preprint
id arxiv_https___arxiv_org_abs_2505_11368
institution arXiv
publishDate 2025
record_format arxiv
title GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents
topic Computation and Language
url https://arxiv.org/abs/2505.11368