

Bibliographic Details
Main authors:	Wu, Xiaodong; Wang, Minhao; Liu, Yichen; Shi, Xiaoming; Yan, He; Lu, Xiangju; Zhu, Junmin; Zhang, Wei
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online access:	https://arxiv.org/abs/2411.07037

_version_	1866908463052881920
author	Wu, Xiaodong; Wang, Minhao; Liu, Yichen; Shi, Xiaoming; Yan, He; Lu, Xiangju; Zhu, Junmin; Zhang, Wei
author_facet	Wu, Xiaodong; Wang, Minhao; Liu, Yichen; Shi, Xiaoming; Yan, He; Lu, Xiangju; Zhu, Junmin; Zhang, Wei
contents	As Large Language Models (LLMs) evolve in natural language processing (NLP), their ability to stably follow instructions in long-context inputs has become critical for real-world applications. However, existing benchmarks seldom focus on instruction-following in long-context scenarios or stability on different inputs. To bridge this gap, we introduce LIFBench, a scalable dataset designed to evaluate LLMs' instruction-following capabilities and stability across long contexts. LIFBench comprises three long-context scenarios and eleven diverse tasks, featuring 2,766 instructions generated through an automated expansion method across three dimensions: length, expression, and variables. For evaluation, we propose LIFEval, a rubric-based assessment method that enables precise, automated scoring of complex LLM responses without reliance on LLM-assisted assessments or human judgment. This method allows for a comprehensive analysis of model performance and stability from multiple perspectives. We conduct detailed experiments on 20 prominent LLMs across six length intervals. Our work contributes LIFBench and LIFEval as robust tools for assessing LLM performance in complex and long-context settings, offering valuable insights to guide future advancements in LLM development.
format	Preprint
id	arxiv_https___arxiv_org_abs_2411_07037
institution	arXiv
publishDate	2024
record_format	arxiv
title	LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios
topic	Computation and Language
url	https://arxiv.org/abs/2411.07037
