Bibliographic Details
Main Authors: Kuo, Tzu-Lin, Liao, Feng-Ting, Hsieh, Mu-Wei, Chang, Fu-Chieh, Hsu, Po-Chun, Shiu, Da-Shan
Format: Preprint
Published: 2024
Subjects:
Online Access: https://arxiv.org/abs/2409.12558
_version_ 1866916625228234752
author Kuo, Tzu-Lin
Liao, Feng-Ting
Hsieh, Mu-Wei
Chang, Fu-Chieh
Hsu, Po-Chun
Shiu, Da-Shan
contents In real-world applications of Large Language Models (LLMs), external retrieval mechanisms, such as Search-Augmented Generation (SAG), tool utilization, and Retrieval-Augmented Generation (RAG), are often employed to enhance the quality of augmented generations in dialogues. These approaches often come with multi-turn dialogue, where each interaction is enriched by relevant information retrieved from external sources. Existing benchmarks assess either LLMs' chat abilities in multi-turn dialogues or their use of retrieval for augmented responses in single-turn settings. However, there is a gap in evaluating LLMs' ability to leverage retrieval for more precise responses across multiple turns. To address this limitation, we introduce RAD-Bench (Retrieval Augmented Dialogue), a benchmark designed to evaluate LLMs' capabilities in multi-turn dialogues following retrievals, which is essential for their deployment in context-rich applications. RAD-Bench evaluates two key abilities of LLMs: Retrieval Synthesis and Retrieval Reasoning. These are measured using discriminative questions, retrieved contexts, and corresponding reference answers, assessing how effectively LLMs integrate and reason with context to maintain and enhance conversation quality over multiple turns. Our evaluation results on commonly used LLMs reveal that model performance deteriorates as additional layers of conditions or constraints are applied across conversation turns, even when accurate retrieved contexts are provided. The data and code are available at https://github.com/mtkresearch/RAD-Bench
format Preprint
id arxiv_https___arxiv_org_abs_2409_12558
institution arXiv
publishDate 2024
record_format arxiv
title RAD-Bench: Evaluating Large Language Models Capabilities in Retrieval Augmented Dialogues
topic Computation and Language
url https://arxiv.org/abs/2409.12558