Staff View: Automated test generation to evaluate tool-augmented LLMs as conversational AI agents :: Library Catalog

Bibliographic Details
Main Authors:	Arcadinho, Samuel; Aparicio, David; Almeida, Mariana
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2409.15934

_version_	1866913539829006336
author	Arcadinho, Samuel; Aparicio, David; Almeida, Mariana
author_facet	Arcadinho, Samuel; Aparicio, David; Almeida, Mariana
contents	Tool-augmented LLMs are a promising approach to create AI agents that can have realistic conversations, follow procedures, and call appropriate functions. However, evaluating them is challenging due to the diversity of possible conversations, and existing datasets focus only on single interactions and function-calling. We present a test generation pipeline to evaluate LLMs as conversational AI agents. Our framework uses LLMs to generate diverse tests grounded in user-defined procedures. To do so, we use intermediate graphs to limit the LLM test generator's tendency to hallucinate content that is not grounded in the input procedures, and to enforce high coverage of the possible conversations. Additionally, we put forward ALMITA, a manually curated dataset for evaluating AI agents in customer support, and use it to evaluate existing LLMs. Our results show that while tool-augmented LLMs perform well in single interactions, they often struggle to handle complete conversations. While our focus is on customer support, our method is general and capable of evaluating AI agents in different domains.
format	Preprint
id	arxiv_https___arxiv_org_abs_2409_15934
institution	arXiv
publishDate	2024
record_format	arxiv
title	Automated test generation to evaluate tool-augmented LLMs as conversational AI agents
topic	Computation and Language; Artificial Intelligence; Machine Learning
url	https://arxiv.org/abs/2409.15934

Similar Items