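Staff View: On the Soundness and Consistency of LLM Agents for Executing Test Cases Written in Natural Language :: Library Catalog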

Bibliographic Details
Main Authors:	Salva, Sébastien; Taguelmimt, Redha
Format:	Preprint
Published:	2025
Subjects:	Software Engineering; Artificial Intelligence; D.2.4; D.2.5; F.3.1
Online Access:	https://arxiv.org/abs/2509.19136

_version_	1866918152122662912
author	Salva, Sébastien; Taguelmimt, Redha
author_facet	Salva, Sébastien; Taguelmimt, Redha
contents	The use of natural language (NL) test cases for validating graphical user interface (GUI) applications is emerging as a promising alternative to manually written executable test scripts, which are costly to develop and difficult to maintain. Recent advances in large language models (LLMs) have opened the possibility of the direct execution of NL test cases by LLM agents. This paper investigates this direction, focusing on its impact on NL test case unsoundness and on test case execution consistency. NL test cases are inherently unsound, as they may yield false failures due to ambiguous instructions or unpredictable agent behaviour. Furthermore, repeated executions of the same NL test case may lead to inconsistent outcomes, undermining test reliability. To address these challenges, we propose an algorithm for executing NL test cases with guardrail mechanisms and specialised agents that dynamically verify the correct execution of each test step. We introduce measures to evaluate the capabilities of LLMs in test execution and one measure to quantify execution consistency. We propose a definition of weak unsoundness to characterise contexts in which NL test case execution remains acceptable with respect to the Six Sigma industrial quality levels. Our experimental evaluation with eight publicly available LLMs, ranging from 3B to 70B parameters, demonstrates both the potential and the current limitations of LLM agents for GUI testing. Our experiments show that Meta Llama 3.1 70B demonstrates acceptable capabilities in NL test case execution with high execution consistency (above the 3-sigma level). We provide prototype tools, test suites, and results.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_19136
institution	arXiv
publishDate	2025
record_format	arxiv
title	On the Soundness and Consistency of LLM Agents for Executing Test Cases Written in Natural Language
topic	Software Engineering; Artificial Intelligence; D.2.4; D.2.5; F.3.1
url	https://arxiv.org/abs/2509.19136