Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Rathnayake, Amila; Shahin, Mojtaba; Abaei, Golnoush
Format:	Preprint
Published:	2026
Subjects:	Software Engineering
Online Access:	https://arxiv.org/abs/2603.04729

_version_	1866915835666235392
author	Rathnayake, Amila; Shahin, Mojtaba; Abaei, Golnoush
author_facet	Rathnayake, Amila; Shahin, Mojtaba; Abaei, Golnoush
contents	This paper presents an evaluation of three LLMs, GPT-4, Claude 3, and Gemini, for automated Behaviour-Driven Development (BDD) scenario generation. To support this evaluation, we constructed a dataset of 500 user stories, requirement descriptions, and their corresponding BDD scenarios, drawn from four proprietary software products. We assessed the quality of BDD scenarios generated by LLMs using a multidimensional evaluation framework encompassing text and semantic similarity metrics, LLM-based evaluation, and human expert assessment. Our findings reveal that although GPT-4 achieves higher scores in text and semantic similarity metrics, Claude 3 produces scenarios rated highest by both human experts and LLM-based evaluators. LLM-based evaluators, particularly DeepSeek, show a stronger correlation with human judgment than with text similarity and semantic similarity metrics. The effectiveness of prompting techniques is model-specific: GPT-4 performs best with zero-shot, Claude 3 benefits from chain-of-thought reasoning, and Gemini achieves optimal results with few-shot examples. Input quality determines the effectiveness of BDD scenario generation: detailed requirement descriptions alone yield high-quality scenarios, whereas user stories alone yield low-quality scenarios. Our experiments indicate that setting temperature to 0 and top_p to 1.0 produced the highest-quality BDD scenarios across all models.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_04729
institution	arXiv
publishDate	2026
record_format	arxiv
title	Behaviour Driven Development Scenario Generation with Large Language Models
topic	Software Engineering
url	https://arxiv.org/abs/2603.04729
