Staff View: IJIM Dataset & Experimental Results :: Library Catalog

Bibliographic Details
Main Authors:	Muhammad, Rizqullah; Albassam, Emad
Format:	Digital resource
Language:
Published:	Zenodo 2025
Online Access:	https://doi.org/10.5281/zenodo.17735860

_version_	1866901824549683200
author	Muhammad, Rizqullah; Albassam, Emad
author_facet	Muhammad, Rizqullah; Albassam, Emad
contents	<div> <div>This repository contains the experimental data and analysis for a research study showing that Test-Driven Prompting (TDP) significantly improves AI code generation across multiple programming languages and difficulty levels. All experimental data, statistical analysis, and qualitative comparisons are included.</div> <br> <div>## What is Test-Driven Prompting?</div> <br> <div>Instead of only asking an AI to write code, Test-Driven Prompting includes example test cases in the prompt. This helps the AI understand exactly what the code should do, similar to how human programmers use test-driven development.</div> <br> <div>![alt text](methodology_overview.png)</div> <br> <div>## Key Findings</div> <br> <div>Our study tested four LLMs (GPT-4o, GPT-4o-mini, Qwen 14B, Qwen 32B) on Android (Java) and iOS (Swift) and found:</div> <br> <div>- First mobile evaluation: While TDP has mainly been studied for Python, this is the first empirical study of TDP in mobile development (Java, Swift).</div> <div>- Scope & scale: 8,704 evaluations across 544 programming tasks (HumanEval, MBPP), comparing two prompting strategies (base, test-driven) and four LLMs (GPT-4o, GPT-4o-mini, Qwen 14B, Qwen 32B).</div> <div>- Measured effect: Average accuracy increase of +2.22 percentage points (pp) over baseline for TDP (95% CI [1.22–3.23 pp], p < 0.001, d = 0.3974).</div> <div>- Platform differences: LLMs perform worse on mobile languages (66.85%–88.87%) than on Python (86.90%–91.30%), and exhibit reduced responsiveness to TDP in mobile development.</div> <div>- Practical guidance: We provide recommendations for choosing trade-offs (max accuracy vs. budget vs. self-hosted) and offer platform-specific suggestions for applying TDP.</div> <br> <div>TDP is a reliable, low-overhead prompt engineering method for mobile app development that integrates smoothly with existing test-driven workflows; we recommend that researchers and practitioners adopt TDP as a standard part of LLM-assisted mobile development.</div> <br> <div>## Authors</div> <br> <div>Muhammad Rizqullah (<mrizqullah@stu.kau.edu.sa>) and Emad Albassam (<ealbassam@kau.edu.sa>)</div> <div>Computer Science Department, King Abdulaziz University, Jeddah, Saudi Arabia</div> <br> <div>Corresponding author: Muhammad Rizqullah. Any enquiries about the research should be directed to him.</div> <br> <div>## Repository Structure</div> <br> <div>- `datasets/` - Programming problems and test cases from HumanEval, MBPP, and Code Contests</div> <div>- `raw_results/` - Complete experimental results for each AI model and dataset combination</div> <div>- `results/` - Statistical analysis and comparison reports</div> </div>
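The description above says a test-driven prompt differs from a base prompt only by including example test cases. A minimal sketch of that idea follows; it is not the authors' actual prompt template. The function name `build_prompt`, the prompt wording, and the sample task and tests are all hypothetical, chosen only for illustration. The last lines check that the reported 8,704 evaluations are consistent with 544 tasks, 2 strategies, 4 models, and 2 languages, assuming each combination was run once.

```python
def build_prompt(task_description, test_cases=None):
    """Build a base prompt, or a test-driven prompt when test cases are given.

    Hypothetical template: the wording is an assumption, not taken from the study.
    """
    lines = ["Write a Java method for the following task.", "", task_description]
    if test_cases:
        # Test-driven variant: append example tests the solution should satisfy.
        lines += ["", "The solution must pass these example tests:"]
        lines += [f"- {case}" for case in test_cases]
    return "\n".join(lines)


# Hypothetical task and tests, for illustration only.
task = "Return the sum of all even numbers in an int array."
tests = [
    "sumEven(new int[]{1, 2, 3, 4}) == 6",
    "sumEven(new int[]{}) == 0",
]

base_prompt = build_prompt(task)        # base strategy: task only
tdp_prompt = build_prompt(task, tests)  # test-driven strategy: task plus tests

# Consistency check on the reported scale (assumes one run per combination).
tasks, strategies, models, languages = 544, 2, 4, 2
evaluations = tasks * strategies * models * languages
```

Sending `base_prompt` and `tdp_prompt` to the same model and scoring both outputs against the full test suite would give one base/test-driven comparison for one task.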
format	Digital resource
id	zenodo_https___doi_org_10_5281_zenodo_17735860
institution	Zenodo
language
publishDate	2025
publisher	Zenodo
record_format	zenodo
title	IJIM Dataset & Experimental Results
url	https://doi.org/10.5281/zenodo.17735860

Similar Items