| Main Authors: | Muhammad, Rizqullah; Albassam, Emad |
|---|---|
| Format: | Digital resource |
| Language: | |
| Published: | Zenodo, 2025 |
| Online Access: | https://doi.org/10.5281/zenodo.17735860 |
| _version_ | 1866901824549683200 |
|---|---|
| author | Muhammad, Rizqullah; Albassam, Emad |
| author_facet | Muhammad, Rizqullah; Albassam, Emad |
| contents | This repository contains the experimental data and analysis for a research study demonstrating how **Test-Driven Prompting (TDP)** significantly improves AI code generation across multiple programming languages and difficulty levels. All experimental data, statistical analysis, and qualitative comparisons are included. <br><br> ## What is Test-Driven Prompting? <br><br> Instead of only asking an AI to write code, Test-Driven Prompting includes example test cases in the prompt. The tests show the model exactly what the code should do, much as human programmers use test-driven development. <br><br> ## Key Findings <br><br> Our study tested four LLMs (GPT-4o, GPT-4o-mini, Qwen 14B, Qwen 32B) on Android (Java) and iOS (Swift) and found: <br> - First mobile evaluation: While TDP has mainly been studied for Python, this is the first empirical study of TDP in mobile development (Java, Swift). <br> - Scope & scale: 8,704 evaluations across 544 programming tasks (HumanEval, MBPP), comparing two prompting strategies (base, test-driven) and four LLMs (GPT-4o, GPT-4o-mini, Qwen 14B, Qwen 32B). <br> - Measured effect: Average accuracy increase of +2.22 percentage points (pp) over baseline for TDP (95% CI [1.22–3.23 pp], p < 0.001, d = 0.3974). <br> - Platform differences: LLMs perform worse on mobile languages (66.85%–88.87%) than on Python (86.90%–91.30%), and respond less to TDP in mobile development. <br> - ⚖️ Practical guidance: We provide recommendations for choosing trade-offs (max accuracy vs. budget vs. self-hosted) and offer platform-specific suggestions for applying TDP. <br><br> TDP is a reliable, low-overhead prompt engineering method for mobile app development that integrates smoothly with existing test-driven workflows; we recommend that researchers and practitioners adopt TDP as a standard part of LLM-assisted mobile development. <br><br> ## Authors <br><br> **Muhammad Rizqullah** (<mrizqullah@stu.kau.edu.sa>) and **Emad Albassam** (<ealbassam@kau.edu.sa>) <br> Computer Science Department, King Abdulaziz University, Jeddah, Saudi Arabia <br><br> *Corresponding author: Muhammad Rizqullah. Any enquiries about the research should be directed to him.* <br><br> ## Repository Structure <br><br> - `datasets/` - Programming problems and test cases from HumanEval, MBPP, and Code Contests <br> - `raw_results/` - Complete experimental results for each AI model and dataset combination <br> - `results/` - Statistical analysis and comparison reports |
| format | Digital resource |
| id | zenodo_https___doi_org_10_5281_zenodo_17735860 |
| institution | Zenodo |
| language | |
| publishDate | 2025 |
| publisher | Zenodo |
| record_format | zenodo |
| title | IJIM Dataset & Experimental Results |
| url | https://doi.org/10.5281/zenodo.17735860 |
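
The record describes Test-Driven Prompting as adding example test cases to an otherwise ordinary code-generation prompt. The following is a minimal sketch of that idea, not the authors' actual prompt templates: the function names, the prompt wording, and the `sumEven` task are all hypothetical. The last lines check that the reported 8,704 evaluations match a full crossing of tasks, languages, strategies, and models, which is an assumption about the study design that the record does not state explicitly.

```python
def build_base_prompt(task_description: str, language: str) -> str:
    """Baseline strategy: the prompt contains only the task description."""
    return (
        f"Write a {language} function for the following task.\n\n"
        f"Task: {task_description}\n"
    )


def build_tdp_prompt(task_description: str, language: str, test_cases: list[str]) -> str:
    """Test-driven strategy: the same prompt plus example test cases."""
    tests = "\n".join(f"- {case}" for case in test_cases)
    return (
        build_base_prompt(task_description, language)
        + "\nThe solution must pass these example tests:\n"
        + tests
        + "\n"
    )


# Hypothetical task and tests, for illustration only.
task = "Return the sum of all even numbers in a list of integers."
example_tests = ["sumEven([1, 2, 3, 4]) == 6", "sumEven([]) == 0"]

base_prompt = build_base_prompt(task, "Swift")
tdp_prompt = build_tdp_prompt(task, "Swift", example_tests)

# Assumed full factorial design: tasks x languages x strategies x models.
num_tasks, num_languages, num_strategies, num_models = 544, 2, 2, 4
total_evaluations = num_tasks * num_languages * num_strategies * num_models
```

Keeping the TDP prompt as the base prompt plus a test section means the two strategies differ only in the presence of the tests, which is what a comparison between them needs.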