Staff View: Story Point Estimation Using Large Language Models :: Library Catalog

Bibliographic Details
Main Authors:	Shetty, Pranam Prakash; Balakrishnan, Adarsh; Xu, Mengqiao; Xi, Xiaoyin; Yu, Zhe
Format:	Preprint
Published:	2026
Subjects:	Software Engineering 68-04
Online Access:	https://arxiv.org/abs/2603.06276

_version_	1866915923118522368
author	Shetty, Pranam Prakash; Balakrishnan, Adarsh; Xu, Mengqiao; Xi, Xiaoyin; Yu, Zhe
author_facet	Shetty, Pranam Prakash; Balakrishnan, Adarsh; Xu, Mengqiao; Xi, Xiaoyin; Yu, Zhe
contents	This study investigates the use of large language models (LLMs) for story point estimation. Story points are unitless, project-specific effort estimates that help developers on a scrum team forecast which product backlog items they can complete in a sprint. To facilitate this process, machine learning models, especially deep neural networks, have been applied to predict story points from the title and description of each item. However, such models require a sufficient amount of training data (with ground-truth story points annotated by human developers) from the same software project to achieve decent prediction performance. This motivated us to explore whether LLMs are capable of predicting story points (RQ1) without training data or (RQ2) with only a few training data points. Our empirical results with four LLMs on 16 software projects show that, without any training data (zero-shot prompting), LLMs can predict story points better than supervised deep learning models trained on 80% of the data. The prediction performance of LLMs can be further improved with a few training examples (few-shot prompting). In addition, a recent study explored the use of comparative judgments (deciding which of a given pair of items requires more effort to implement) instead of directly annotating story points, in order to reduce the cognitive burden on developers. Therefore, this study also explores (RQ3) whether comparative judgments are easier for LLMs to predict than story points and (RQ4) whether comparative judgments can serve as few-shot examples for LLMs to improve their predictions of story points. Empirical results suggest that it is not easier for LLMs to predict comparative judgments than to directly estimate story points, but comparative judgments can serve as few-shot examples that improve the LLMs' prediction performance as well as human-annotated story points do.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_06276
institution	arXiv
publishDate	2026
record_format	arxiv
title	Story Point Estimation Using Large Language Models
topic	Software Engineering 68-04
url	https://arxiv.org/abs/2603.06276

Similar Items