Bibliographic Details
Main Authors: Shetty, Pranam Prakash, Balakrishnan, Adarsh, Xu, Mengqiao, Xi, Xiaoyin, Yu, Zhe
Format: Preprint
Published: 2026
Subjects: Software Engineering, 68-04
Online Access: https://arxiv.org/abs/2603.06276
author Shetty, Pranam Prakash
Balakrishnan, Adarsh
Xu, Mengqiao
Xi, Xiaoyin
Yu, Zhe
author_facet Shetty, Pranam Prakash
Balakrishnan, Adarsh
Xu, Mengqiao
Xi, Xiaoyin
Yu, Zhe
contents This study investigates the use of large language models (LLMs) for story point estimation. Story points are unitless, project-specific effort estimates that help developers on a scrum team forecast which product backlog items they plan to complete in a sprint. To support this process, machine learning models, especially deep neural networks, have been applied to predict story points from the title and description of each item. However, such models require a sufficient amount of training data (with ground-truth story points annotated by human developers) from the same software project to achieve decent prediction performance. This motivated us to explore whether LLMs can predict story points (RQ1) without training data or (RQ2) with only a few training data points. Our empirical results with four LLMs on 16 software projects show that, without any training data (zero-shot prompting), LLMs can predict story points better than supervised deep learning models trained on 80% of the data. The prediction performance of LLMs can be further improved with a few training examples (few-shot prompting). In addition, a recent study explored the use of comparative judgments (deciding, for a given pair of items, which one requires more effort to implement) instead of direct story point annotation to reduce the cognitive burden on developers. Therefore, this study also explores (RQ3) whether comparative judgments are easier for LLMs to predict than story points and (RQ4) whether comparative judgments can serve as few-shot examples that improve LLMs' story point predictions. Empirical results suggest that comparative judgments are not easier for LLMs to predict than story points, but that comparative judgments used as few-shot examples improve the LLMs' prediction performance as well as human-annotated story points do.
format Preprint
id arxiv_https___arxiv_org_abs_2603_06276
institution arXiv
publishDate 2026
record_format arxiv
title Story Point Estimation Using Large Language Models
topic Software Engineering
68-04
url https://arxiv.org/abs/2603.06276