Staff View: Large Language Models for Fault Localization: An Empirical Study :: Library Catalog


Bibliographic Details
Main Authors:	Xiao, YingJian; Hu, RongQun; Gong, WeiWei; Li, HongWei; Jie, AnQuan
Format:	Preprint
Published:	2025
Subjects:	Software Engineering
Online Access:	https://arxiv.org/abs/2510.20521

_version_	1866912665951010816
author	Xiao, YingJian; Hu, RongQun; Gong, WeiWei; Li, HongWei; Jie, AnQuan
author_facet	Xiao, YingJian; Hu, RongQun; Gong, WeiWei; Li, HongWei; Jie, AnQuan
contents	Large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks, particularly in automated program repair. However, the effectiveness of such repairs is highly dependent on the performance of upstream fault localization, for which comprehensive evaluations are currently lacking. This paper presents a systematic empirical study on LLMs in the statement-level code fault localization task. We evaluate representative open-source models (Qwen2.5-coder-32b-instruct, DeepSeek-V3) and closed-source models (GPT-4.1 mini, Gemini-2.5-flash) to assess their fault localization capabilities on the HumanEval-Java and Defects4J datasets. The study investigates the impact of different prompting strategies--including standard prompts, few-shot examples, and chain-of-reasoning--on model performance, with a focus on analysis across accuracy, time efficiency, and economic cost dimensions. Our experimental results show that incorporating bug report context significantly enhances model performance. Few-shot learning shows potential for improvement but exhibits noticeable diminishing marginal returns, while chain-of-thought reasoning's effectiveness is highly contingent on the model's inherent reasoning capabilities. This study not only highlights the performance characteristics and trade-offs of different models in fault localization tasks, but also offers valuable insights into the strengths of current LLMs and strategies for improving fault localization effectiveness.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_20521
institution	arXiv
publishDate	2025
record_format	arxiv
title	Large Language Models for Fault Localization: An Empirical Study
topic	Software Engineering
url	https://arxiv.org/abs/2510.20521

Similar Items