Staff view: Benchmarking Correctness and Security in Multi-Turn Code Generation :: Library Catalog

Bibliographic Details
Main authors:	Rawal, Ruchit; Chiang, Jeffrey Yang Fan; Shen, Chihao; Tian, Jeffery Siyuan; Mahajan, Aastha; Goldstein, Tom; Chen, Yizheng
Format:	Preprint
Published:	2025
Subjects:	Software Engineering; Artificial Intelligence
Online access:	https://arxiv.org/abs/2510.13859

_version_	1866914095989522432
author	Rawal, Ruchit; Chiang, Jeffrey Yang Fan; Shen, Chihao; Tian, Jeffery Siyuan; Mahajan, Aastha; Goldstein, Tom; Chen, Yizheng
author_facet	Rawal, Ruchit; Chiang, Jeffrey Yang Fan; Shen, Chihao; Tian, Jeffery Siyuan; Mahajan, Aastha; Goldstein, Tom; Chen, Yizheng
contents	AI coding assistants powered by large language models (LLMs) have transformed software development, significantly boosting productivity. While existing benchmarks evaluate the correctness and security of LLM-generated code, they are typically limited to single-turn tasks that do not reflect the iterative nature of real-world development. We introduce MT-Sec, the first benchmark to systematically evaluate both correctness and security in multi-turn coding scenarios. We construct it using a synthetic data pipeline that transforms existing single-turn tasks into semantically aligned multi-turn interaction sequences, allowing reuse of the original test suites while modeling the complexity of real-world coding processes. We evaluate 32 open- and closed-source models and three agent scaffoldings on MT-Sec and observe a consistent 20-27% drop in "correct and secure" outputs from single-turn to multi-turn settings, even among state-of-the-art models. Beyond full-program generation, we also evaluate models on multi-turn code-diff generation, an unexplored yet practically relevant setting, and find that models perform worse here, with increased rates of functionally incorrect and insecure outputs. Finally, we find that while agent scaffoldings boost single-turn code generation performance, they are not quite as effective in multi-turn evaluations. Together, these findings highlight the need for benchmarks that jointly evaluate correctness and security in multi-turn, real-world coding workflows.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_13859
institution	arXiv
publishDate	2025
record_format	arxiv
title	Benchmarking Correctness and Security in Multi-Turn Code Generation
topic	Software Engineering; Artificial Intelligence
url	https://arxiv.org/abs/2510.13859