Bibliographic Details
Main Authors: Rawal, Ruchit; Chiang, Jeffrey Yang Fan; Shen, Chihao; Tian, Jeffery Siyuan; Mahajan, Aastha; Goldstein, Tom; Chen, Yizheng
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2510.13859
_version_ 1866914095989522432
author Rawal, Ruchit
Chiang, Jeffrey Yang Fan
Shen, Chihao
Tian, Jeffery Siyuan
Mahajan, Aastha
Goldstein, Tom
Chen, Yizheng
author_facet Rawal, Ruchit
Chiang, Jeffrey Yang Fan
Shen, Chihao
Tian, Jeffery Siyuan
Mahajan, Aastha
Goldstein, Tom
Chen, Yizheng
contents AI coding assistants powered by large language models (LLMs) have transformed software development, significantly boosting productivity. While existing benchmarks evaluate the correctness and security of LLM-generated code, they are typically limited to single-turn tasks that do not reflect the iterative nature of real-world development. We introduce MT-Sec, the first benchmark to systematically evaluate both correctness and security in multi-turn coding scenarios. We construct it using a synthetic data pipeline that transforms existing single-turn tasks into semantically aligned multi-turn interaction sequences, allowing reuse of the original test suites while modeling the complexity of real-world coding processes. We evaluate 32 open- and closed-source models and three agent scaffoldings on MT-Sec and observe a consistent 20-27% drop in "correct and secure" outputs from single-turn to multi-turn settings, even among state-of-the-art models. Beyond full-program generation, we also evaluate models on multi-turn code-diff generation, an unexplored yet practically relevant setting, and find that models perform worse here, with increased rates of functionally incorrect and insecure outputs. Finally, we find that while agent scaffoldings boost single-turn code generation performance, they are less effective in multi-turn evaluations. Together, these findings highlight the need for benchmarks that jointly evaluate correctness and security in multi-turn, real-world coding workflows.
format Preprint
id arxiv_https___arxiv_org_abs_2510_13859
institution arXiv
publishDate 2025
record_format arxiv
title Benchmarking Correctness and Security in Multi-Turn Code Generation
topic Software Engineering
Artificial Intelligence
url https://arxiv.org/abs/2510.13859
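Note on the "correct and secure" metric in the abstract: the record does not say how the metric or the reported 20-27% drop is computed, nor whether the drop is absolute (percentage points) or relative. The Python sketch below is only one plausible reading, assuming each task yields a pass/fail result for functional tests and for security tests. The `Outcome` class, the function names, and the toy data are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Hypothetical per-task result: functional tests passed, security tests passed."""
    correct: bool
    secure: bool


def correct_and_secure_rate(outcomes):
    """Fraction of outcomes that are both functionally correct and secure."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(o.correct and o.secure for o in outcomes) / len(outcomes)


def drop(single_rate, multi_rate):
    """Return (absolute drop in percentage points, relative drop in percent)."""
    difference = single_rate - multi_rate
    absolute = difference * 100
    relative = difference / single_rate * 100 if single_rate else float("nan")
    return absolute, relative


# Toy data invented for illustration only; these are NOT results from the paper.
single_turn = [Outcome(True, True)] * 4 + [Outcome(True, False)]
multi_turn = [Outcome(True, True)] * 3 + [Outcome(True, False), Outcome(False, True)]

single = correct_and_secure_rate(single_turn)
multi = correct_and_secure_rate(multi_turn)
absolute_drop, relative_drop = drop(single, multi)

print(f"single-turn: {single:.2f}, multi-turn: {multi:.2f}")
print(f"absolute drop: {absolute_drop:.1f} points, relative drop: {relative_drop:.1f}%")
```

Counting an output only when both flags are true is what makes this a joint metric: an output that is correct but insecure, or secure but incorrect, contributes nothing. Reporting the absolute and relative drop side by side shows how far the two readings of a single percentage figure can diverge.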