Staff View: Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning :: Library Catalog

Bibliographic Details
Main Authors:	Golubev, Alexander; Trofimova, Maria; Polezhaev, Sergei; Badertdinov, Ibragim; Nekrashevich, Maksim; Shevtsov, Anton; Karasik, Simon; Abramov, Sergey; Andriushchenko, Andrei; Fisin, Filipp; Skvortsov, Sergei; Yangel, Boris
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Computation and Language; Software Engineering
Online Access:	https://arxiv.org/abs/2508.03501

_version_	1866911203391963136
author	Golubev, Alexander; Trofimova, Maria; Polezhaev, Sergei; Badertdinov, Ibragim; Nekrashevich, Maksim; Shevtsov, Anton; Karasik, Simon; Abramov, Sergey; Andriushchenko, Andrei; Fisin, Filipp; Skvortsov, Sergei; Yangel, Boris
author_facet	Golubev, Alexander; Trofimova, Maria; Polezhaev, Sergei; Badertdinov, Ibragim; Nekrashevich, Maksim; Shevtsov, Anton; Karasik, Simon; Abramov, Sergey; Andriushchenko, Andrei; Fisin, Filipp; Skvortsov, Sergei; Yangel, Boris
contents	Research on applications of reinforcement learning (RL) to large language models has mostly been focused on single-turn problems, such as mathematical reasoning or single-shot code generation. While these problems can be viewed as token-level multi-turn Markov decision processes (MDPs), this view corresponds to a degenerate case of multi-turn interaction where the environment provides no feedback. This contrasts with many real-world domains, such as software engineering (SWE), which require rich multi-turn interactions with a stateful environment that responds to each action with a non-trivial observation. To bridge this gap, we demonstrate the successful application of RL to this general regime. Our methodology begins with rejection fine-tuning (RFT) using execution feedback to train a policy to follow instructions and formatting effectively, followed by a synchronous RL pipeline using DAPO for iterative improvement. Applying this pipeline to Qwen2.5-72B-Instruct, we increase its Pass@1 on the SWE-bench Verified benchmark from 11% to 39%, substantially improving upon the 20% RFT baseline. On the May and June splits of SWE-rebench, the resulting agent achieves Pass@1 of 35% and 31% respectively, competitive with even larger models such as DeepSeek-V3-0324 or Qwen3-235B-A22B, demonstrating that our methodology offers a practical approach for training capable agents for multi-turn interactive tasks using open-weight models.
format	Preprint
id	arxiv_https___arxiv_org_abs_2508_03501
institution	arXiv
publishDate	2025
record_format	arxiv
title	Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning
topic	Machine Learning; Computation and Language; Software Engineering
url	https://arxiv.org/abs/2508.03501

Similar Items