Bibliographic Details
Main Authors: Golubev, Alexander; Trofimova, Maria; Polezhaev, Sergei; Badertdinov, Ibragim; Nekrashevich, Maksim; Shevtsov, Anton; Karasik, Simon; Abramov, Sergey; Andriushchenko, Andrei; Fisin, Filipp; Skvortsov, Sergei; Yangel, Boris
Format: Preprint
Published: 2025
Subjects: Machine Learning; Computation and Language; Software Engineering
Online Access: https://arxiv.org/abs/2508.03501
author Golubev, Alexander
Trofimova, Maria
Polezhaev, Sergei
Badertdinov, Ibragim
Nekrashevich, Maksim
Shevtsov, Anton
Karasik, Simon
Abramov, Sergey
Andriushchenko, Andrei
Fisin, Filipp
Skvortsov, Sergei
Yangel, Boris
contents Research on applications of reinforcement learning (RL) to large language models has mostly been focused on single-turn problems, such as mathematical reasoning or single-shot code generation. While these problems can be viewed as token-level multi-turn Markov decision processes (MDPs), this view corresponds to a degenerate case of multi-turn interaction where the environment provides no feedback. This contrasts with many real-world domains, such as software engineering (SWE), which require rich multi-turn interactions with a stateful environment that responds to each action with a non-trivial observation. To bridge this gap, we demonstrate the successful application of RL to this general regime. Our methodology begins with rejection fine-tuning (RFT) using execution feedback to train a policy to follow instructions and formatting effectively, followed by a synchronous RL pipeline using DAPO for iterative improvement. Applying this pipeline to Qwen2.5-72B-Instruct, we increase its Pass@1 on the SWE-bench Verified benchmark from 11% to 39%, substantially improving upon the 20% RFT baseline. On the May and June splits of SWE-rebench, the resulting agent achieves Pass@1 of 35% and 31% respectively, competitive with even larger models such as DeepSeek-V3-0324 or Qwen3-235B-A22B, demonstrating that our methodology offers a practical approach for training capable agents for multi-turn interactive tasks using open-weight models.
format Preprint
id arxiv_https___arxiv_org_abs_2508_03501
institution arXiv
publishDate 2025
record_format arxiv
title Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning
topic Machine Learning
Computation and Language
Software Engineering
url https://arxiv.org/abs/2508.03501