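Staff View: SecureCodeRL: Security-Aware Reinforcement Learning for Code Generation with Partial-Credit Rewards :: Library Catalog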

Bibliographic Details
Main Authors:	Sijwali, Suryansh Singh; Saha, Suman
Format:	Preprint
Published:	2026
Subjects:	Cryptography and Security
Online Access:	https://arxiv.org/abs/2601.01184

_version_	1866918271102484480
author	Sijwali, Suryansh Singh; Saha, Suman
author_facet	Sijwali, Suryansh Singh; Saha, Suman
contents	Large Language Models (LLMs) can generate plausible code, but in settings that require exact stdin/stdout behavior they frequently produce programs that compile yet fail tests, and in some cases they introduce security-sensitive patterns. This paper presents SecureCodeRL, a reinforcement learning (RL) pipeline for security-aware code generation that optimizes a combined reward R = αR_func + βR_sec. The key idea is a partial-credit functional reward that assigns intermediate scores for syntactic validity, successful execution, and producing output, reducing reward sparsity that otherwise stalls learning on competitive-programming-style tasks. I evaluate supervised fine-tuning (SFT) and PPO variants on a small held-out prompt set from APPS+ and observe that PPO with partial credit (using a continued-training variant) improves syntax validity from 45% (SFT) to 60% and achieves the only non-zero test success signal in this pilot evaluation (5% at-least-one-test-pass), while remaining 100% clean under Bandit static analysis. Although Bandit findings were absent in this small evaluation, the security term is integrated into training to discourage insecure shortcuts when they appear.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_01184
institution	arXiv
publishDate	2026
record_format	arxiv
title	SecureCodeRL: Security-Aware Reinforcement Learning for Code Generation with Partial-Credit Rewards
topic	Cryptography and Security
url	https://arxiv.org/abs/2601.01184

Similar Items