Saved in:
| Main Authors: | , |
|---|---|
| Format: | Preprint |
| Published: |
2026
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2601.01184 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| _version_ | 1866918271102484480 |
|---|---|
| author | Sijwali, Suryansh Singh Saha, Suman |
| author_facet | Sijwali, Suryansh Singh Saha, Suman |
| contents | Large Language Models (LLMs) can generate plausible code, but in settings that require exact stdin/stdout behavior they frequently produce programs that compile yet fail tests, and in some cases they introduce security-sensitive patterns. This paper presents SecureCodeRL, a reinforcement learning (RL) pipeline for security-aware code generation that optimizes a combined reward R = αRfunc + \b{eta}Rsec. The key idea is a partial-credit functional reward that assigns intermediate scores for syntactic validity, successful execution, and producing output, reducing reward sparsity that otherwise stalls learning on competitive programming style tasks. I evaluate supervised fine-tuning (SFT) and PPO variants on a small held-out prompt set from APPS+ and observe that PPO with partial credit (using a continued-training variant) improves syntax validity from 45% (SFT) to 60% and achieves the only non-zero test success signal in this pilot evaluation (5% at-least-one-test-pass), while remaining 100% clean under Bandit static analysis. Although Bandit findings were absent in this small evaluation, the security term is integrated into training to discourage insecure shortcuts when they appear. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2601_01184 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| spellingShingle | SecureCodeRL: Security-Aware Reinforcement Learning for Code Generation with Partial-Credit Rewards Sijwali, Suryansh Singh Saha, Suman Cryptography and Security Large Language Models (LLMs) can generate plausible code, but in settings that require exact stdin/stdout behavior they frequently produce programs that compile yet fail tests, and in some cases they introduce security-sensitive patterns. This paper presents SecureCodeRL, a reinforcement learning (RL) pipeline for security-aware code generation that optimizes a combined reward R = αRfunc + \b{eta}Rsec. The key idea is a partial-credit functional reward that assigns intermediate scores for syntactic validity, successful execution, and producing output, reducing reward sparsity that otherwise stalls learning on competitive programming style tasks. I evaluate supervised fine-tuning (SFT) and PPO variants on a small held-out prompt set from APPS+ and observe that PPO with partial credit (using a continued-training variant) improves syntax validity from 45% (SFT) to 60% and achieves the only non-zero test success signal in this pilot evaluation (5% at-least-one-test-pass), while remaining 100% clean under Bandit static analysis. Although Bandit findings were absent in this small evaluation, the security term is integrated into training to discourage insecure shortcuts when they appear. |
| title | SecureCodeRL: Security-Aware Reinforcement Learning for Code Generation with Partial-Credit Rewards |
| topic | Cryptography and Security |
| url | https://arxiv.org/abs/2601.01184 |