Saved in:
Bibliographic Details
Main Author: Zhan, Anthony
Format: Preprint
Published: 2025
Subjects:
Online Access:https://arxiv.org/abs/2510.04019
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1866911412723384320
author Zhan, Anthony
author_facet Zhan, Anthony
contents Diffusion large language models (dLLMs), which offer a promising alternative to traditional autoregressive LLMs, have recently shown strong results in pretraining. However, due to their lack of tractable sequence-level likelihoods, they have yet to benefit from modern LLM post-training techniques such as reinforcement learning (RL), limiting their real-world applicability. Existing attempts at dLLM post-training rely on heuristic approximations or lower bounds of the true likelihood. In this work, we propose Amortized Group Relative Policy Optimization (AGRPO), a policy gradient algorithm that leverages the multi-step Markovian nature of dLLM generation, optimizing individual denoising steps rather than entire sequences. We demonstrate AGRPO's effectiveness on different math and reasoning tasks, achieving +9.9\% absolute gain on GSM8K, +4.6\% on MATH-500, +59.4\% on Countdown, and +69.7\% on Sudoku over the base LLaDA model, improving upon comparable dLLM RL methods such as diffu-GRPO. Furthermore, we analyze how post-training gains persist across different inference configurations, revealing that models trained with AGRPO can sample 4x faster with minimal performance sacrifices.
format Preprint
id arxiv_https___arxiv_org_abs_2510_04019
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle Simple Policy Gradients for Reasoning with Diffusion Language Models
Zhan, Anthony
Machine Learning
Artificial Intelligence
Computation and Language
Diffusion large language models (dLLMs), which offer a promising alternative to traditional autoregressive LLMs, have recently shown strong results in pretraining. However, due to their lack of tractable sequence-level likelihoods, they have yet to benefit from modern LLM post-training techniques such as reinforcement learning (RL), limiting their real-world applicability. Existing attempts at dLLM post-training rely on heuristic approximations or lower bounds of the true likelihood. In this work, we propose Amortized Group Relative Policy Optimization (AGRPO), a policy gradient algorithm that leverages the multi-step Markovian nature of dLLM generation, optimizing individual denoising steps rather than entire sequences. We demonstrate AGRPO's effectiveness on different math and reasoning tasks, achieving +9.9\% absolute gain on GSM8K, +4.6\% on MATH-500, +59.4\% on Countdown, and +69.7\% on Sudoku over the base LLaDA model, improving upon comparable dLLM RL methods such as diffu-GRPO. Furthermore, we analyze how post-training gains persist across different inference configurations, revealing that models trained with AGRPO can sample 4x faster with minimal performance sacrifices.
title Simple Policy Gradients for Reasoning with Diffusion Language Models
topic Machine Learning
Artificial Intelligence
Computation and Language
url https://arxiv.org/abs/2510.04019