Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Author:	Zhan, Anthony
Format:	Preprint
Published:	2025
Subjects:	Machine Learning Artificial Intelligence Computation and Language
Online Access:	https://arxiv.org/abs/2510.04019
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866911412723384320
author	Zhan, Anthony
author_facet	Zhan, Anthony
contents	Diffusion large language models (dLLMs), which offer a promising alternative to traditional autoregressive LLMs, have recently shown strong results in pretraining. However, due to their lack of tractable sequence-level likelihoods, they have yet to benefit from modern LLM post-training techniques such as reinforcement learning (RL), limiting their real-world applicability. Existing attempts at dLLM post-training rely on heuristic approximations or lower bounds of the true likelihood. In this work, we propose Amortized Group Relative Policy Optimization (AGRPO), a policy gradient algorithm that leverages the multi-step Markovian nature of dLLM generation, optimizing individual denoising steps rather than entire sequences. We demonstrate AGRPO's effectiveness on different math and reasoning tasks, achieving +9.9\% absolute gain on GSM8K, +4.6\% on MATH-500, +59.4\% on Countdown, and +69.7\% on Sudoku over the base LLaDA model, improving upon comparable dLLM RL methods such as diffu-GRPO. Furthermore, we analyze how post-training gains persist across different inference configurations, revealing that models trained with AGRPO can sample 4x faster with minimal performance sacrifices.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_04019
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	Simple Policy Gradients for Reasoning with Diffusion Language Models Zhan, Anthony Machine Learning Artificial Intelligence Computation and Language Diffusion large language models (dLLMs), which offer a promising alternative to traditional autoregressive LLMs, have recently shown strong results in pretraining. However, due to their lack of tractable sequence-level likelihoods, they have yet to benefit from modern LLM post-training techniques such as reinforcement learning (RL), limiting their real-world applicability. Existing attempts at dLLM post-training rely on heuristic approximations or lower bounds of the true likelihood. In this work, we propose Amortized Group Relative Policy Optimization (AGRPO), a policy gradient algorithm that leverages the multi-step Markovian nature of dLLM generation, optimizing individual denoising steps rather than entire sequences. We demonstrate AGRPO's effectiveness on different math and reasoning tasks, achieving +9.9\% absolute gain on GSM8K, +4.6\% on MATH-500, +59.4\% on Countdown, and +69.7\% on Sudoku over the base LLaDA model, improving upon comparable dLLM RL methods such as diffu-GRPO. Furthermore, we analyze how post-training gains persist across different inference configurations, revealing that models trained with AGRPO can sample 4x faster with minimal performance sacrifices.
title	Simple Policy Gradients for Reasoning with Diffusion Language Models
topic	Machine Learning Artificial Intelligence Computation and Language
url	https://arxiv.org/abs/2510.04019

Similar Items