Bibliographic Details
Main Authors: Wei, Chenxing, Kang, Jiazhen, Wang, Hong, Zhang, Jianqing, Jiang, Hao, Xu, Xiaolong, Sun, Ningyuan, He, Ying, Yu, F. Richard, Shu, Yao, Jiang, Bo
Format: Preprint
Published: 2026
Online Access:https://arxiv.org/abs/2603.01563
contents Reinforcement Learning with Verifiable Rewards (RLVR) has achieved remarkable success in improving autoregressive models, especially in domains that require verifiable correctness, such as mathematical reasoning and code generation. However, directly applying such paradigms to Diffusion Large Language Models (dLLMs) is fundamentally hindered by the intractability of exact likelihood computation, which forces existing methods to rely on high-variance approximations. To bridge this gap, we propose Likelihood-Free Policy Optimization (LFPO), a native framework that maps the concept of vector field flow matching to the discrete token space. Specifically, LFPO formulates alignment as geometric velocity rectification, which directly optimizes denoising logits via contrastive updates. This design bypasses the errors inherent in likelihood approximation, yielding more precise gradient estimates. Furthermore, LFPO enforces consistency by predicting final solutions from intermediate steps, straightening the probability flow to enable high-quality generation with significantly fewer iterations. Extensive experiments demonstrate that LFPO not only outperforms state-of-the-art baselines on code and reasoning benchmarks but also accelerates inference by approximately 20% through reduced diffusion steps.
id arxiv_https___arxiv_org_abs_2603_01563
institution arXiv
record_format arxiv
title LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models
topic Machine Learning
Artificial Intelligence