| Main Authors: | Wang, Zezhou, Zhang, Ziyun, Zhang, Xiaoyi, Qian, Zhuzhong, Lu, Yan |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2601.05787 |
Similar Items
InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training
by: Zhang, Ziyun, et al.
Published: (2026)
On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting
by: Zhang, Wenhao, et al.
Published: (2025)
Off-OAB: Off-Policy Policy Gradient Method with Optimal Action-Dependent Baseline
by: Meng, Wenjia, et al.
Published: (2024)
Soft Policy Optimization: Online Off-Policy RL for Sequence Models
by: Cohen, Taco, et al.
Published: (2025)
Learning to Reason under Off-Policy Guidance
by: Yan, Jianhao, et al.
Published: (2025)
BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping
by: Xi, Zhiheng, et al.
Published: (2025)
InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization
by: Liu, Yuhang, et al.
Published: (2025)
Towards Off-Policy Reinforcement Learning for Ranking Policies with Human Feedback
by: Xiao, Teng, et al.
Published: (2024)
Breaking the Curse of Repulsion: Optimistic Distributionally Robust Policy Optimization for Off-Policy Generative Recommendation
by: Jiang, Jie, et al.
Published: (2026)
Maximum Entropy Reinforcement Learning with Diffusion Policy
by: Dong, Xiaoyi, et al.
Published: (2025)
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
by: Zhang, Yan, et al.
Published: (2026)
Diagnosing Live Within-Policy Instruction Conflicts in LLM Agents with Witnessed Resolution Profiles
by: Yan, Lu, et al.
Published: (2026)
Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization
by: Liu, Zeyuan, et al.
Published: (2026)
EAPO: Enhancing Policy Optimization with On-Demand Expert Assistance
by: Song, Siyao, et al.
Published: (2025)
Generalized Policy Gradient with History-Aware Decision Transformer for Reliable Routing over Graph Signals
by: Wei, Xing, et al.
Published: (2025)
Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
by: Zhao, Shiwan, et al.
Published: (2026)
Data Poisoning Attacks on Off-Policy Policy Evaluation Methods
by: Lobo, Elita, et al.
Published: (2024)
Off-Policy Correction For Multi-Agent Reinforcement Learning
by: Zawalski, Michał, et al.
Published: (2021)
From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement Learning
by: Deng, Zhirui, et al.
Published: (2024)
Building Autonomous GUI Navigation via Agentic-Q Estimation and Step-Wise Policy Optimization
by: Wang, Yibo, et al.
Published: (2026)
Zero-Shot Off-Policy Learning
by: Asadulaev, Arip, et al.
Published: (2026)
Style-Preserving Policy Optimization for Game Agents
by: Li, Lingfeng, et al.
Published: (2025)
Clustering Context in Off-Policy Evaluation
by: Guzman-Olivares, Daniel, et al.
Published: (2025)
Concept-driven Off Policy Evaluation
by: Majumdar, Ritam, et al.
Published: (2024)
Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training
by: Fakoor, Rasool, et al.
Published: (2026)
Kernel Metric Learning for In-Sample Off-Policy Evaluation of Deterministic RL Policies
by: Lee, Haanvid, et al.
Published: (2024)
Turning Sand to Gold: Recycling Data to Bridge On-Policy and Off-Policy Learning via Causal Bound
by: Fiskus, Tal, et al.
Published: (2025)
Policy of Thoughts: Scaling LLM Reasoning via Test-time Policy Evolution
by: Jiao, Zhengbo, et al.
Published: (2026)
Statistical Tractability of Off-policy Evaluation of History-dependent Policies in POMDPs
by: Zhang, Yuheng, et al.
Published: (2025)
Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance
by: Ren, Yanwei, et al.
Published: (2026)
Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL
by: Ye, Chenlu, et al.
Published: (2026)
Real-Time Diffusion Policies for Games: Enhancing Consistency Policies with Q-Ensembles
by: Zhang, Ruoqi, et al.
Published: (2025)
Reinforcing Language Agents via Policy Optimization with Action Decomposition
by: Wen, Muning, et al.
Published: (2024)
Automated Off-Policy Estimator Selection via Supervised Learning
by: Felicioni, Nicolò, et al.
Published: (2024)
Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs
by: Huang, Luke J., et al.
Published: (2026)
Milestone-Guided Policy Learning for Long-Horizon Language Agents
by: Wang, Zixuan, et al.
Published: (2026)
D2PPO: Diffusion Policy Policy Optimization with Dispersive Loss
by: Zou, Guowei, et al.
Published: (2025)
Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization
by: Zhang, Wenqi, et al.
Published: (2024)
Multi-Agent Guided Policy Optimization
by: Li, Yueheng, et al.
Published: (2025)
Selective Off-Policy Reference Tuning with Plan Guidance
by: Le, Duc Anh, et al.
Published: (2026)