| Main Authors: | Zhou, Runlong, Du, Simon S., Li, Beibin |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2402.12621 |
Similar Items
The Crucial Role of Samplers in Online Direct Preference Optimization
by: Shi, Ruizhe, et al.
Published: (2024)
DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails
by: Deng, Yihe, et al.
Published: (2025)
Offline RL by Reward-Weighted Fine-Tuning for Conversation Optimization
by: Mukherjee, Subhojyoti, et al.
Published: (2025)
REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Reasoning
by: Deng, Hexuan, et al.
Published: (2025)
Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models
by: Li, Pengyi, et al.
Published: (2025)
Generative Model for Small Molecules with Latent Space RL Fine-Tuning to Protein Targets
by: Sob, Ulrich A. Mbou, et al.
Published: (2024)
Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning
by: Vassoyan, Jean, et al.
Published: (2025)
Learn Hard Problems During RL with Reference Guided Fine-tuning
by: Wu, Yangzhen, et al.
Published: (2026)
Alchemist: Towards the Design of Efficient Online Continual Learning System
by: Huang, Yuyang, et al.
Published: (2025)
Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning
by: Zhang, Shenao, et al.
Published: (2025)
Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO
by: Shi, Ruizhe, et al.
Published: (2025)
CASCADE Your Datasets for Cross-Mode Knowledge Retrieval of Language Models
by: Zhou, Runlong, et al.
Published: (2025)
RL Grokking Recipe: How Does RL Unlock and Transfer New Algorithms in LLMs?
by: Sun, Yiyou, et al.
Published: (2025)
On-Policy RL with Optimal Reward Baseline
by: Hao, Yaru, et al.
Published: (2025)
MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment
by: Shi, Yucheng, et al.
Published: (2025)
GLIDE-RL: Grounded Language Instruction through DEmonstration in RL
by: Kharyal, Chaitanya, et al.
Published: (2024)
Hard Prompts Made Interpretable: Sparse Entropy Regularization for Prompt Tuning with RL
by: Choi, Yunseon, et al.
Published: (2024)
ContextRL: Enhancing MLLM's Knowledge Discovery Efficiency with Context-Augmented RL
by: Lu, Xingyu, et al.
Published: (2026)
Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone
by: Mark, Max Sobol, et al.
Published: (2024)
Diagnosing and Mitigating System Bias in Self-Rewarding RL
by: Tan, Chuyi, et al.
Published: (2025)
Blockwise Advantage Estimation for Multi-Objective RL with Verifiable Rewards
by: Pavlenko, Kirill, et al.
Published: (2026)
LIMR: Less is More for RL Scaling
by: Li, Xuefeng, et al.
Published: (2025)
UFO-RL: Uncertainty-Focused Optimization for Efficient Reinforcement Learning Data Selection
by: Zhao, Yang, et al.
Published: (2025)
Large Language Models as Agents in Two-Player Games
by: Liu, Yang, et al.
Published: (2024)
Compositional preference models for aligning LMs
by: Go, Dongyoung, et al.
Published: (2023)
ReflectDiffu: Reflect between Emotion-intent Contagion and Mimicry for Empathetic Response Generation via a RL-Diffusion Framework
by: Yuan, Jiahao, et al.
Published: (2024)
DISA: Offline Importance Sampling for Distribution-Matching LLM-RL
by: Wang, Shaobo, et al.
Published: (2026)
ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding
by: Li, Yuhang, et al.
Published: (2025)
TreeRL: LLM Reinforcement Learning with On-Policy Tree Search
by: Hou, Zhenyu, et al.
Published: (2025)
Scaling In-Context Online Learning Capability of LLMs via Cross-Episode Meta-RL
by: Lin, Xiaofeng, et al.
Published: (2026)
Small Language Models for Application Interactions: A Case Study
by: Li, Beibin, et al.
Published: (2024)
Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models
by: Liu, Runze, et al.
Published: (2025)
Endless Terminals: Scaling RL Environments for Terminal Agents
by: Gandhi, Kanishk, et al.
Published: (2026)
Enabling Approximate Joint Sampling in Diffusion LMs
by: Bansal, Parikshit, et al.
Published: (2025)
SPA-RL: Reinforcing LLM Agents via Stepwise Progress Attribution
by: Wang, Hanlin, et al.
Published: (2025)
Prioritized Replay for RL Post-training
by: Fatemi, Mehdi
Published: (2026)
Chart-RL: Generalized Chart Comprehension via Reinforcement Learning with Verifiable Rewards
by: Zhang, Xin, et al.
Published: (2026)
When Sharpening Becomes Collapse: Sampling Bias and Semantic Coupling in RL with Verifiable Rewards
by: Fan, Mingyuan, et al.
Published: (2026)
Internalizing World Models via Self-Play Finetuning for Agentic RL
by: Chen, Shiqi, et al.
Published: (2025)
FlowRL: Matching Reward Distributions for LLM Reasoning
by: Zhu, Xuekai, et al.
Published: (2025)