:: Library Catalog

Bibliographic Details
Main Authors:	Zhou, Runlong; Du, Simon S.; Li, Beibin
Format:	Preprint
Published:	2024
Subjects:	Machine Learning; Computation and Language
Online Access:	https://arxiv.org/abs/2402.12621

Similar Items

The Crucial Role of Samplers in Online Direct Preference Optimization
by: Shi, Ruizhe, et al.
Published: (2024)

DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails
by: Deng, Yihe, et al.
Published: (2025)

Offline RL by Reward-Weighted Fine-Tuning for Conversation Optimization
by: Mukherjee, Subhojyoti, et al.
Published: (2025)

REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Reasoning
by: Deng, Hexuan, et al.
Published: (2025)

Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models
by: Li, Pengyi, et al.
Published: (2025)

Generative Model for Small Molecules with Latent Space RL Fine-Tuning to Protein Targets
by: Sob, Ulrich A. Mbou, et al.
Published: (2024)

Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning
by: Vassoyan, Jean, et al.
Published: (2025)

Learn Hard Problems During RL with Reference Guided Fine-tuning
by: Wu, Yangzhen, et al.
Published: (2026)

Alchemist: Towards the Design of Efficient Online Continual Learning System
by: Huang, Yuyang, et al.
Published: (2025)

Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning
by: Zhang, Shenao, et al.
Published: (2025)

Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO
by: Shi, Ruizhe, et al.
Published: (2025)

CASCADE Your Datasets for Cross-Mode Knowledge Retrieval of Language Models
by: Zhou, Runlong, et al.
Published: (2025)

RL Grokking Recipe: How Does RL Unlock and Transfer New Algorithms in LLMs?
by: Sun, Yiyou, et al.
Published: (2025)

On-Policy RL with Optimal Reward Baseline
by: Hao, Yaru, et al.
Published: (2025)

MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment
by: Shi, Yucheng, et al.
Published: (2025)

GLIDE-RL: Grounded Language Instruction through DEmonstration in RL
by: Kharyal, Chaitanya, et al.
Published: (2024)

Hard Prompts Made Interpretable: Sparse Entropy Regularization for Prompt Tuning with RL
by: Choi, Yunseon, et al.
Published: (2024)

ContextRL: Enhancing MLLM's Knowledge Discovery Efficiency with Context-Augmented RL
by: Lu, Xingyu, et al.
Published: (2026)

Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone
by: Mark, Max Sobol, et al.
Published: (2024)

Diagnosing and Mitigating System Bias in Self-Rewarding RL
by: Tan, Chuyi, et al.
Published: (2025)

Blockwise Advantage Estimation for Multi-Objective RL with Verifiable Rewards
by: Pavlenko, Kirill, et al.
Published: (2026)

LIMR: Less is More for RL Scaling
by: Li, Xuefeng, et al.
Published: (2025)

UFO-RL: Uncertainty-Focused Optimization for Efficient Reinforcement Learning Data Selection
by: Zhao, Yang, et al.
Published: (2025)

Large Language Models as Agents in Two-Player Games
by: Liu, Yang, et al.
Published: (2024)

Compositional preference models for aligning LMs
by: Go, Dongyoung, et al.
Published: (2023)

ReflectDiffu: Reflect between Emotion-intent Contagion and Mimicry for Empathetic Response Generation via a RL-Diffusion Framework
by: Yuan, Jiahao, et al.
Published: (2024)

DISA: Offline Importance Sampling for Distribution-Matching LLM-RL
by: Wang, Shaobo, et al.
Published: (2026)

ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding
by: Li, Yuhang, et al.
Published: (2025)

TreeRL: LLM Reinforcement Learning with On-Policy Tree Search
by: Hou, Zhenyu, et al.
Published: (2025)

Scaling In-Context Online Learning Capability of LLMs via Cross-Episode Meta-RL
by: Lin, Xiaofeng, et al.
Published: (2026)

Small Language Models for Application Interactions: A Case Study
by: Li, Beibin, et al.
Published: (2024)

Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models
by: Liu, Runze, et al.
Published: (2025)

Endless Terminals: Scaling RL Environments for Terminal Agents
by: Gandhi, Kanishk, et al.
Published: (2026)

Enabling Approximate Joint Sampling in Diffusion LMs
by: Bansal, Parikshit, et al.
Published: (2025)

SPA-RL: Reinforcing LLM Agents via Stepwise Progress Attribution
by: Wang, Hanlin, et al.
Published: (2025)

Prioritized Replay for RL Post-training
by: Fatemi, Mehdi
Published: (2026)

Chart-RL: Generalized Chart Comprehension via Reinforcement Learning with Verifiable Rewards
by: Zhang, Xin, et al.
Published: (2026)

When Sharpening Becomes Collapse: Sampling Bias and Semantic Coupling in RL with Verifiable Rewards
by: Fan, Mingyuan, et al.
Published: (2026)

Internalizing World Models via Self-Play Finetuning for Agentic RL
by: Chen, Shiqi, et al.
Published: (2025)

FlowRL: Matching Reward Distributions for LLM Reasoning
by: Zhu, Xuekai, et al.
Published: (2025)