| Main Authors: | Liu, Xiaoqian, Wang, Ke, Li, Yongbin, Wu, Yuchuan, Ma, Wentao, Kong, Aobo, Huang, Fei, Jiao, Jianbin, Zhang, Junge |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.12486 |
Similar Items
Agentic Reinforcement Learning with Implicit Step Rewards
by: Liu, Xiaoqian, et al.
Published: (2025)
SDPO: Segment-Level Direct Preference Optimization for Social Agents
by: Kong, Aobo, et al.
Published: (2025)
CPO: Addressing Reward Ambiguity in Role-playing Dialogue via Comparative Policy Optimization
by: Ye, Xinge, et al.
Published: (2025)
FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents
by: Xiao, Ruixuan, et al.
Published: (2024)
Enhancing the General Agent Capabilities of Low-Parameter LLMs through Tuning and Multi-Branch Reasoning
by: Zhou, Qinhao, et al.
Published: (2024)
EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning
by: Xu, Wujiang, et al.
Published: (2025)
MOA: Multi-Objective Alignment for Role-Playing Agents
by: Liao, Chonghua, et al.
Published: (2025)
Self-supervised Pretraining for Decision Foundation Model: Formulation, Pipeline and Challenges
by: Liu, Xiaoqian, et al.
Published: (2023)
TimeHC-RL: Temporal-aware Hierarchical Cognitive Reinforcement Learning for Enhancing LLMs' Social Intelligence
by: Hou, Guiyang, et al.
Published: (2025)
Reverse Preference Optimization for Complex Instruction Following
by: Huang, Xiang, et al.
Published: (2025)
Fine-Tuning Language Models with Reward Learning on Policy
by: Lang, Hao, et al.
Published: (2024)
Position: Foundation Agents as the Paradigm Shift for Decision Making
by: Liu, Xiaoqian, et al.
Published: (2024)
Improving Factual Consistency of News Summarization by Contrastive Preference Optimization
by: Feng, Huawen, et al.
Published: (2023)
IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization
by: Zhang, Xinghua, et al.
Published: (2024)
Adaptive Social Learning via Mode Policy Optimization for Language Agents
by: Wang, Minzheng, et al.
Published: (2025)
SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents
by: Si, Shuzheng, et al.
Published: (2023)
RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization
by: Dong, Yihong, et al.
Published: (2025)
A Simple "Motivation" Can Enhance Reinforcement Finetuning of Large Reasoning Models
by: Zhang, Junjie, et al.
Published: (2025)
Masked Thought: Simply Masking Partial Reasoning Steps Can Improve Mathematical Reasoning Learning of Language Models
by: Chen, Changyu, et al.
Published: (2024)
Format-Adapter: Improving Reasoning Capability of LLMs by Adapting Suitable Format
by: Wang, Dingzirui, et al.
Published: (2025)
BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping
by: Xi, Zhiheng, et al.
Published: (2025)
Persistent Backdoor Attacks under Continual Fine-Tuning of LLMs
by: Cui, Jing, et al.
Published: (2025)
Calibration-Aware Policy Optimization for Reasoning LLMs
by: Wang, Ziqi, et al.
Published: (2026)
Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning
by: Zhao, Zhengyang, et al.
Published: (2026)
Training LLMs for EHR-Based Reasoning Tasks via Reinforcement Learning
by: Lin, Jiacheng, et al.
Published: (2025)
Fortify the Shortest Stave in Attention: Enhancing Context Awareness of Large Language Models for Effective Tool Use
by: Chen, Yuhan, et al.
Published: (2023)
HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs
by: Deng, Ken, et al.
Published: (2025)
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
by: Feng, Jiazhan, et al.
Published: (2025)
StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization
by: Li, Zhuoqun, et al.
Published: (2024)
Med-R$^3$: Enhancing Medical Retrieval-Augmented Reasoning of LLMs via Progressive Reinforcement Learning
by: Lu, Keer, et al.
Published: (2025)
Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration
by: Deng, Wenhao, et al.
Published: (2025)
RLKD: Distilling LLMs' Reasoning via Reinforcement Learning
by: Xu, Shicheng, et al.
Published: (2025)
Mind the Gap: Data Rewriting for Stable Off-Policy Supervised Fine-Tuning
by: Zhao, Shiwan, et al.
Published: (2025)
Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs
by: Yang, Wanli, et al.
Published: (2026)
P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling
by: Zhang, Pinyi, et al.
Published: (2026)
The Imperative of Conversation Analysis in the Era of LLMs: A Survey of Tasks, Techniques, and Trends
by: Zhang, Xinghua, et al.
Published: (2024)
Debate Helps Weak-to-Strong Generalization
by: Lang, Hao, et al.
Published: (2025)
Selective Weak-to-Strong Generalization
by: Lang, Hao, et al.
Published: (2025)
Safe Reinforcement Learning with Free-form Natural Language Constraints and Pre-Trained Language Models
by: Lou, Xingzhou, et al.
Published: (2024)
MAPEX: A Multi-Agent Pipeline for Keyphrase Extraction
by: Zhang, Liting, et al.
Published: (2025)