| Main Authors: | Zhang, Shijie, Guo, Xiang, Guo, Rujun, Liu, Shaoyu, Wang, Xiaozhao, Jiang, Guanjun, Zhang, Kevin |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.10006 |
Similar Items
ETR: Outcome-Guided Elastic Trust Regions for Policy Optimization
by: Zhang, Shijie, et al.
Published: (2026)
From Item-Only to Query-Item: Query-Conditioned Generative Search with QGS in Quark
by: Song, Yanglong, et al.
Published: (2026)
CLPO: Curriculum Learning meets Policy Optimization for LLM Reasoning
by: Zhang, Shijie, et al.
Published: (2025)
PNAct: Crafting Backdoor Attacks in Safe Reinforcement Learning
by: Guo, Weiran, et al.
Published: (2025)
CRAFT: Calibrated Reasoning with Answer-Faithful Traces via Reinforcement Learning for Multi-Hop Question Answering
by: Liu, Yu, et al.
Published: (2026)
Harnessing Reasoning Trajectories for Hallucination Detection via Answer-agreement Representation Shaping
by: Zhang, Jianxiong, et al.
Published: (2026)
PEARL: Training Socratic Tutors with Pedagogically Aligned Reinforcement Learning
by: Chang, Qikai, et al.
Published: (2026)
Historically Relevant Event Structuring for Temporal Knowledge Graph Reasoning
by: Zhang, Jinchuan, et al.
Published: (2024)
SOREL: A Stochastic Algorithm for Spectral Risks Minimization
by: Ge, Yuze, et al.
Published: (2024)
Search Self-play: Pushing the Frontier of Agent Capability without Supervision
by: Lu, Hongliang, et al.
Published: (2025)
KL-Regularized Reinforcement Learning is Designed to Mode Collapse
by: GX-Chen, Anthony, et al.
Published: (2025)
DreamerAD: Efficient Reinforcement Learning via Latent World Model for Autonomous Driving
by: Yang, Pengxuan, et al.
Published: (2026)
Class-Balanced and Reinforced Active Learning on Graphs
by: Yu, Chengcheng, et al.
Published: (2024)
Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
by: Deng, Yihe, et al.
Published: (2025)
A First Guess is Rarely the Final Answer: Learning to Search in the Traveling Salesperson Problem
by: Garmendia, Andoni Irazusta
Published: (2026)
TimeMaster: Training Time-Series Multimodal LLMs to Reason via Reinforcement Learning
by: Zhang, Junru, et al.
Published: (2025)
Balanced Multimodal Learning: An Unidirectional Dynamic Interaction Perspective
by: Wang, Shijie, et al.
Published: (2025)
Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective
by: Zhang, Feng, et al.
Published: (2026)
AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning
by: Zhang, Yi, et al.
Published: (2025)
Energy-Balanced Hyperspherical Graph Representation Learning via Structural Binding and Entropic Dispersion
by: Chen, Rui, et al.
Published: (2025)
Bridging Dynamics Gaps via Diffusion Schrödinger Bridge for Cross-Domain Reinforcement Learning
by: Zhang, Hanping, et al.
Published: (2026)
MARTI-MARS$^2$: Scaling Multi-Agent Self-Search via Reinforcement Learning for Code Generation
by: Wang, Shijie, et al.
Published: (2026)
BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping
by: Xi, Zhiheng, et al.
Published: (2025)
Content Moderation in TV Search: Balancing Policy Compliance, Relevance, and User Experience
by: Hande, Adeep, et al.
Published: (2025)
Scalable Reinforcement Learning-based Neural Architecture Search
by: Cassimon, Amber, et al.
Published: (2024)
Balancing the Reasoning Load: Difficulty-Differentiated Policy Optimization with Length Redistribution for Efficient and Robust Reinforcement Learning
by: Xia, Yinan, et al.
Published: (2026)
Towards Trustworthy Multimodal Moderation via Policy-Aligned Reasoning and Hierarchical Labeling
by: Li, Anqi, et al.
Published: (2025)
Skill-based Safe Reinforcement Learning with Risk Planning
by: Zhang, Hanping, et al.
Published: (2025)
Reinforcement Learning-Guided Semi-Supervised Learning
by: Heidari, Marzi, et al.
Published: (2024)
Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification
by: Guo, Yiju, et al.
Published: (2026)
Re3: Learning to Balance Relevance & Recency for Temporal Information Retrieval
by: Cao, Jiawei, et al.
Published: (2025)
Refining Hybrid Genetic Search for CVRP via Reinforcement Learning-Finetuned LLM
by: Zhu, Rongjie, et al.
Published: (2025)
TTVS: Boosting Self-Exploring Reinforcement Learning via Test-time Variational Synthesis
by: Bai, Sikai, et al.
Published: (2026)
DRARL: Disengagement-Reason-Augmented Reinforcement Learning for Efficient Improvement of Autonomous Driving Policy
by: Zhou, Weitao, et al.
Published: (2025)
LEAF: Language-EEG Aligned Foundation Model for Brain-Computer Interfaces
by: Jiang, Muyun, et al.
Published: (2025)
Aligning AI-driven discovery with human intuition
by: Zhang, Kevin, et al.
Published: (2024)
Skill-Enhanced Reinforcement Learning Acceleration from Heterogeneous Demonstrations
by: Zhang, Hanping, et al.
Published: (2024)
Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization
by: Zhang, Qingyang, et al.
Published: (2025)
Learning to Clean: Reinforcement Learning for Noisy Label Correction
by: Heidari, Marzi, et al.
Published: (2025)
LLM-Driven Policy Diffusion: Enhancing Generalization in Offline Reinforcement Learning
by: Zhang, Hanping, et al.
Published: (2025)