| Main authors: | Xi, Zhiheng; Liao, Chenyang; Li, Guanyu; Yang, Yajie; Chen, Wenxiang; Zhang, Zhihao; Wang, Binghai; Jin, Senjie; Zhou, Yuhao; Guan, Jian; Wu, Wei; Ji, Tao; Gui, Tao; Zhang, Qi; Huang, Xuanjing |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online access: | https://arxiv.org/abs/2511.08325 |
Similar items
Self-Polish: Enhance Reasoning in Large Language Models via Problem Refinement
by: Xi, Zhiheng, et al.
Published: (2023)
Better Process Supervision with Bi-directional Rewarding Signals
by: Chen, Wenxiang, et al.
Published: (2025)
AgentV-RL: Scaling Reward Modeling with Agentic Verifier
by: Zhang, Jiazheng, et al.
Published: (2026)
Subspace Defense: Discarding Adversarial Perturbations by Learning a Subspace for Clean Signals
by: Zheng, Rui, et al.
Published: (2024)
EliteKV: Scalable KV Cache Compression via RoPE Frequency Selection and Joint Low-Rank Projection
by: Zhou, Yuhao, et al.
Published: (2025)
Enhancing LLM-based Search Agents via Contribution Weighted Group Relative Policy Optimization
by: Wang, Junzhe, et al.
Published: (2026)
SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents
by: Shen, Yujiong, et al.
Published: (2026)
Why Reinforcement Fine-Tuning Enables MLLMs Preserve Prior Knowledge Better: A Data Perspective
by: Zhang, Zhihao, et al.
Published: (2025)
Parrot: A Training Pipeline Enhances Both Program CoT and Natural Language CoT for Reasoning
by: Jin, Senjie, et al.
Published: (2025)
AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning
by: Xi, Zhiheng, et al.
Published: (2025)
MM-Doc-R1: Training Agents for Long Document Visual Question Answering through Multi-turn Reinforcement Learning
by: Lin, Jiahang, et al.
Published: (2026)
AgentGym: Evolving Large Language Model-based Agents across Diverse Environments
by: Xi, Zhiheng, et al.
Published: (2024)
RMB: Comprehensively Benchmarking Reward Models in LLM Alignment
by: Zhou, Enyu, et al.
Published: (2024)
Can RL Improve Generalization of LLM Agents? An Empirical Study
by: Xi, Zhiheng, et al.
Published: (2026)
FunPRM: Function-as-Step Process Reward Model with Meta Reward Correction for Code Generation
by: Zhang, Ruiyi, et al.
Published: (2026)
RoCoIns: Enhancing Robustness of Large Language Models through Code-Style Instructions
by: Zhang, Yuansen, et al.
Published: (2024)
Unveiling Linguistic Regions in Large Language Models
by: Zhang, Zhihao, et al.
Published: (2024)
Self-Consistency of the Internal Reward Models Improves Self-Rewarding Language Models
by: Zhou, Xin, et al.
Published: (2025)
Secrets of RLHF in Large Language Models Part II: Reward Modeling
by: Wang, Binghai, et al.
Published: (2024)
GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning
by: Zhang, Yao, et al.
Published: (2025)
DreamPRM-Code: Function-as-Step Process Reward Model with Label Correction for LLM Coding
by: Zhang, Ruiyi, et al.
Published: (2025)
FreePRM: Training Process Reward Models Without Ground Truth Process Labels
by: Sun, Lin, et al.
Published: (2025)
From Scores to Preferences: Redefining MOS Benchmarking for Speech Quality Reward Modeling
by: Cao, Yifei, et al.
Published: (2025)
JFTA-Bench: Evaluate LLM's Ability of Tracking and Analyzing Malfunctions Using Fault Trees
by: Wang, Yuhui, et al.
Published: (2026)
Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models
by: Wang, Binghai, et al.
Published: (2026)
LLaMA Beyond English: An Empirical Study on Language Capability Transfer
by: Zhao, Jun, et al.
Published: (2024)
Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses
by: Lin, Jiahang, et al.
Published: (2026)
StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback
by: Dou, Shihan, et al.
Published: (2024)
The Role of Entropy in Visual Grounding: Analysis and Optimization
by: Li, Shuo, et al.
Published: (2025)
StepMathAgent: A Step-Wise Agent for Evaluating Mathematical Processes through Tree-of-Error
by: Yang, Shu-Xun, et al.
Published: (2025)
Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning
by: Xi, Zhiheng, et al.
Published: (2024)
Unlocking the Essence of Beauty: Advanced Aesthetic Reasoning with Relative-Absolute Policy Optimization
by: Liu, Boyang, et al.
Published: (2025)
TRAD: Enhancing LLM Agents with Step-Wise Thought Retrieval and Aligned Decision
by: Zhou, Ruiwen, et al.
Published: (2024)
LongAgent: Scaling Language Models to 128k Context through Multi-Agent Collaboration
by: Zhao, Jun, et al.
Published: (2024)
DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning
by: Cao, Qi, et al.
Published: (2025)
Steering LLMs via Scalable Interactive Oversight
by: Zhou, Enyu, et al.
Published: (2026)
Effective Length Extrapolation via Dimension-Wise Positional Embeddings Manipulation
by: Lu, Yi, et al.
Published: (2025)
UniPaint: Unified Space-time Video Inpainting via Mixture-of-Experts
by: Wan, Zhen, et al.
Published: (2024)
R-PRM: Reasoning-Driven Process Reward Modeling
by: She, Shuaijie, et al.
Published: (2025)
Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing
by: Guo, Xin, et al.
Published: (2025)