| Main Authors: | Tao, Leitian; Kulikov, Ilia; Saha, Swarnadeep; Wang, Tianlu; Xu, Jing; Li, Sharon; Weston, Jason E; Yu, Ping |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2510.07242 |
Similar Items
J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning
by: Whitehouse, Chenxi, et al.
Published: (2025)
Jointly Reinforcing Diversity and Quality in Language Model Generations
by: Li, Tianjian, et al.
Published: (2025)
OptimalThinkingBench: Evaluating Over and Underthinking in LLMs
by: Aggarwal, Pranjal, et al.
Published: (2025)
Bridging Offline and Online Reinforcement Learning for LLMs
by: Lanchantin, Jack, et al.
Published: (2025)
The Majority is not always right: RL training for solution aggregation
by: Zhao, Wenting, et al.
Published: (2025)
Distilling System 2 into System 1
by: Yu, Ping, et al.
Published: (2024)
Limited Preference Data? Learning Better Reward Model with Latent Space Synthesis
by: Tao, Leitian, et al.
Published: (2025)
Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge
by: Saha, Swarnadeep, et al.
Published: (2025)
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation
by: Saha, Swarnadeep, et al.
Published: (2023)
The Era of Real-World Human Interaction: RL from User Conversations
by: Jin, Chuanyang, et al.
Published: (2025)
R.I.P.: Better Models by Survival of the Fittest Prompts
by: Yu, Ping, et al.
Published: (2025)
CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks
by: Yu, Ping, et al.
Published: (2025)
Self-Improving Pretraining: using post-trained models to pretrain better models
by: Tan, Ellen Xiaoqing, et al.
Published: (2026)
LLM Pretraining with Continuous Concepts
by: Tack, Jihoon, et al.
Published: (2025)
Following Length Constraints in Instructions
by: Yuan, Weizhe, et al.
Published: (2024)
Sparse and Dense Retrievers Learn Better Together: Joint Sparse-Dense Optimization for Text-Image Retrieval
by: Song, Jonghyun, et al.
Published: (2025)
ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs
by: Chen, Justin Chih-Yao, et al.
Published: (2023)
Better Alignment with Instruction Back-and-Forth Translation
by: Nguyen, Thao, et al.
Published: (2024)
Adaptive Decoding via Latent Preference Optimization
by: Dhuliawala, Shehzaad, et al.
Published: (2024)
Diverse Preference Optimization
by: Lanchantin, Jack, et al.
Published: (2025)
Self-Taught Evaluators
by: Wang, Tianlu, et al.
Published: (2024)
Two Minds Better Than One: Collaborative Reward Modeling for LLM Alignment
by: Zhang, Jiazheng, et al.
Published: (2025)
SqueezeLLM: Dense-and-Sparse Quantization
by: Kim, Sehoon, et al.
Published: (2023)
When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards
by: Wang, Li, et al.
Published: (2026)
Beyond Imitation: Recovering Dense Rewards from Demonstrations
by: Li, Jiangnan, et al.
Published: (2025)
Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training
by: Xu, Ran, et al.
Published: (2026)
Text2Reward: Reward Shaping with Language Models for Reinforcement Learning
by: Xie, Tianbao, et al.
Published: (2023)
System-1.x: Learning to Balance Fast and Slow Planning with Language Models
by: Saha, Swarnadeep, et al.
Published: (2024)
Post-training an LLM for RAG? Train on Self-Generated Demonstrations
by: Finlayson, Matthew, et al.
Published: (2025)
From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
by: Welleck, Sean, et al.
Published: (2024)
ProgRM: Build Better GUI Agents with Progress Rewards
by: Zhang, Danyang, et al.
Published: (2025)
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
by: Song, Chenyang, et al.
Published: (2026)
RESTRAIN: From Spurious Votes to Signals -- Self-Driven RL with Self-Penalization
by: Yu, Zhaoning, et al.
Published: (2025)
Reinforcement Learning with Conditional Expectation Reward
by: Xiao, Changyi, et al.
Published: (2026)
Process Reinforcement through Implicit Rewards
by: Cui, Ganqu, et al.
Published: (2025)
Sentence-level Reward Model can Generalize Better for Aligning LLM from Human Preference
by: Qiu, Wenjie, et al.
Published: (2025)
Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism
by: Li, Guanchen, et al.
Published: (2024)
Multi-Token Attention
by: Golovneva, Olga, et al.
Published: (2025)
Dynamic Reward Adjustment in Multi-Reward Reinforcement Learning for Counselor Reflection Generation
by: Min, Do June, et al.
Published: (2024)
Self-Consistency Preference Optimization
by: Prasad, Archiki, et al.
Published: (2024)