:: Library Catalog

Bibliographic Details
Main Authors:	Tao, Leitian; Kulikov, Ilia; Saha, Swarnadeep; Wang, Tianlu; Xu, Jing; Li, Sharon; Weston, Jason E; Yu, Ping
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2510.07242

Similar Items

J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning
by: Whitehouse, Chenxi, et al.
Published: (2025)

Jointly Reinforcing Diversity and Quality in Language Model Generations
by: Li, Tianjian, et al.
Published: (2025)

OptimalThinkingBench: Evaluating Over and Underthinking in LLMs
by: Aggarwal, Pranjal, et al.
Published: (2025)

Bridging Offline and Online Reinforcement Learning for LLMs
by: Lanchantin, Jack, et al.
Published: (2025)

The Majority is not always right: RL training for solution aggregation
by: Zhao, Wenting, et al.
Published: (2025)

Distilling System 2 into System 1
by: Yu, Ping, et al.
Published: (2024)

Limited Preference Data? Learning Better Reward Model with Latent Space Synthesis
by: Tao, Leitian, et al.
Published: (2025)

Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge
by: Saha, Swarnadeep, et al.
Published: (2025)

Branch-Solve-Merge Improves Large Language Model Evaluation and Generation
by: Saha, Swarnadeep, et al.
Published: (2023)

The Era of Real-World Human Interaction: RL from User Conversations
by: Jin, Chuanyang, et al.
Published: (2025)

R.I.P.: Better Models by Survival of the Fittest Prompts
by: Yu, Ping, et al.
Published: (2025)

CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks
by: Yu, Ping, et al.
Published: (2025)

Self-Improving Pretraining: using post-trained models to pretrain better models
by: Tan, Ellen Xiaoqing, et al.
Published: (2026)

LLM Pretraining with Continuous Concepts
by: Tack, Jihoon, et al.
Published: (2025)

Following Length Constraints in Instructions
by: Yuan, Weizhe, et al.
Published: (2024)

Sparse and Dense Retrievers Learn Better Together: Joint Sparse-Dense Optimization for Text-Image Retrieval
by: Song, Jonghyun, et al.
Published: (2025)

ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs
by: Chen, Justin Chih-Yao, et al.
Published: (2023)

Better Alignment with Instruction Back-and-Forth Translation
by: Nguyen, Thao, et al.
Published: (2024)

Adaptive Decoding via Latent Preference Optimization
by: Dhuliawala, Shehzaad, et al.
Published: (2024)

Diverse Preference Optimization
by: Lanchantin, Jack, et al.
Published: (2025)

Self-Taught Evaluators
by: Wang, Tianlu, et al.
Published: (2024)

Two Minds Better Than One: Collaborative Reward Modeling for LLM Alignment
by: Zhang, Jiazheng, et al.
Published: (2025)

SqueezeLLM: Dense-and-Sparse Quantization
by: Kim, Sehoon, et al.
Published: (2023)

When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards
by: Wang, Li, et al.
Published: (2026)

Beyond Imitation: Recovering Dense Rewards from Demonstrations
by: Li, Jiangnan, et al.
Published: (2025)

Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training
by: Xu, Ran, et al.
Published: (2026)

Text2Reward: Reward Shaping with Language Models for Reinforcement Learning
by: Xie, Tianbao, et al.
Published: (2023)

System-1.x: Learning to Balance Fast and Slow Planning with Language Models
by: Saha, Swarnadeep, et al.
Published: (2024)

Post-training an LLM for RAG? Train on Self-Generated Demonstrations
by: Finlayson, Matthew, et al.
Published: (2025)

From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
by: Welleck, Sean, et al.
Published: (2024)

ProgRM: Build Better GUI Agents with Progress Rewards
by: Zhang, Danyang, et al.
Published: (2025)

DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
by: Song, Chenyang, et al.
Published: (2026)

RESTRAIN: From Spurious Votes to Signals -- Self-Driven RL with Self-Penalization
by: Yu, Zhaoning, et al.
Published: (2025)

Reinforcement Learning with Conditional Expectation Reward
by: Xiao, Changyi, et al.
Published: (2026)

Process Reinforcement through Implicit Rewards
by: Cui, Ganqu, et al.
Published: (2025)

Sentence-level Reward Model can Generalize Better for Aligning LLM from Human Preference
by: Qiu, Wenjie, et al.
Published: (2025)

Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism
by: Li, Guanchen, et al.
Published: (2024)

Multi-Token Attention
by: Golovneva, Olga, et al.
Published: (2025)

Dynamic Reward Adjustment in Multi-Reward Reinforcement Learning for Counselor Reflection Generation
by: Min, Do June, et al.
Published: (2024)

Self-Consistency Preference Optimization
by: Prasad, Archiki, et al.
Published: (2024)