| Main Authors: | Yu, Ping; Xu, Jing; Weston, Jason; Kulikov, Ilia |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2407.06023 |
Similar Items
J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning
by: Whitehouse, Chenxi, et al.
Published: (2025)
CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks
by: Yu, Ping, et al.
Published: (2025)
Self-Taught Evaluators
by: Wang, Tianlu, et al.
Published: (2024)
Self-Improving Pretraining: using post-trained models to pretrain better models
by: Tan, Ellen Xiaoqing, et al.
Published: (2026)
System-Level Natural Language Feedback
by: Yuan, Weizhe, et al.
Published: (2023)
Some things are more CRINGE than others: Iterative Preference Optimization with the Pairwise Cringe Loss
by: Xu, Jing, et al.
Published: (2023)
Following Length Constraints in Instructions
by: Yuan, Weizhe, et al.
Published: (2024)
R.I.P.: Better Models by Survival of the Fittest Prompts
by: Yu, Ping, et al.
Published: (2025)
Reasoning over mathematical objects: on-policy reward modeling and test time aggregation
by: Aggarwal, Pranjal, et al.
Published: (2026)
Adaptive Decoding via Latent Preference Optimization
by: Dhuliawala, Shehzaad, et al.
Published: (2024)
Diverse Preference Optimization
by: Lanchantin, Jack, et al.
Published: (2025)
Self-Rewarding Language Models
by: Yuan, Weizhe, et al.
Published: (2024)
Contextual Position Encoding: Learning to Count What's Important
by: Golovneva, Olga, et al.
Published: (2024)
Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense
by: Tao, Leitian, et al.
Published: (2025)
Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
by: Wu, Tianhao, et al.
Published: (2024)
Reverse Training to Nurse the Reversal Curse
by: Golovneva, Olga, et al.
Published: (2024)
Post-training an LLM for RAG? Train on Self-Generated Demonstrations
by: Finlayson, Matthew, et al.
Published: (2025)
Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge
by: Saha, Swarnadeep, et al.
Published: (2025)
Self-Challenging Language Model Agents
by: Zhou, Yifei, et al.
Published: (2025)
Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
by: Lupidi, Alisia, et al.
Published: (2024)
The Majority is not always right: RL training for solution aggregation
by: Zhao, Wenting, et al.
Published: (2025)
Thinking LLMs: General Instruction Following with Thought Generation
by: Wu, Tianhao, et al.
Published: (2024)
Self-Consistency Preference Optimization
by: Prasad, Archiki, et al.
Published: (2024)
Iterative Reasoning Preference Optimization
by: Pang, Richard Yuanzhe, et al.
Published: (2024)
The Era of Real-World Human Interaction: RL from User Conversations
by: Jin, Chuanyang, et al.
Published: (2025)
Step Rejection Fine-Tuning: A Practical Distillation Recipe
by: Slinko, Igor, et al.
Published: (2026)
StepWiser: Stepwise Generative Judges for Wiser Reasoning
by: Xiong, Wei, et al.
Published: (2025)
Multi-Stage Balanced Distillation: Addressing Long-Tail Challenges in Sequence-Level Knowledge Distillation
by: Zhou, Yuhang, et al.
Published: (2024)
Predict the Next Word: Humans exhibit uncertainty in this task and language models _____
by: Ilia, Evgenia, et al.
Published: (2024)
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation
by: Saha, Swarnadeep, et al.
Published: (2023)
Knowledge Distillation of LLM for Automatic Scoring of Science Education Assessments
by: Latif, Ehsan, et al.
Published: (2023)
Benchmarking Concept-Spilling Across Languages in LLMs
by: Badanin, Ilia, et al.
Published: (2026)
An Overview of Large Language Models for Statisticians
by: Ji, Wenlong, et al.
Published: (2025)
Distillation Scaling Laws
by: Busbridge, Dan, et al.
Published: (2025)
Better Alignment with Instruction Back-and-Forth Translation
by: Nguyen, Thao, et al.
Published: (2024)
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
by: Sukhbaatar, Sainbayar, et al.
Published: (2024)
NaturalThoughts: Selecting and Distilling Reasoning Traces for General Reasoning Tasks
by: Li, Yang, et al.
Published: (2025)
OptimalThinkingBench: Evaluating Over and Underthinking in LLMs
by: Aggarwal, Pranjal, et al.
Published: (2025)
EmoDistill: Offline Emotion Skill Distillation for Language Model Agents in Adversarial Negotiation
by: Long, Yunbo, et al.
Published: (2026)
S^2tory: Story Spine Distillation for Movie Script Summarization
by: Lu, Mingzhe, et al.
Published: (2026)