| Main Authors: | Lehnert, Lucas, Sukhbaatar, Sainbayar, Su, DiJia, Zheng, Qinqing, Mcvay, Paul, Rabbat, Michael, Tian, Yuandong |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2402.14083 |
Similar Items
Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces
by: Su, DiJia, et al.
Published: (2024)
Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning
by: Su, DiJia, et al.
Published: (2025)
Training Large Language Models to Reason in a Continuous Latent Space
by: Hao, Shibo, et al.
Published: (2024)
GaLore 2: Large-Scale LLM Pre-Training by Gradient Low-Rank Projection
by: Su, DiJia, et al.
Published: (2025)
Diffusion World Model: Future Modeling Beyond Step-by-Step Rollout for Offline Reinforcement Learning
by: Ding, Zihan, et al.
Published: (2024)
Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
by: Wu, Tianhao, et al.
Published: (2024)
R.I.P.: Better Models by Survival of the Fittest Prompts
by: Yu, Ping, et al.
Published: (2025)
Contextual Position Encoding: Learning to Count What's Important
by: Golovneva, Olga, et al.
Published: (2024)
Some things are more CRINGE than others: Iterative Preference Optimization with the Pairwise Cringe Loss
by: Xu, Jing, et al.
Published: (2023)
Reverse Training to Nurse the Reversal Curse
by: Golovneva, Olga, et al.
Published: (2024)
Towards General-Purpose Model-Free Reinforcement Learning
by: Fujimoto, Scott, et al.
Published: (2025)
Self-Challenging Language Model Agents
by: Zhou, Yifei, et al.
Published: (2025)
Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training
by: Liu, Yixin, et al.
Published: (2026)
SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks
by: Zhou, Yifei, et al.
Published: (2025)
Thinking LLMs: General Instruction Following with Thought Generation
by: Wu, Tianhao, et al.
Published: (2024)
Provable Scaling Laws of Feature Emergence from Learning Dynamics of Grokking
by: Tian, Yuandong
Published: (2025)
Iterative Reasoning Preference Optimization
by: Pang, Richard Yuanzhe, et al.
Published: (2024)
SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models
by: Wang, Chenyu, et al.
Published: (2025)
Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback
by: Lin, Yen-Ting, et al.
Published: (2025)
StepWiser: Stepwise Generative Judges for Wiser Reasoning
by: Xiong, Wei, et al.
Published: (2025)
Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking
by: Rashidinejad, Paria, et al.
Published: (2024)
Composing Global Solutions to Reasoning Tasks via Algebraic Objects in Neural Nets
by: Tian, Yuandong
Published: (2024)
Self-Rewarding Language Models
by: Yuan, Weizhe, et al.
Published: (2024)
JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention
by: Tian, Yuandong, et al.
Published: (2023)
Multi-Token Attention
by: Golovneva, Olga, et al.
Published: (2025)
The Path Not Taken: RLVR Provably Learns Off the Principals
by: Zhu, Hanqing, et al.
Published: (2025)
CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks
by: Yu, Ping, et al.
Published: (2025)
Smaller Abstract State Spaces Enable Cross-Scale Generalization in Reinforcement Learning
by: Mustakim, Nasehatul, et al.
Published: (2026)
Searching Large Neighborhoods for Integer Linear Programs with Contrastive Learning
by: Huang, Taoan, et al.
Published: (2023)
Dual RL: Unification and New Methods for Reinforcement and Imitation Learning
by: Sikchi, Harshit, et al.
Published: (2023)
LLM-Guided Semantic Bootstrapping for Interpretable Text Classification with Tsetlin Machines
by: Gao, Jiechao, et al.
Published: (2026)
Self-Consistency Preference Optimization
by: Prasad, Archiki, et al.
Published: (2024)
Stochastic activations
by: Lomeli, Maria, et al.
Published: (2025)
Bootstrapping Human-Like Planning via LLMs
by: Porfirio, David, et al.
Published: (2025)
Scalable Option Learning in High-Throughput Environments
by: Henaff, Mikael, et al.
Published: (2025)
SearchGym: Bootstrapping Real-World Search Agents via Cost-Effective and High-Fidelity Environment Simulation
by: Zhang, Xichen, et al.
Published: (2026)
Bootstrapping LLMs via Preference-Based Policy Optimization
by: Jia, Chen
Published: (2025)
Adaptive Decoding via Latent Preference Optimization
by: Dhuliawala, Shehzaad, et al.
Published: (2024)
Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback
by: Zheng, Qinqing, et al.
Published: (2024)
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
by: Sukhbaatar, Sainbayar, et al.
Published: (2024)