| Main Authors: | Chen, Zhuotong, Liu, Fang, Zhu, Jennifer, Du, Wanyu, Qi, Yanjun |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2411.05875 |
Similar Items
Optimizing Anytime Reasoning via Budget Relative Policy Optimization
by: Qi, Penghui, et al.
Published: (2025)
Preference Optimization via Contrastive Divergence: Your Reward Model is Secretly an NLL Estimator
by: Chen, Zhuotong, et al.
Published: (2025)
Dynamic Model Merging Made Slim
by: Du, Guodong, et al.
Published: (2026)
Less is More for Improving Automatic Evaluation of Factual Consistency
by: Wang, Tong, et al.
Published: (2024)
T-REG: Preference Optimization with Token-Level Reward Regularization
by: Zhou, Wenxuan, et al.
Published: (2024)
Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level
by: Liu, Jie, et al.
Published: (2024)
TSO: Self-Training with Scaled Preference Optimization
by: Chen, Kaihui, et al.
Published: (2024)
Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
by: Rosset, Corby, et al.
Published: (2024)
Entropy Controllable Direct Preference Optimization
by: Omura, Motoki, et al.
Published: (2024)
TaeBench: Improving Quality of Toxic Adversarial Examples
by: Zhu, Xuan, et al.
Published: (2024)
Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization
by: Yang, Junming, et al.
Published: (2025)
Token-Budget-Aware LLM Reasoning
by: Han, Tingxu, et al.
Published: (2024)
Preference Optimization by Estimating the Ratio of the Data Distribution
by: Kim, Yeongmin, et al.
Published: (2025)
APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay
by: Prabhakar, Akshara, et al.
Published: (2025)
Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization
by: Wu, Junkang, et al.
Published: (2024)
Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization
by: Huang, Audrey, et al.
Published: (2024)
On the Role of Preference Variance in Preference Optimization
by: Guo, Jiacheng, et al.
Published: (2025)
TB or Not TB: Coverage-Driven Direct Preference Optimization for Verilog Stimulus Generation
by: Nadimi, Bardia, et al.
Published: (2025)
Self-Play Only Evolves When Self-Synthetic Pipeline Ensures Learnable Information Gain
by: Liu, Wei, et al.
Published: (2026)
Less is More: Improving LLM Alignment via Preference Data Selection
by: Deng, Xun, et al.
Published: (2025)
Zero-knowledge LLM hallucination detection and mitigation through fine-grained cross-model consistency
by: Goel, Aman, et al.
Published: (2025)
From Noisy Traces to Stable Gradients: Bias-Variance Optimized Preference Optimization for Aligning Large Reasoning Models
by: Zhu, Mingkang, et al.
Published: (2025)
MPPO: Multi Pair-wise Preference Optimization for LLMs with Arbitrary Negative Samples
by: Xie, Shuo, et al.
Published: (2024)
Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap
by: Qi, Xuan, et al.
Published: (2025)
Towards Data-Centric RLHF: Simple Metrics for Preference Dataset Comparison
by: Shen, Judy Hanwen, et al.
Published: (2024)
Stable Preference Optimization: A Bilevel Approach to Catastrophic Preference Shift
by: Jian, Chengtao, et al.
Published: (2025)
Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation
by: Li, Ziniu, et al.
Published: (2025)
TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization
by: Zhu, Mingkang, et al.
Published: (2025)
WPO: Enhancing RLHF with Weighted Preference Optimization
by: Zhou, Wenxuan, et al.
Published: (2024)
Towards Building a Robust Toxicity Predictor
by: Bespalov, Dmitriy, et al.
Published: (2024)
The Larger the Better? Improved LLM Code-Generation via Budget Reallocation
by: Hassid, Michael, et al.
Published: (2024)
PerPO: Perceptual Preference Optimization via Discriminative Rewarding
by: Zhu, Zining, et al.
Published: (2025)
Geometric-Averaged Preference Optimization for Soft Preference Labels
by: Furuta, Hiroki, et al.
Published: (2024)
IDEA Prune: An Integrated Enlarge-and-Prune Pipeline in Generative Language Model Pretraining
by: Li, Yixiao, et al.
Published: (2025)
Online DPO: Online Direct Preference Optimization with Fast-Slow Chasing
by: Qi, Biqing, et al.
Published: (2024)
Orthogonal Finetuning for Direct Preference Optimization
by: Yang, Chenxu, et al.
Published: (2024)
Beyond Scalar Reward Model: Learning Generative Judge from Preference Data
by: Ye, Ziyi, et al.
Published: (2024)
FLAMES: Improving LLM Math Reasoning via a Fine-Grained Analysis of the Data Synthesis Pipeline
by: Seegmiller, Parker, et al.
Published: (2025)
Improving LLM General Preference Alignment via Optimistic Online Mirror Descent
by: Zhang, Yuheng, et al.
Published: (2025)
Multi-Preference Optimization: Generalizing DPO via Set-Level Contrasts
by: Gupta, Taneesh, et al.
Published: (2024)