Library Catalog

Bibliographic Details
Main Authors:	Chen, Zhuotong; Liu, Fang; Zhu, Jennifer; Du, Wanyu; Qi, Yanjun
Format:	Preprint
Published:	2024
Subjects:	Machine Learning; Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2411.05875

Similar Items

Optimizing Anytime Reasoning via Budget Relative Policy Optimization
by: Qi, Penghui, et al.
Published: (2025)

Preference Optimization via Contrastive Divergence: Your Reward Model is Secretly an NLL Estimator
by: Chen, Zhuotong, et al.
Published: (2025)

Dynamic Model Merging Made Slim
by: Du, Guodong, et al.
Published: (2026)

Less is More for Improving Automatic Evaluation of Factual Consistency
by: Wang, Tong, et al.
Published: (2024)

T-REG: Preference Optimization with Token-Level Reward Regularization
by: Zhou, Wenxuan, et al.
Published: (2024)

Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level
by: Liu, Jie, et al.
Published: (2024)

TSO: Self-Training with Scaled Preference Optimization
by: Chen, Kaihui, et al.
Published: (2024)

Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
by: Rosset, Corby, et al.
Published: (2024)

Entropy Controllable Direct Preference Optimization
by: Omura, Motoki, et al.
Published: (2024)

TaeBench: Improving Quality of Toxic Adversarial Examples
by: Zhu, Xuan, et al.
Published: (2024)

Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization
by: Yang, Junming, et al.
Published: (2025)

Token-Budget-Aware LLM Reasoning
by: Han, Tingxu, et al.
Published: (2024)

Preference Optimization by Estimating the Ratio of the Data Distribution
by: Kim, Yeongmin, et al.
Published: (2025)

APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay
by: Prabhakar, Akshara, et al.
Published: (2025)

Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization
by: Wu, Junkang, et al.
Published: (2024)

Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization
by: Huang, Audrey, et al.
Published: (2024)

On the Role of Preference Variance in Preference Optimization
by: Guo, Jiacheng, et al.
Published: (2025)

TB or Not TB: Coverage-Driven Direct Preference Optimization for Verilog Stimulus Generation
by: Nadimi, Bardia, et al.
Published: (2025)

Self-Play Only Evolves When Self-Synthetic Pipeline Ensures Learnable Information Gain
by: Liu, Wei, et al.
Published: (2026)

Less is More: Improving LLM Alignment via Preference Data Selection
by: Deng, Xun, et al.
Published: (2025)

Zero-knowledge LLM hallucination detection and mitigation through fine-grained cross-model consistency
by: Goel, Aman, et al.
Published: (2025)

From Noisy Traces to Stable Gradients: Bias-Variance Optimized Preference Optimization for Aligning Large Reasoning Models
by: Zhu, Mingkang, et al.
Published: (2025)

MPPO: Multi Pair-wise Preference Optimization for LLMs with Arbitrary Negative Samples
by: Xie, Shuo, et al.
Published: (2024)

Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap
by: Qi, Xuan, et al.
Published: (2025)

Towards Data-Centric RLHF: Simple Metrics for Preference Dataset Comparison
by: Shen, Judy Hanwen, et al.
Published: (2024)

Stable Preference Optimization: A Bilevel Approach to Catastrophic Preference Shift
by: Jian, Chengtao, et al.
Published: (2025)

Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation
by: Li, Ziniu, et al.
Published: (2025)

TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization
by: Zhu, Mingkang, et al.
Published: (2025)

WPO: Enhancing RLHF with Weighted Preference Optimization
by: Zhou, Wenxuan, et al.
Published: (2024)

Towards Building a Robust Toxicity Predictor
by: Bespalov, Dmitriy, et al.
Published: (2024)

The Larger the Better? Improved LLM Code-Generation via Budget Reallocation
by: Hassid, Michael, et al.
Published: (2024)

PerPO: Perceptual Preference Optimization via Discriminative Rewarding
by: Zhu, Zining, et al.
Published: (2025)

Geometric-Averaged Preference Optimization for Soft Preference Labels
by: Furuta, Hiroki, et al.
Published: (2024)

IDEA Prune: An Integrated Enlarge-and-Prune Pipeline in Generative Language Model Pretraining
by: Li, Yixiao, et al.
Published: (2025)

Online DPO: Online Direct Preference Optimization with Fast-Slow Chasing
by: Qi, Biqing, et al.
Published: (2024)

Orthogonal Finetuning for Direct Preference Optimization
by: Yang, Chenxu, et al.
Published: (2024)

Beyond Scalar Reward Model: Learning Generative Judge from Preference Data
by: Ye, Ziyi, et al.
Published: (2024)

FLAMES: Improving LLM Math Reasoning via a Fine-Grained Analysis of the Data Synthesis Pipeline
by: Seegmiller, Parker, et al.
Published: (2025)

Improving LLM General Preference Alignment via Optimistic Online Mirror Descent
by: Zhang, Yuheng, et al.
Published: (2025)

Multi-Preference Optimization: Generalizing DPO via Set-Level Contrasts
by: Gupta, Taneesh, et al.
Published: (2024)