:: Library Catalog

Bibliographic details
Main authors:	Huang, Zixuan; Ban, Yikun; Fu, Lean; Li, Xiaojie; Dai, Zhongxiang; Li, Jianxin; Wang, Deqing
Format:	Preprint
Published / Created:	2025
Subjects:	Machine Learning; Artificial Intelligence
Online access:	https://arxiv.org/abs/2506.17252

Similar items

FedPOB: Sample-Efficient Federated Prompt Optimization via Bandits
by: Lu, Pingchen, et al.
Published / Created: (2025)

T-POP: Test-Time Personalization with Online Preference Feedback
by: Qu, Zikun, et al.
Published / Created: (2025)

ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment
by: Lin, Xiaoqiang, et al.
Published / Created: (2025)

Policy Improvement Reinforcement Learning
by: Wang, Huaiyang, et al.
Published / Created: (2026)

Refining Adaptive Zeroth-Order Optimization at Ease
by: Shu, Yao, et al.
Published / Created: (2025)

LLMBoost: Make Large Language Models Stronger with Boosting
by: Chen, Zehao, et al.
Published / Created: (2025)

Neural Dueling Bandits: Preference-Based Optimization with Human Feedback
by: Verma, Arun, et al.
Published / Created: (2024)

Real-Time Aligned Reward Model beyond Semantics
by: Huang, Zixuan, et al.
Published / Created: (2026)

Preference as Reward, Maximum Preference Optimization with Importance Sampling
by: Jiang, Zaifan, et al.
Published / Created: (2023)

Adaptive Defense against Harmful Fine-Tuning for Large Language Models via Bayesian Data Scheduler
by: Hu, Zixuan, et al.
Published / Created: (2025)

Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards
by: Lu, Xiaodong, et al.
Published / Created: (2026)

ADPO: Anchored Direct Preference Optimization
by: Zixian, Wang
Published / Created: (2025)

AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization
by: Wu, Junkang, et al.
Published / Created: (2024)

Risk-aware Direct Preference Optimization under Nested Risk Measure
by: Zhang, Lijun, et al.
Published / Created: (2025)

Continuous-Utility Direct Preference Optimization
by: Mohsin, Muhammad Ahmed, et al.
Published / Created: (2026)

Batch Bayesian Active Learning with Partial Batch Label Sampling
by: Hu, Kangping, et al.
Published / Created: (2025)

Sample-Efficiency in Multi-Batch Reinforcement Learning: The Need for Dimension-Dependent Adaptivity
by: Johnson, Emmeran, et al.
Published / Created: (2023)

Weak-Driven Learning: How Weak Agents make Strong Agents Stronger
by: Chen, Zehao, et al.
Published / Created: (2026)

β-DPO: Direct Preference Optimization with Dynamic β
by: Wu, Junkang, et al.
Published / Created: (2024)

Aligning CodeLLMs with Direct Preference Optimization
by: Miao, Yibo, et al.
Published / Created: (2024)

Beyond Training: Optimizing Reinforcement Learning Based Job Shop Scheduling Through Adaptive Action Sampling
by: de Puiseau, Constantin Waubert, et al.
Published / Created: (2024)

Filtered Direct Preference Optimization
by: Morimura, Tetsuro, et al.
Published / Created: (2024)

Direct Preference Optimization with an Offset
by: Amini, Afra, et al.
Published / Created: (2024)

Orthogonal Finetuning for Direct Preference Optimization
by: Yang, Chenxu, et al.
Published / Created: (2024)

Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training
by: Fakoor, Rasool, et al.
Published / Created: (2026)

Online DPO: Online Direct Preference Optimization with Fast-Slow Chasing
by: Qi, Biqing, et al.
Published / Created: (2024)

Prompt Optimization with Human Feedback
by: Lin, Xiaoqiang, et al.
Published / Created: (2024)

Beyond State-Wise Mirror Descent: Offline Policy Optimization with Parametric Policies
by: Li, Xiang, et al.
Published / Created: (2026)

Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization
by: Zhou, Zhanhui, et al.
Published / Created: (2023)

MPPO: Multi Pair-wise Preference Optimization for LLMs with Arbitrary Negative Samples
by: Xie, Shuo, et al.
Published / Created: (2024)

Entropy Controllable Direct Preference Optimization
by: Omura, Motoki, et al.
Published / Created: (2024)

Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler
by: Shen, Yikang, et al.
Published / Created: (2024)

On the Role of Preference Variance in Preference Optimization
by: Guo, Jiacheng, et al.
Published / Created: (2025)

MASPOB: Bandit-Based Prompt Optimization for Multi-Agent Systems with Graph Neural Networks
by: Hong, Zhi, et al.
Published / Created: (2026)

BOTS: Batch Bayesian Optimization of Extended Thompson Sampling for Severely Episode-Limited RL Settings
by: Karine, Karine, et al.
Published / Created: (2024)

PageRank Bandits for Link Prediction
by: Ban, Yikun, et al.
Published / Created: (2024)

Can Graph Neural Networks Learn Language with Extremely Weak Text Supervision?
by: Li, Zihao, et al.
Published / Created: (2024)

KL Penalty Control via Perturbation for Direct Preference Optimization
by: Lee, Sangkyu, et al.
Published / Created: (2025)

MDPO: Multi-Granularity Direct Preference Optimization for Mathematical Reasoning
by: Lin, Yunze
Published / Created: (2025)

C2-DPO: Constrained Controlled Direct Preference Optimization
by: Asadi, Kavosh, et al.
Published / Created: (2025)