| Main authors: | Huang, Zixuan; Ban, Yikun; Fu, Lean; Li, Xiaojie; Dai, Zhongxiang; Li, Jianxin; Wang, Deqing |
|---|---|
| Format: | Preprint |
| Published / Created: | 2025 |
| Subjects: | |
| Online access: | https://arxiv.org/abs/2506.17252 |
Similar items
FedPOB: Sample-Efficient Federated Prompt Optimization via Bandits
by: Lu, Pingchen, et al.
Published / Created: (2025)
T-POP: Test-Time Personalization with Online Preference Feedback
by: Qu, Zikun, et al.
Published / Created: (2025)
ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment
by: Lin, Xiaoqiang, et al.
Published / Created: (2025)
Policy Improvement Reinforcement Learning
by: Wang, Huaiyang, et al.
Published / Created: (2026)
Refining Adaptive Zeroth-Order Optimization at Ease
by: Shu, Yao, et al.
Published / Created: (2025)
LLMBoost: Make Large Language Models Stronger with Boosting
by: Chen, Zehao, et al.
Published / Created: (2025)
Neural Dueling Bandits: Preference-Based Optimization with Human Feedback
by: Verma, Arun, et al.
Published / Created: (2024)
Real-Time Aligned Reward Model beyond Semantics
by: Huang, Zixuan, et al.
Published / Created: (2026)
Preference as Reward, Maximum Preference Optimization with Importance Sampling
by: Jiang, Zaifan, et al.
Published / Created: (2023)
Adaptive Defense against Harmful Fine-Tuning for Large Language Models via Bayesian Data Scheduler
by: Hu, Zixuan, et al.
Published / Created: (2025)
Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards
by: Lu, Xiaodong, et al.
Published / Created: (2026)
ADPO: Anchored Direct Preference Optimization
by: Zixian, Wang
Published / Created: (2025)
AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization
by: Wu, Junkang, et al.
Published / Created: (2024)
Risk-aware Direct Preference Optimization under Nested Risk Measure
by: Zhang, Lijun, et al.
Published / Created: (2025)
Continuous-Utility Direct Preference Optimization
by: Mohsin, Muhammad Ahmed, et al.
Published / Created: (2026)
Batch Bayesian Active Learning with Partial Batch Label Sampling
by: Hu, Kangping, et al.
Published / Created: (2025)
Sample-Efficiency in Multi-Batch Reinforcement Learning: The Need for Dimension-Dependent Adaptivity
by: Johnson, Emmeran, et al.
Published / Created: (2023)
Weak-Driven Learning: How Weak Agents make Strong Agents Stronger
by: Chen, Zehao, et al.
Published / Created: (2026)
$β$-DPO: Direct Preference Optimization with Dynamic $β$
by: Wu, Junkang, et al.
Published / Created: (2024)
Aligning CodeLLMs with Direct Preference Optimization
by: Miao, Yibo, et al.
Published / Created: (2024)
Beyond Training: Optimizing Reinforcement Learning Based Job Shop Scheduling Through Adaptive Action Sampling
by: de Puiseau, Constantin Waubert, et al.
Published / Created: (2024)
Filtered Direct Preference Optimization
by: Morimura, Tetsuro, et al.
Published / Created: (2024)
Direct Preference Optimization with an Offset
by: Amini, Afra, et al.
Published / Created: (2024)
Orthogonal Finetuning for Direct Preference Optimization
by: Yang, Chenxu, et al.
Published / Created: (2024)
Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training
by: Fakoor, Rasool, et al.
Published / Created: (2026)
Online DPO: Online Direct Preference Optimization with Fast-Slow Chasing
by: Qi, Biqing, et al.
Published / Created: (2024)
Prompt Optimization with Human Feedback
by: Lin, Xiaoqiang, et al.
Published / Created: (2024)
Beyond State-Wise Mirror Descent: Offline Policy Optimization with Parametric Policies
by: Li, Xiang, et al.
Published / Created: (2026)
Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization
by: Zhou, Zhanhui, et al.
Published / Created: (2023)
MPPO: Multi Pair-wise Preference Optimization for LLMs with Arbitrary Negative Samples
by: Xie, Shuo, et al.
Published / Created: (2024)
Entropy Controllable Direct Preference Optimization
by: Omura, Motoki, et al.
Published / Created: (2024)
Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler
by: Shen, Yikang, et al.
Published / Created: (2024)
On the Role of Preference Variance in Preference Optimization
by: Guo, Jiacheng, et al.
Published / Created: (2025)
MASPOB: Bandit-Based Prompt Optimization for Multi-Agent Systems with Graph Neural Networks
by: Hong, Zhi, et al.
Published / Created: (2026)
BOTS: Batch Bayesian Optimization of Extended Thompson Sampling for Severely Episode-Limited RL Settings
by: Karine, Karine, et al.
Published / Created: (2024)
PageRank Bandits for Link Prediction
by: Ban, Yikun, et al.
Published / Created: (2024)
Can Graph Neural Networks Learn Language with Extremely Weak Text Supervision?
by: Li, Zihao, et al.
Published / Created: (2024)
KL Penalty Control via Perturbation for Direct Preference Optimization
by: Lee, Sangkyu, et al.
Published / Created: (2025)
MDPO: Multi-Granularity Direct Preference Optimization for Mathematical Reasoning
by: Lin, Yunze
Published / Created: (2025)
C2-DPO: Constrained Controlled Direct Preference Optimization
by: Asadi, Kavosh, et al.
Published / Created: (2025)