| Main Authors: | Zhang, Yuheng, Huo, Mingyue, Zhu, Minghao, Zhang, Mengxue, Jiang, Nan |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.02686 |
Similar Items
Beyond State-Wise Mirror Descent: Offline Policy Optimization with Parametric Policies
by: Li, Xiang, et al.
Published: (2026)
Statistical Tractability of Off-policy Evaluation of History-dependent Policies in POMDPs
by: Zhang, Yuheng, et al.
Published: (2025)
On the Curses of Future and History in Future-dependent Value Functions for Off-policy Evaluation
by: Zhang, Yuheng, et al.
Published: (2024)
Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
by: Zhang, Yuheng, et al.
Published: (2026)
Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment
by: Wang, Chaoqi, et al.
Published: (2025)
Exploring Token-Space Manipulation in Latent Audio Tokenizers
by: Paissan, Francesco, et al.
Published: (2026)
Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning
by: Zhang, Yuheng, et al.
Published: (2024)
Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens
by: Valmeekam, Karthik, et al.
Published: (2025)
Preference Poisoning Attacks on Reward Model Learning
by: Wu, Junlin, et al.
Published: (2024)
ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression
by: Liu, Guangda, et al.
Published: (2024)
T-REG: Preference Optimization with Token-Level Reward Regularization
by: Zhou, Wenxuan, et al.
Published: (2024)
Beyond Hidden-Layer Manipulation: Semantically-Aware Logit Interventions for Debiasing LLMs
by: Xia, Wei
Published: (2025)
Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation
by: Ding, Fei, et al.
Published: (2026)
From Values to Tokens: An LLM-Driven Framework for Context-aware Time Series Forecasting via Symbolic Discretization
by: Tao, Xiaoyu, et al.
Published: (2025)
Policy Filtration for RLHF to Mitigate Noise in Reward Models
by: Zhang, Chuheng, et al.
Published: (2024)
Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training
by: Ye, Chenlu, et al.
Published: (2025)
Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space
by: Qu, Xingwei, et al.
Published: (2025)
Resolving Token-Space Gradient Conflicts: Token Space Manipulation for Transformer-Based Multi-Task Learning
by: Jeong, Wooseong, et al.
Published: (2025)
Offline Reinforcement Learning in Large State Spaces: Algorithms and Guarantees
by: Jiang, Nan, et al.
Published: (2025)
RLHF Workflow: From Reward Modeling to Online RLHF
by: Dong, Hanze, et al.
Published: (2024)
InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling
by: Miao, Yuchun, et al.
Published: (2024)
Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization
by: Yu, Xin, et al.
Published: (2026)
CHARM: Calibrating Reward Models With Chatbot Arena Scores
by: Zhu, Xiao, et al.
Published: (2025)
Catching a Moving Subspace: Low-Rank Bandits Beyond Stationarity
by: Khosravi, Hamed, et al.
Published: (2026)
Beyond Rewards in Reinforcement Learning for Cyber Defence
by: Bates, Elizabeth, et al.
Published: (2026)
Beyond Distribution Sharpening: The Importance of Task Rewards
by: Mittal, Sarthak, et al.
Published: (2026)
Subwords as Skills: Tokenization for Sparse-Reward Reinforcement Learning
by: Yunis, David, et al.
Published: (2023)
Enhancing Large Multimodal Models with Adaptive Sparsity and KV Cache Compression
by: Zhang, Te, et al.
Published: (2025)
Beyond Correctness: Confidence-Aware Reward Modeling for Enhancing Large Language Model Reasoning
by: He, Qianxi, et al.
Published: (2025)
DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
by: Hu, Haoyu, et al.
Published: (2026)
Reinforcement Learning with Promising Tokens for Large Language Models
by: Pang, Jing-Cheng, et al.
Published: (2026)
A Note on Loss Functions and Error Compounding in Model-based Reinforcement Learning
by: Jiang, Nan
Published: (2024)
Beyond Binary Preferences: A Principled Framework for Reward Modeling with Ordinal Feedback
by: Afsharrad, Amirhossein, et al.
Published: (2026)
TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization
by: Zhu, Mingkang, et al.
Published: (2025)
Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level
by: Jia, Nan, et al.
Published: (2026)
Beyond Relevance: Utility-Centric Retrieval in the LLM Era
by: Zhang, Hengran, et al.
Published: (2026)
Beyond Scalar Rewards: An Axiomatic Framework for Lexicographic MDPs
by: Shakerinava, Mehran, et al.
Published: (2025)
TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics
by: Chen, Shirui, et al.
Published: (2026)
Chimera: State Space Models Beyond Sequences
by: Lahoti, Aakash, et al.
Published: (2025)
Label Distribution Learning from Logical Label
by: Jia, Yuheng, et al.
Published: (2023)