| Main Authors: | Khan, Rana Muhammad Shahroz, Liu, Zijie, Tan, Zhen, Fleming, Charles, Chen, Tianlong |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.03073 |
Similar Items
$\textit{Agents Under Siege}$: Breaking Pragmatic Multi-Agent LLM Systems with Optimized Prompt Attacks
by: Khan, Rana Muhammad Shahroz, et al.
Published: (2025)
The Quest for Efficient Reasoning: A Data-Centric Benchmark to CoT Distillation
by: Zhang, Ruichen, et al.
Published: (2025)
ORAL: Prompting Your Large-Scale LoRAs via Conditional Recurrent Diffusion
by: Khan, Rana Muhammad Shahroz, et al.
Published: (2025)
PortLLM: Personalizing Evolving Large Language Models with Training-Free and Portable Model Patches
by: Khan, Rana Muhammad Shahroz, et al.
Published: (2024)
EQA-RM: A Generative Embodied Reward Model with Test-time Scaling
by: Chen, Yuhang, et al.
Published: (2025)
Can GRPO Help LLMs Transcend Their Pretraining Origin?
by: Ni, Kangqi, et al.
Published: (2025)
Linear Optimal Partial Transport Embedding
by: Bai, Yikun, et al.
Published: (2023)
Generative VS non-Generative Models in Engineering Shape Optimization
by: Usama, Muhammad, et al.
Published: (2024)
SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning
by: Limozin, Alexis, et al.
Published: (2026)
Patch the Distribution Mismatch: RL Rewriting Agent for Stable Off-Policy SFT
by: Wang, Jiacheng, et al.
Published: (2026)
GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs
by: Deng, Jianing, et al.
Published: (2026)
On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
by: Wu, Yongliang, et al.
Published: (2025)
Physics-Informed Geometric Operators to Support Surrogate, Dimension Reduction and Generative Models for Engineering Design
by: Khan, Shahroz, et al.
Published: (2024)
Trajectory-Oriented Policy Optimization with Sparse Rewards
by: Wang, Guojian, et al.
Published: (2024)
Continual SFT Matches Multimodal RLHF with Negative Supervision
by: Zhu, Ke, et al.
Published: (2024)
RLHF in an SFT Way: From Optimal Solution to Reward-Weighted Alignment
by: Du, Yuhao, et al.
Published: (2025)
What Do Agents Learn from Trajectory-SFT: Semantics or Interfaces?
by: Gu, Weizheng, et al.
Published: (2026)
Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization
by: Bai, Yang, et al.
Published: (2026)
Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder
by: Xu, Zhen, et al.
Published: (2025)
Bridging SFT and RL: Dynamic Policy Optimization for Robust Reasoning
by: Zhu, Taojie, et al.
Published: (2026)
Crafting Reversible SFT Behaviors in Large Language Models
by: Lin, Yuping, et al.
Published: (2026)
GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training
by: Hu, Yuelin, et al.
Published: (2026)
Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections
by: Wang, Bo, et al.
Published: (2025)
Modular Diffusion Policy Training: Decoupling and Recombining Guidance and Diffusion for Offline RL
by: Chen, Zhaoyang, et al.
Published: (2025)
Robust Post-Training for Generative Recommenders: Why Exponential Reward-Weighted SFT Outperforms RLHF
by: Chidambaram, Keertana, et al.
Published: (2026)
GraphRCG: Self-Conditioned Graph Generation
by: Wang, Song, et al.
Published: (2024)
SFT-GO: Supervised Fine-Tuning with Group Optimization for Large Language Models
by: Kim, Gyuhak, et al.
Published: (2025)
Procedural-skill SFT across capacity tiers: A W-Shaped pre-SFT Trajectory and Regime-Asymmetric Mechanism on 0.8B-4B Qwen3.5 Models
by: Strozzi, Igor
Published: (2026)
QuantMoE-Bench: Examining Post-Training Quantization for Mixture-of-Experts
by: Li, Pingzhi, et al.
Published: (2024)
Value-Free Policy Optimization via Reward Partitioning
by: Faye, Bilal, et al.
Published: (2025)
mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT
by: Koh, Woosung, et al.
Published: (2026)
Probing to Refine: Reinforcement Distillation of LLMs via Explanatory Inversion
by: Tan, Zhen, et al.
Published: (2026)
DOGe: Defensive Output Generation for LLM Protection Against Knowledge Distillation
by: Li, Pingzhi, et al.
Published: (2025)
Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning
by: Hong, Joey, et al.
Published: (2024)
Supervised Reward Inference
by: Schwarzer, Will, et al.
Published: (2025)
Self-Supervised On-Policy Distillation for Reasoning Language Models
by: Tan, Zhiquan, et al.
Published: (2026)
FisherSFT: Data-Efficient Supervised Fine-Tuning of Language Models Using Information Gain
by: Deb, Rohan, et al.
Published: (2025)
HopCast: Calibration of Autoregressive Dynamics Models
by: Shahid, Muhammad Bilal, et al.
Published: (2025)
Reward-SQL: Boosting Text-to-SQL via Stepwise Reasoning and Process-Supervised Rewards
by: Zhang, Yuxin, et al.
Published: (2025)
Flow-Based Single-Step Completion for Efficient and Expressive Policy Learning
by: Koirala, Prajwal, et al.
Published: (2025)