:: Library Catalog

Bibliographic Details
Main Authors:	Khan, Rana Muhammad Shahroz, Liu, Zijie, Tan, Zhen, Fleming, Charles, Chen, Tianlong
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2602.03073

Similar Items

Agents Under Siege: Breaking Pragmatic Multi-Agent LLM Systems with Optimized Prompt Attacks
by: Khan, Rana Muhammad Shahroz, et al.
Published: (2025)

The Quest for Efficient Reasoning: A Data-Centric Benchmark to CoT Distillation
by: Zhang, Ruichen, et al.
Published: (2025)

ORAL: Prompting Your Large-Scale LoRAs via Conditional Recurrent Diffusion
by: Khan, Rana Muhammad Shahroz, et al.
Published: (2025)

PortLLM: Personalizing Evolving Large Language Models with Training-Free and Portable Model Patches
by: Khan, Rana Muhammad Shahroz, et al.
Published: (2024)

EQA-RM: A Generative Embodied Reward Model with Test-time Scaling
by: Chen, Yuhang, et al.
Published: (2025)

Can GRPO Help LLMs Transcend Their Pretraining Origin?
by: Ni, Kangqi, et al.
Published: (2025)

Linear Optimal Partial Transport Embedding
by: Bai, Yikun, et al.
Published: (2023)

Generative VS non-Generative Models in Engineering Shape Optimization
by: Usama, Muhammad, et al.
Published: (2024)

SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning
by: Limozin, Alexis, et al.
Published: (2026)

Patch the Distribution Mismatch: RL Rewriting Agent for Stable Off-Policy SFT
by: Wang, Jiacheng, et al.
Published: (2026)

GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs
by: Deng, Jianing, et al.
Published: (2026)

On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
by: Wu, Yongliang, et al.
Published: (2025)

Physics-Informed Geometric Operators to Support Surrogate, Dimension Reduction and Generative Models for Engineering Design
by: Khan, Shahroz, et al.
Published: (2024)

Trajectory-Oriented Policy Optimization with Sparse Rewards
by: Wang, Guojian, et al.
Published: (2024)

Continual SFT Matches Multimodal RLHF with Negative Supervision
by: Zhu, Ke, et al.
Published: (2024)

RLHF in an SFT Way: From Optimal Solution to Reward-Weighted Alignment
by: Du, Yuhao, et al.
Published: (2025)

What Do Agents Learn from Trajectory-SFT: Semantics or Interfaces?
by: Gu, Weizheng, et al.
Published: (2026)

Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization
by: Bai, Yang, et al.
Published: (2026)

Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder
by: Xu, Zhen, et al.
Published: (2025)

Bridging SFT and RL: Dynamic Policy Optimization for Robust Reasoning
by: Zhu, Taojie, et al.
Published: (2026)

Crafting Reversible SFT Behaviors in Large Language Models
by: Lin, Yuping, et al.
Published: (2026)

GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training
by: Hu, Yuelin, et al.
Published: (2026)

Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections
by: Wang, Bo, et al.
Published: (2025)

Modular Diffusion Policy Training: Decoupling and Recombining Guidance and Diffusion for Offline RL
by: Chen, Zhaoyang, et al.
Published: (2025)

Robust Post-Training for Generative Recommenders: Why Exponential Reward-Weighted SFT Outperforms RLHF
by: Chidambaram, Keertana, et al.
Published: (2026)

GraphRCG: Self-Conditioned Graph Generation
by: Wang, Song, et al.
Published: (2024)

SFT-GO: Supervised Fine-Tuning with Group Optimization for Large Language Models
by: Kim, Gyuhak, et al.
Published: (2025)

Procedural-skill SFT across capacity tiers: A W-Shaped pre-SFT Trajectory and Regime-Asymmetric Mechanism on 0.8B-4B Qwen3.5 Models
by: Strozzi, Igor
Published: (2026)

QuantMoE-Bench: Examining Post-Training Quantization for Mixture-of-Experts
by: Li, Pingzhi, et al.
Published: (2024)

Value-Free Policy Optimization via Reward Partitioning
by: Faye, Bilal, et al.
Published: (2025)

mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT
by: Koh, Woosung, et al.
Published: (2026)

Probing to Refine: Reinforcement Distillation of LLMs via Explanatory Inversion
by: Tan, Zhen, et al.
Published: (2026)

DOGe: Defensive Output Generation for LLM Protection Against Knowledge Distillation
by: Li, Pingzhi, et al.
Published: (2025)

Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning
by: Hong, Joey, et al.
Published: (2024)

Supervised Reward Inference
by: Schwarzer, Will, et al.
Published: (2025)

Self-Supervised On-Policy Distillation for Reasoning Language Models
by: Tan, Zhiquan, et al.
Published: (2026)

FisherSFT: Data-Efficient Supervised Fine-Tuning of Language Models Using Information Gain
by: Deb, Rohan, et al.
Published: (2025)

HopCast: Calibration of Autoregressive Dynamics Models
by: Shahid, Muhammad Bilal, et al.
Published: (2025)

Reward-SQL: Boosting Text-to-SQL via Stepwise Reasoning and Process-Supervised Rewards
by: Zhang, Yuxin, et al.
Published: (2025)

Flow-Based Single-Step Completion for Efficient and Expressive Policy Learning
by: Koirala, Prajwal, et al.
Published: (2025)