| Main Authors: | Hoang, Duc, Jaiswal, Ajay, Samragh, Mohammad, Cho, Minsik |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2602.03921 |
Similar Items
SPD: Sync-Point Drop for Efficient Tensor Parallelism of Large Language Models
by: Kim, Han-Byul, et al.
Published: (2025)
TIDE: Every Layer Knows the Token Beneath the Context
by: Jaiswal, Ajay, et al.
Published: (2026)
MoEs Are Stronger than You Think: Hyper-Parallel Inference Scaling with RoE
by: Zibakhsh, Soheil, et al.
Published: (2025)
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
by: Armandpour, Mohammadreza, et al.
Published: (2026)
MoE-PHDS: One MoE checkpoint for flexible runtime sparsity
by: Hannah, Lauren. A, et al.
Published: (2025)
SpecMoE: A Fast and Efficient Mixture-of-Experts Inference via Self-Assisted Speculative Decoding
by: Bang, Jehyeon, et al.
Published: (2026)
Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization
by: Samragh, Mohammad, et al.
Published: (2024)
Finding Fantastic Experts in MoEs: A Unified Study for Expert Dropping Strategies and Observations
by: Jaiswal, Ajay, et al.
Published: (2025)
SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences
by: Cha, Jungyoub, et al.
Published: (2025)
HiSpec: Hierarchical Speculative Decoding for LLMs
by: Kumar, Avinash, et al.
Published: (2025)
BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms
by: Hou, Yunlong, et al.
Published: (2025)
QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache
by: Tiwari, Rishabh, et al.
Published: (2025)
MemoryLLM: Plug-n-Play Interpretable Feed-Forward Memory for Transformers
by: Jaiswal, Ajay, et al.
Published: (2026)
SpecMemo: Speculative Decoding is in Your Pocket
by: Yildirim, Selin, et al.
Published: (2025)
Spec-VLA: Speculative Decoding for Vision-Language-Action Models with Relaxed Acceptance
by: Wang, Songsheng, et al.
Published: (2025)
SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning
by: Pan, Rui, et al.
Published: (2025)
Speculating Experts Accelerates Inference for Mixture-of-Experts
by: Madan, Vivan, et al.
Published: (2026)
DistillSpec: Improving Speculative Decoding via Knowledge Distillation
by: Zhou, Yongchao, et al.
Published: (2023)
Towards Low-bit Communication for Tensor Parallel LLM Inference
by: Dong, Harry, et al.
Published: (2024)
ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts
by: Georganas, Evangelos, et al.
Published: (2025)
SpecExit: Accelerating Large Reasoning Model via Speculative Exit
by: Yang, Rubing, et al.
Published: (2025)
SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths
by: Huang, Kaixuan, et al.
Published: (2024)
BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning
by: Xu, Yuhang, et al.
Published: (2026)
KnapSpec: Self-Speculative Decoding via Adaptive Layer Selection as a Knapsack Problem
by: Cha, Seongjin, et al.
Published: (2026)
CAS-Spec: Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs
by: Ning, Zhiyuan, et al.
Published: (2025)
SpecForge: A Flexible and Efficient Open-Source Training Framework for Speculative Decoding
by: Li, Shenggui, et al.
Published: (2026)
SpecBound: Adaptive Bounded Self-Speculation with Layer-wise Confidence Calibration
by: Wen, Zhuofan, et al.
Published: (2026)
LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification
by: Yang, Penghui, et al.
Published: (2025)
LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
by: Fu, Qichen, et al.
Published: (2024)
Recursive Language Models Meet Uncertainty: The Surprising Effectiveness of Self-Reflective Program Search for Long Context
by: Alizadeh, Keivan, et al.
Published: (2026)
DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models
by: Zhang, Jinbin, et al.
Published: (2025)
Do Compressed LLMs Forget Knowledge? An Experimental Study with Practical Implications
by: Hoang, Duc N. M, et al.
Published: (2023)
FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling
by: Zhao, Weilin, et al.
Published: (2025)
Barriers for Learning in an Evolving World: Mathematical Understanding of Loss of Plasticity
by: Joudaki, Amir, et al.
Published: (2025)
Junk DNA Hypothesis: Pruning Small Pre-Trained Weights Irreversibly and Monotonically Impairs "Difficult" Downstream Tasks in LLMs
by: Yin, Lu, et al.
Published: (2023)
LLaGA: Large Language and Graph Assistant
by: Chen, Runjin, et al.
Published: (2024)
Your LLM Knows the Future: Uncovering Its Multi-Token Prediction Potential
by: Samragh, Mohammad, et al.
Published: (2025)
Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching
by: Dong, Yanhao, et al.
Published: (2025)
Explainable AI in Time-Sensitive Scenarios: Prefetched Offline Explanation Model
by: Russo, Fabio Michele, et al.
Published: (2025)
SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection
by: Shukla, Shikhar
Published: (2026)