| Main Authors: | Chang, Chi-Chih, Zhu, Siqi, Zeng, Zhichen, Lin, Haibin, You, Jiaxuan, Abdelfattah, Mohamed S., Jiang, Ziheng, Qian, Xuehai |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2601.09083 |
Similar Items
Speculate Deep and Accurate: Lossless and Training-Free Acceleration for Offloaded LLMs via Substitute Speculative Decoding
by: Wang, Pei-Shuo, et al.
Published: (2025)
DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism
by: Zeng, Zhichen, et al.
Published: (2026)
SPEC-RL: Accelerating On-Policy Reinforcement Learning with Speculative Rollouts
by: Liu, Bingshuai, et al.
Published: (2025)
OpenTinker: Separating Concerns in Agentic Reinforcement Learning
by: Zhu, Siqi, et al.
Published: (2026)
Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding
by: Zhao, Yilong, et al.
Published: (2025)
Mesh-Attention: A New Communication-Efficient Distributed Attention with Improved Data Locality
by: Chen, Sirui, et al.
Published: (2025)
xKV: Cross-Layer KV-Cache Compression via Aligned Singular Vector Extraction
by: Chang, Chi-Chih, et al.
Published: (2025)
Palu: Compressing KV-Cache with Low-Rank Projection
by: Chang, Chi-Chih, et al.
Published: (2024)
Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding
by: Iso, Hayate, et al.
Published: (2026)
EVCAR-KG: A Knowledge-Infused Multi-Agent Reinforcement Learning Framework for Resilient Electric Vehicle Charging Network Recovery
by: Abdelfattah, Mohamed
Published: (2025)
SwiftSpec: Ultra-Low Latency LLM Decoding by Scaling Asynchronous Speculative Decoding
by: Zhang, Ziyi, et al.
Published: (2025)
SpeCa: Accelerating Diffusion Transformers with Speculative Feature Caching
by: Liu, Jiacheng, et al.
Published: (2025)
FlashDLM: Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion
by: Hu, Zhanqiu, et al.
Published: (2025)
GreedySnake: Accelerating SSD-Offloaded LLM Training with Efficient Scheduling and Optimizer Step Overlapping
by: Yin, Yishu, et al.
Published: (2025)
Faster LLM Inference via Sequential Monte Carlo
by: Emara, Yahya, et al.
Published: (2026)
DARE: Diffusion Language Model Activation Reuse for Efficient Inference
by: Frumkin, Natalia, et al.
Published: (2026)
ProCache: Constraint-Aware Feature Caching with Selective Computation for Diffusion Transformer Acceleration
by: Cao, Fanpu, et al.
Published: (2025)
Cacheback: Speculative Decoding With Nothing But Cache
by: Ma, Zhiyao, et al.
Published: (2025)
Conservative Discrete Structure Stabilizes Autoregressive Rollouts in a 1D Drift Diffusion Poisson Benchmark
by: Wang, Yufeng, et al.
Published: (2026)
Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models
by: Chiang, Hung-Yueh, et al.
Published: (2025)
SplitReason: Learning To Offload Reasoning
by: Akhauri, Yash, et al.
Published: (2025)
BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning
by: Xu, Yuhang, et al.
Published: (2026)
Accelerating Diffusion Transformer via Error-Optimized Cache
by: Qiu, Junxiang, et al.
Published: (2025)
SparAMX: Accelerating Compressed LLMs Token Generation on AMX-powered CPUs
by: AbouElhamayed, Ahmed F., et al.
Published: (2025)
Minions: Accelerating Large Language Model Inference with Aggregated Speculative Execution
by: Wang, Siqi, et al.
Published: (2024)
IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
by: Bai, Yushi, et al.
Published: (2026)
GTAlign: Game-Theoretic Alignment of LLM Assistants for Social Welfare
by: Zhu, Siqi, et al.
Published: (2025)
Window-Diffusion: Accelerating Diffusion Language Model Inference with Windowed Token Pruning and Caching
by: Zuo, Fengrui, et al.
Published: (2026)
UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs
by: Chiang, Hung-Yueh, et al.
Published: (2025)
TokenButler: Token Importance is Predictable
by: Akhauri, Yash, et al.
Published: (2025)
Probing the Knowledge Boundary: An Interactive Agentic Framework for Deep Knowledge Extraction
by: Yang, Yuheng, et al.
Published: (2026)
Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning
by: Zhang, Haozhen, et al.
Published: (2025)
BBS: Bi-directional Bit-level Sparsity for Deep Learning Acceleration
by: Chen, Yuzong, et al.
Published: (2024)
Compute Where it Counts: Self Optimizing Language Models
by: Akhauri, Yash, et al.
Published: (2026)
NITRO: LLM Inference on Intel Laptop NPUs
by: Fei, Anthony, et al.
Published: (2024)
Encodings for Prediction-based Neural Architecture Search
by: Akhauri, Yash, et al.
Published: (2024)
On Latency Predictors for Neural Architecture Search
by: Akhauri, Yash, et al.
Published: (2024)
Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning
by: Xu, Yixuan Even, et al.
Published: (2025)
EchoRL: Reinforcement Learning via Rollout Echoing
by: Bi, Jinhe, et al.
Published: (2026)
CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG
by: Shen, Jianghan, et al.
Published: (2026)