| Main Authors: | Zhang, Ziyi; Jiang, Ziheng; Jiang, Chengquan; Yu, Menghan; Zheng, Size; Lin, Haibin; Hoffmann, Henry; Liu, Xin |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2506.11309 |
Similar Items
SpecFed: Accelerating Federated LLM Inference with Speculative Decoding and Compressed Transmission
by: Zheng, Ce, et al.
Published: (2026)
Accelerating Mixture-of-Experts Inference by Hiding Offloading Latency with Speculative Decoding
by: Wang, Zhibin, et al.
Published: (2025)
FlexSpec: Frozen Drafts Meet Evolving Targets in Edge-Cloud Collaborative LLM Speculative Decoding
by: Li, Yuchen, et al.
Published: (2026)
FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference
by: Liu, Xing, et al.
Published: (2025)
SpecMemo: Speculative Decoding is in Your Pocket
by: Yildirim, Selin, et al.
Published: (2025)
Ghidorah: Fast LLM Inference on Edge with Speculative Decoding and Hetero-Core Parallelism
by: Wei, Jinhui, et al.
Published: (2025)
Speculative Decoding in Decentralized LLM Inference: Turning Communication Latency into Computation Throughput
by: Song, Jingwei, et al.
Published: (2025)
ReSpec: Towards Optimizing Speculative Decoding in Reinforcement Learning Systems
by: Chen, Qiaoling, et al.
Published: (2025)
AMUSD: Asynchronous Multi-Device Speculative Decoding for LLM Acceleration
by: McDanel, Bradley
Published: (2024)
Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts
by: Zhang, Shulai, et al.
Published: (2025)
SpecBranch: Speculative Decoding via Hybrid Drafting and Rollback-Aware Branch Parallelism
by: Shen, Yuhao, et al.
Published: (2025)
SpecRouter: Adaptive Routing for Multi-Level Speculative Decoding in Large Language Models
by: Wu, Hang, et al.
Published: (2025)
TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives
by: Zheng, Size, et al.
Published: (2025)
PipeSpec: Breaking Stage Dependencies in Hierarchical LLM Decoding
by: McDanel, Bradley, et al.
Published: (2025)
FLASH Viterbi: Fast and Adaptive Viterbi Decoding for Modern Data Systems
by: Deng, Ziheng, et al.
Published: (2025)
FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving
by: Chen, Wenyan, et al.
Published: (2026)
A Pipelined Collaborative Speculative Decoding Framework for Efficient Edge-Cloud LLM Inference
by: Zhang, Yida, et al.
Published: (2026)
Revisiting Speculative Leaderless Protocols for Low-Latency BFT Replication
by: Qian, Daniel, et al.
Published: (2026)
Striking the Right Balance between Compute and Copy: Improving LLM Inferencing Under Speculative Decoding
by: Ramachandran, Arun, et al.
Published: (2025)
Accelerating OpenPangu Inference on NPU via Speculative Decoding
by: Dai, Yuntao, et al.
Published: (2026)
FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion
by: Chang, Li-Wen, et al.
Published: (2024)
Falcon: Advancing Asynchronous BFT Consensus for Lower Latency and Enhanced Throughput
by: Dai, Xiaohai, et al.
Published: (2025)
Hyperion: Low-Latency Ultra-HD Video Analytics via Collaborative Vision Transformer Inference
by: Jiang, Linyi, et al.
Published: (2025)
UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training
by: Zheng, Size, et al.
Published: (2026)
StreamServe: Adaptive Speculative Flows for Low-Latency Disaggregated LLM Serving
by: Kumar, Satyam, et al.
Published: (2026)
MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production
by: Jin, Chao, et al.
Published: (2025)
Asynchronous Latency and Fast Atomic Snapshot
by: Bezerra, João Paulo, et al.
Published: (2024)
GoodSpeed: Optimizing Fair Goodput with Adaptive Speculative Decoding in Distributed Edge Inference
by: Tran, Phuong, et al.
Published: (2025)
PipeSD: An Efficient Cloud-Edge Collaborative Pipeline Inference Framework with Speculative Decoding
by: Han, Yunhe, et al.
Published: (2026)
SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection
by: Shukla, Shikhar
Published: (2026)
Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding
by: Jin, Tian, et al.
Published: (2025)
SP-MoE: Speculative Decoding and Prefetching for Accelerating MoE-based Model Inference
by: Chen, Liangkun, et al.
Published: (2025)
SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding
by: Abramovich, Talor, et al.
Published: (2026)
Mahi-Mahi: Low-Latency Asynchronous BFT DAG-Based Consensus
by: Jovanovic, Philipp, et al.
Published: (2024)
SpecInF: Exploiting Idle GPU Resources in Distributed DL Training via Speculative Inference Filling
by: Lv, Cunchi, et al.
Published: (2025)
Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving
by: Li, Rui, et al.
Published: (2025)
Optimizing Long-context LLM Serving via Fine-grained Sequence Parallelism
by: Li, Cong, et al.
Published: (2025)
Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler
by: Zheng, Size, et al.
Published: (2025)
MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production
by: Xue, Chunyu, et al.
Published: (2026)
Utility-Driven Speculative Decoding for Mixture-of-Experts
by: Saxena, Anish, et al.
Published: (2025)