:: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Ziyi; Jiang, Ziheng; Jiang, Chengquan; Yu, Menghan; Zheng, Size; Lin, Haibin; Hoffmann, Henry; Liu, Xin
Format:	Preprint
Published:	2025
Subjects:	Distributed, Parallel, and Cluster Computing; Machine Learning
Online Access:	https://arxiv.org/abs/2506.11309

Similar Items

SpecFed: Accelerating Federated LLM Inference with Speculative Decoding and Compressed Transmission
by: Zheng, Ce, et al.
Published: (2026)

Accelerating Mixture-of-Experts Inference by Hiding Offloading Latency with Speculative Decoding
by: Wang, Zhibin, et al.
Published: (2025)

FlexSpec: Frozen Drafts Meet Evolving Targets in Edge-Cloud Collaborative LLM Speculative Decoding
by: Li, Yuchen, et al.
Published: (2026)

FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference
by: Liu, Xing, et al.
Published: (2025)

SpecMemo: Speculative Decoding is in Your Pocket
by: Yildirim, Selin, et al.
Published: (2025)

Ghidorah: Fast LLM Inference on Edge with Speculative Decoding and Hetero-Core Parallelism
by: Wei, Jinhui, et al.
Published: (2025)

Speculative Decoding in Decentralized LLM Inference: Turning Communication Latency into Computation Throughput
by: Song, Jingwei, et al.
Published: (2025)

ReSpec: Towards Optimizing Speculative Decoding in Reinforcement Learning Systems
by: Chen, Qiaoling, et al.
Published: (2025)

AMUSD: Asynchronous Multi-Device Speculative Decoding for LLM Acceleration
by: McDanel, Bradley
Published: (2024)

Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts
by: Zhang, Shulai, et al.
Published: (2025)

SpecBranch: Speculative Decoding via Hybrid Drafting and Rollback-Aware Branch Parallelism
by: Shen, Yuhao, et al.
Published: (2025)

SpecRouter: Adaptive Routing for Multi-Level Speculative Decoding in Large Language Models
by: Wu, Hang, et al.
Published: (2025)

TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives
by: Zheng, Size, et al.
Published: (2025)

PipeSpec: Breaking Stage Dependencies in Hierarchical LLM Decoding
by: McDanel, Bradley, et al.
Published: (2025)

FLASH Viterbi: Fast and Adaptive Viterbi Decoding for Modern Data Systems
by: Deng, Ziheng, et al.
Published: (2025)

FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving
by: Chen, Wenyan, et al.
Published: (2026)

A Pipelined Collaborative Speculative Decoding Framework for Efficient Edge-Cloud LLM Inference
by: Zhang, Yida, et al.
Published: (2026)

Revisiting Speculative Leaderless Protocols for Low-Latency BFT Replication
by: Qian, Daniel, et al.
Published: (2026)

Striking the Right Balance between Compute and Copy: Improving LLM Inferencing Under Speculative Decoding
by: Ramachandran, Arun, et al.
Published: (2025)

Accelerating OpenPangu Inference on NPU via Speculative Decoding
by: Dai, Yuntao, et al.
Published: (2026)

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion
by: Chang, Li-Wen, et al.
Published: (2024)

Falcon: Advancing Asynchronous BFT Consensus for Lower Latency and Enhanced Throughput
by: Dai, Xiaohai, et al.
Published: (2025)

Hyperion: Low-Latency Ultra-HD Video Analytics via Collaborative Vision Transformer Inference
by: Jiang, Linyi, et al.
Published: (2025)

UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training
by: Zheng, Size, et al.
Published: (2026)

StreamServe: Adaptive Speculative Flows for Low-Latency Disaggregated LLM Serving
by: Kumar, Satyam, et al.
Published: (2026)

MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production
by: Jin, Chao, et al.
Published: (2025)

Asynchronous Latency and Fast Atomic Snapshot
by: Bezerra, João Paulo, et al.
Published: (2024)

GoodSpeed: Optimizing Fair Goodput with Adaptive Speculative Decoding in Distributed Edge Inference
by: Tran, Phuong, et al.
Published: (2025)

PipeSD: An Efficient Cloud-Edge Collaborative Pipeline Inference Framework with Speculative Decoding
by: Han, Yunhe, et al.
Published: (2026)

SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection
by: Shukla, Shikhar
Published: (2026)

Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding
by: Jin, Tian, et al.
Published: (2025)

SP-MoE: Speculative Decoding and Prefetching for Accelerating MoE-based Model Inference
by: Chen, Liangkun, et al.
Published: (2025)

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding
by: Abramovich, Talor, et al.
Published: (2026)

Mahi-Mahi: Low-Latency Asynchronous BFT DAG-Based Consensus
by: Jovanovic, Philipp, et al.
Published: (2024)

SpecInF: Exploiting Idle GPU Resources in Distributed DL Training via Speculative Inference Filling
by: Lv, Cunchi, et al.
Published: (2025)

Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving
by: Li, Rui, et al.
Published: (2025)

Optimizing Long-context LLM Serving via Fine-grained Sequence Parallelism
by: Li, Cong, et al.
Published: (2025)

Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler
by: Zheng, Size, et al.
Published: (2025)

MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production
by: Xue, Chunyu, et al.
Published: (2026)

Utility-Driven Speculative Decoding for Mixture-of-Experts
by: Saxena, Anish, et al.
Published: (2025)