Bibliographic Details
Main Authors:	Ankner, Zachary; Parthasarathy, Rishab; Nrusimha, Aniruddha; Rinard, Christopher; Ragan-Kelley, Jonathan; Brandon, William
Format:	Preprint
Published:	2024
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2402.05109

Similar Items

Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
by: Brandon, William, et al.
Published: (2024)

FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference
by: Nrusimha, Aniruddha, et al.
Published: (2025)

Vid3D: Synthesis of Dynamic 3D Scenes using 2D Video Diffusion
by: Parthasarathy, Rishab, et al.
Published: (2024)

Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding
by: Jin, Tian, et al.
Published: (2025)

Critique-out-Loud Reward Models
by: Ankner, Zachary, et al.
Published: (2024)

A Novel Recurrent Neural Network Framework for Prediction and Treatment of Oncogenic Mutation Progression
by: Parthasarathy, Rishab, et al.
Published: (2025)

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
by: Cai, Tianle, et al.
Published: (2024)

Mitigating the Impact of Outlier Channels for Language Model Quantization with Activation Regularization
by: Nrusimha, Aniruddha, et al.
Published: (2024)

Fast Matrix Multiplications for Lookup Table-Quantized LLMs
by: Guo, Han, et al.
Published: (2024)

Towards Verifiable Text Generation with Symbolic References
by: Hennigen, Lucas Torroba, et al.
Published: (2023)

SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding
by: Plaksin, Anton, et al.
Published: (2026)

Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
by: Ankner, Zachary, et al.
Published: (2024)

HydraViT: Stacking Heads for a Scalable ViT
by: Haberer, Janek, et al.
Published: (2024)

FastEagle: Cascaded Drafting for Accelerating Speculative Decoding
by: Huang, Haiduo, et al.
Published: (2025)

Ladder-residual: parallelism-aware architecture for accelerating large model inference with communication overlapping
by: Zhang, Muru, et al.
Published: (2025)

Whisper in Medusa's Ear: Multi-head Efficient Decoding for Transformer-based ASR
by: Segal-Feldman, Yael, et al.
Published: (2024)

Emergent Representations of Program Semantics in Language Models Trained on Programs
by: Jin, Charles, et al.
Published: (2023)

Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
by: Fu, Yichao, et al.
Published: (2024)

Training Domain Draft Models for Speculative Decoding: Best Practices and Insights
by: Hong, Fenglu, et al.
Published: (2025)

Draft, Verify, and Improve: Toward Training-Aware Speculative Decoding
by: Bhansali, Shrenik, et al.
Published: (2025)

Scaling Laws for Precision
by: Kumar, Tanishq, et al.
Published: (2024)

SuperUROP: An FPGA-Based Spatial Accelerator for Sparse Matrix Operations
by: Parthasarathy, Rishab
Published: (2025)

Exploring and Improving Drafts in Blockwise Parallel Decoding
by: Kim, Taehyeon, et al.
Published: (2024)

Direct Alignment of Draft Model for Speculative Decoding with Chat-Fine-Tuned LLMs
by: Goel, Raghavv, et al.
Published: (2024)

FlashOptim: Optimizers for Memory-Efficient Training
by: Ortiz, Jose Javier Gonzalez, et al.
Published: (2026)

When Drafts Evolve: Speculative Decoding Meets Online Learning
by: Qian, Yu-Yang, et al.
Published: (2026)

Draft-Conditioned Constrained Decoding for Structured Generation in LLMs
by: Reddy, Avinash, et al.
Published: (2026)

DREAM-S: Speculative Decoding with Searchable Drafting and Target-Aware Refinement for Multimodal Generation
by: Liu, Zining, et al.
Published: (2026)

Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding
by: Shen, Yuhao, et al.
Published: (2026)

POSS: Position Specialist Generates Better Draft for Speculative Decoding
by: Huang, Langlin, et al.
Published: (2025)

TABED: Test-Time Adaptive Ensemble Drafting for Robust Speculative Decoding in LVLMs
by: Lee, Minjae, et al.
Published: (2026)

Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration
by: Wen, Zhuofan, et al.
Published: (2024)

ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts
by: Georganas, Evangelos, et al.
Published: (2025)

Balancing Coverage and Draft Latency in Vocabulary Trimming for Faster Speculative Decoding
by: Shoham, Ofir Ben
Published: (2026)

BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding
by: He, Liang, et al.
Published: (2026)

OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding
by: Ramakrishnan, Ramchalam Kinattinkara, et al.
Published: (2025)

Free Draft-and-Verification: Toward Lossless Parallel Decoding for Diffusion Large Language Models
by: Wu, Shutong, et al.
Published: (2025)

Evaluating the Generalization Capabilities of Large Language Models on Code Reasoning
by: Yang, Rem, et al.
Published: (2025)

LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification
by: Yang, Penghui, et al.
Published: (2025)

MineDraft: A Framework for Batch Parallel Speculative Decoding
by: Tang, Zhenwei, et al.
Published: (2026)