| Main Authors: | Ankner, Zachary, Parthasarathy, Rishab, Nrusimha, Aniruddha, Rinard, Christopher, Ragan-Kelley, Jonathan, Brandon, William |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2402.05109 |
Similar Items
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
by: Brandon, William, et al.
Published: (2024)
FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference
by: Nrusimha, Aniruddha, et al.
Published: (2025)
Vid3D: Synthesis of Dynamic 3D Scenes using 2D Video Diffusion
by: Parthasarathy, Rishab, et al.
Published: (2024)
Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding
by: Jin, Tian, et al.
Published: (2025)
Critique-out-Loud Reward Models
by: Ankner, Zachary, et al.
Published: (2024)
A Novel Recurrent Neural Network Framework for Prediction and Treatment of Oncogenic Mutation Progression
by: Parthasarathy, Rishab, et al.
Published: (2025)
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
by: Cai, Tianle, et al.
Published: (2024)
Mitigating the Impact of Outlier Channels for Language Model Quantization with Activation Regularization
by: Nrusimha, Aniruddha, et al.
Published: (2024)
Fast Matrix Multiplications for Lookup Table-Quantized LLMs
by: Guo, Han, et al.
Published: (2024)
Towards Verifiable Text Generation with Symbolic References
by: Hennigen, Lucas Torroba, et al.
Published: (2023)
SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding
by: Plaksin, Anton, et al.
Published: (2026)
Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
by: Ankner, Zachary, et al.
Published: (2024)
HydraViT: Stacking Heads for a Scalable ViT
by: Haberer, Janek, et al.
Published: (2024)
FastEagle: Cascaded Drafting for Accelerating Speculative Decoding
by: Huang, Haiduo, et al.
Published: (2025)
Ladder-residual: parallelism-aware architecture for accelerating large model inference with communication overlapping
by: Zhang, Muru, et al.
Published: (2025)
Whisper in Medusa's Ear: Multi-head Efficient Decoding for Transformer-based ASR
by: Segal-Feldman, Yael, et al.
Published: (2024)
Emergent Representations of Program Semantics in Language Models Trained on Programs
by: Jin, Charles, et al.
Published: (2023)
Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
by: Fu, Yichao, et al.
Published: (2024)
Training Domain Draft Models for Speculative Decoding: Best Practices and Insights
by: Hong, Fenglu, et al.
Published: (2025)
Draft, Verify, and Improve: Toward Training-Aware Speculative Decoding
by: Bhansali, Shrenik, et al.
Published: (2025)
Scaling Laws for Precision
by: Kumar, Tanishq, et al.
Published: (2024)
SuperUROP: An FPGA-Based Spatial Accelerator for Sparse Matrix Operations
by: Parthasarathy, Rishab
Published: (2025)
Exploring and Improving Drafts in Blockwise Parallel Decoding
by: Kim, Taehyeon, et al.
Published: (2024)
Direct Alignment of Draft Model for Speculative Decoding with Chat-Fine-Tuned LLMs
by: Goel, Raghavv, et al.
Published: (2024)
FlashOptim: Optimizers for Memory-Efficient Training
by: Ortiz, Jose Javier Gonzalez, et al.
Published: (2026)
When Drafts Evolve: Speculative Decoding Meets Online Learning
by: Qian, Yu-Yang, et al.
Published: (2026)
Draft-Conditioned Constrained Decoding for Structured Generation in LLMs
by: Reddy, Avinash, et al.
Published: (2026)
DREAM-S: Speculative Decoding with Searchable Drafting and Target-Aware Refinement for Multimodal Generation
by: Liu, Zining, et al.
Published: (2026)
Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding
by: Shen, Yuhao, et al.
Published: (2026)
POSS: Position Specialist Generates Better Draft for Speculative Decoding
by: Huang, Langlin, et al.
Published: (2025)
TABED: Test-Time Adaptive Ensemble Drafting for Robust Speculative Decoding in LVLMs
by: Lee, Minjae, et al.
Published: (2026)
Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration
by: Wen, Zhuofan, et al.
Published: (2024)
ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts
by: Georganas, Evangelos, et al.
Published: (2025)
Balancing Coverage and Draft Latency in Vocabulary Trimming for Faster Speculative Decoding
by: Shoham, Ofir Ben
Published: (2026)
BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding
by: He, Liang, et al.
Published: (2026)
OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding
by: Ramakrishnan, Ramchalam Kinattinkara, et al.
Published: (2025)
Free Draft-and-Verification: Toward Lossless Parallel Decoding for Diffusion Large Language Models
by: Wu, Shutong, et al.
Published: (2025)
Evaluating the Generalization Capabilities of Large Language Models on Code Reasoning
by: Yang, Rem, et al.
Published: (2025)
LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification
by: Yang, Penghui, et al.
Published: (2025)
MineDraft: A Framework for Batch Parallel Speculative Decoding
by: Tang, Zhenwei, et al.
Published: (2026)