| Main Authors: | Chelba, Ciprian; Chen, Mia; Bapna, Ankur; Shazeer, Noam |
|---|---|
| Format: | Preprint |
| Published: | 2020 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2001.04589 |
Similar Items
Coupling Speech Encoders with Downstream Text Models
by: Chelba, Ciprian, et al.
Published: (2024)
Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding
by: Ou, Jie, et al.
Published: (2024)
Understanding Transformers via N-gram Statistics
by: Nguyen, Timothy
Published: (2024)
Self-Speculative Biased Decoding for Faster Re-Translation
by: Zeng, Linxiao, et al.
Published: (2025)
Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers
by: Lou, Chao, et al.
Published: (2024)
Transformers Can Represent $n$-gram Language Models
by: Svete, Anej, et al.
Published: (2024)
FlashDecoding++: Faster Large Language Model Inference on GPUs
by: Hong, Ke, et al.
Published: (2023)
Faster Cascades via Speculative Decoding
by: Narasimhan, Harikrishna, et al.
Published: (2024)
AdaSplash-2: Faster Differentiable Sparse Attention
by: Gonçalves, Nuno, et al.
Published: (2026)
SEA: Sparse Linear Attention with Estimated Attention Mask
by: Lee, Heejun, et al.
Published: (2023)
Sparser, Faster, Lighter Transformer Language Models
by: Cetin, Edoardo, et al.
Published: (2026)
Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models
by: Sedykh, Ivan, et al.
Published: (2026)
MASSV: Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models
by: Ganesan, Mugilan, et al.
Published: (2025)
Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment
by: Bachmann, Gregor, et al.
Published: (2025)
Is Smaller Always Faster? Tradeoffs in Compressing Self-Supervised Speech Transformers
by: Lin, Tzu-Quan, et al.
Published: (2022)
Entropic-Time Inference: Self-Organizing Large Language Model Decoding Beyond Attention
by: Kiruluta, Andrew
Published: (2026)
Stopping Computation for Converged Tokens in Masked Diffusion-LM Decoding
by: Oba, Daisuke, et al.
Published: (2026)
A2SF: Accumulative Attention Scoring with Forgetting Factor for Token Pruning in Transformer Decoder
by: Jo, Hyun-rae, et al.
Published: (2024)
An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks
by: Boreiko, Valentyn, et al.
Published: (2024)
Optimizing Decoding Paths in Masked Diffusion Models by Quantifying Uncertainty
by: Chen, Ziyu, et al.
Published: (2025)
Hardware-Efficient Attention for Fast Decoding
by: Zadouri, Ted, et al.
Published: (2025)
Enhancing Bangla Language Next Word Prediction and Sentence Completion through Extended RNN with Bi-LSTM Model On N-gram Language
by: Islam, Md Robiul, et al.
Published: (2024)
Balancing Coverage and Draft Latency in Vocabulary Trimming for Faster Speculative Decoding
by: Shoham, Ofir Ben
Published: (2026)
DualDiffusion: A Speculative Decoding Strategy for Masked Diffusion Models
by: Goyal, Satyam, et al.
Published: (2026)
Improving Rare Word Translation With Dictionaries and Attention Masking
by: Sible, Kenneth J., et al.
Published: (2024)
Block Sparse Flash Attention
by: Ohayon, Daniel, et al.
Published: (2025)
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
by: Shyam, Vasudev, et al.
Published: (2024)
TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
by: Yang, Lijie, et al.
Published: (2024)
Trainable Dynamic Mask Sparse Attention
by: Shi, Jingze, et al.
Published: (2025)
Learnable Multi-Scale Wavelet Transformer: A Novel Alternative to Self-Attention
by: Kiruluta, Andrew, et al.
Published: (2025)
Efficiently Dispatching Flash Attention For Partially Filled Attention Masks
by: Sharma, Agniv, et al.
Published: (2024)
EntmaxKV: Support-Aware Decoding for Entmax Attention
by: Duarte, Gonçalo, et al.
Published: (2026)
Mixture of Attentions For Speculative Decoding
by: Zimmer, Matthieu, et al.
Published: (2024)
On The Adaptation of Unlimiformer for Decoder-Only Transformers
by: Ahrabian, Kian, et al.
Published: (2024)
Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More
by: Zhuang, Xialie, et al.
Published: (2025)
Beyond Self Attention: A Subquadratic Fourier Wavelet Transformer with Multi Modal Fusion
by: Kiruluta, Andrew, et al.
Published: (2021)
Masked Hard-Attention Transformers Recognize Exactly the Star-Free Languages
by: Yang, Andy, et al.
Published: (2023)
Exclusive Self Attention
by: Zhai, Shuangfei
Published: (2026)
Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM
by: Fan, Zehao, et al.
Published: (2025)
RecurFormer: Not All Transformer Heads Need Self-Attention
by: Yan, Ruiqing, et al.
Published: (2024)