| Main Authors: | Ribar, Luka, Chelombiev, Ivan, Hudlass-Galley, Luke, Blake, Charlie, Luschi, Carlo, Orr, Douglas |
|---|---|
| Format: | Preprint |
| Published: | 2023 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2312.04985 |
Similar Items
Approximate Top-$k$ for Increased Parallelism
by: Key, Oscar, et al.
Published: (2024)
Optimal Formats for Weight Quantisation
by: Orr, Douglas, et al.
Published: (2025)
u-$μ$P: The Unit-Scaled Maximal Update Parametrization
by: Blake, Charlie, et al.
Published: (2024)
Elucidating the Design Space of FP4 training
by: Hu, Robert, et al.
Published: (2025)
Scalify: scale propagation for efficient low-precision LLM training
by: Balança, Paul, et al.
Published: (2024)
FedSparQ: Adaptive Sparse Quantization with Error Feedback for Robust & Efficient Federated Learning
by: Medjadji, Chaimaa, et al.
Published: (2025)
Ground-Truth Subgraphs for Better Training and Evaluation of Knowledge Graph Augmented LLMs
by: Cattaneo, Alberto, et al.
Published: (2025)
MXNorm: Reusing MXFP block scales for efficient tensor normalisation
by: McLean, Callum, et al.
Published: (2026)
SparDL: Distributed Deep Learning Training with Efficient Sparse Communication
by: Zhao, Minjun, et al.
Published: (2023)
Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU
by: Jiang, Jevin, et al.
Published: (2026)
CHAI: Clustered Head Attention for Efficient LLM Inference
by: Agarwal, Saurabh, et al.
Published: (2024)
UltRAG: a Universal Simple Scalable Recipe for Knowledge Graph RAG
by: Georgiev, Dobrik, et al.
Published: (2026)
Star Attention: Efficient LLM Inference over Long Sequences
by: Acharya, Shantanu, et al.
Published: (2024)
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
by: Ma, Xuezhe, et al.
Published: (2024)
The Role of Graph Topology in the Performance of Biomedical Knowledge Graph Completion Models
by: Cattaneo, Alberto, et al.
Published: (2024)
NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention
by: Zhang, Tianyi, et al.
Published: (2024)
Reducing the Cost of Quantum Chemical Data By Backpropagating Through Density Functional Theory
by: Mathiasen, Alexander, et al.
Published: (2024)
Scout Before You Attend: Sketch-and-Walk Sparse Attention for Efficient LLM Inference
by: Le, Hoang Anh Duy, et al.
Published: (2026)
1-Bit Wonder: Improving QAT Performance in the Low-Bit Regime through K-Means Quantization
by: Maskey, Sohir, et al.
Published: (2026)
PackInfer: Compute- and I/O-Efficient Attention for Batched LLM Inference
by: Ning, Rui, et al.
Published: (2026)
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
by: Ye, Zihao, et al.
Published: (2025)
IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference
by: Zhong, Wanli, et al.
Published: (2025)
ScoutAttention: Efficient KV Cache Offloading via Layer-Ahead CPU Pre-computation for LLM Inference
by: Zhang, Qiuyang, et al.
Published: (2026)
SparAMX: Accelerating Compressed LLMs Token Generation on AMX-powered CPUs
by: AbouElhamayed, Ahmed F., et al.
Published: (2025)
CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference
by: Song, Chuxu, et al.
Published: (2026)
Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference
by: Qiu, Quantong, et al.
Published: (2026)
SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning
by: Liao, Huanxuan, et al.
Published: (2025)
HGCA: Hybrid GPU-CPU Attention for Long Context LLM Inference
by: Deng, Weishu, et al.
Published: (2025)
AttenMIA: LLM Membership Inference Attack through Attention Signals
by: Zaree, Pedram, et al.
Published: (2026)
LLM in a flash: Efficient Large Language Model Inference with Limited Memory
by: Alizadeh, Keivan, et al.
Published: (2023)
Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights
by: Parashar, Shubham, et al.
Published: (2025)
AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache
by: Song, Dinghong, et al.
Published: (2025)
Bandwidth-Efficient Adaptive Mixture-of-Experts via Low-Rank Compensation
by: Liu, Zhenyu, et al.
Published: (2025)
SkillGen: Verified Inference-Time Agent Skill Synthesis
by: Ma, Yuchen, et al.
Published: (2026)
Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
by: Rodionov, Gleb, et al.
Published: (2025)
Building Better Datasets: Seven Recommendations for Responsible Design from Dataset Creators
by: Orr, Will, et al.
Published: (2024)
Meta-Attention: Bayesian Per-Token Routing for Efficient Transformer Inference
by: Ferrari, Alan
Published: (2026)
MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference
by: Rhee, Myunghyun, et al.
Published: (2025)
WebLLM: A High-Performance In-Browser LLM Inference Engine
by: Ruan, Charlie F., et al.
Published: (2024)
SparVAR: Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration
by: Li, Zekun, et al.
Published: (2026)