:: Library Catalog

Bibliographic Details
Main Authors:	Ribar, Luka; Chelombiev, Ivan; Hudlass-Galley, Luke; Blake, Charlie; Luschi, Carlo; Orr, Douglas
Format:	Preprint
Published:	2023
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2312.04985

Similar Items

Approximate Top-$k$ for Increased Parallelism
by: Key, Oscar, et al.
Published: (2024)

Optimal Formats for Weight Quantisation
by: Orr, Douglas, et al.
Published: (2025)

u-$μ$P: The Unit-Scaled Maximal Update Parametrization
by: Blake, Charlie, et al.
Published: (2024)

Elucidating the Design Space of FP4 training
by: Hu, Robert, et al.
Published: (2025)

Scalify: scale propagation for efficient low-precision LLM training
by: Balança, Paul, et al.
Published: (2024)

FedSparQ: Adaptive Sparse Quantization with Error Feedback for Robust & Efficient Federated Learning
by: Medjadji, Chaimaa, et al.
Published: (2025)

Ground-Truth Subgraphs for Better Training and Evaluation of Knowledge Graph Augmented LLMs
by: Cattaneo, Alberto, et al.
Published: (2025)

MXNorm: Reusing MXFP block scales for efficient tensor normalisation
by: McLean, Callum, et al.
Published: (2026)

SparDL: Distributed Deep Learning Training with Efficient Sparse Communication
by: Zhao, Minjun, et al.
Published: (2023)

Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU
by: Jiang, Jevin, et al.
Published: (2026)

CHAI: Clustered Head Attention for Efficient LLM Inference
by: Agarwal, Saurabh, et al.
Published: (2024)

UltRAG: a Universal Simple Scalable Recipe for Knowledge Graph RAG
by: Georgiev, Dobrik, et al.
Published: (2026)

Star Attention: Efficient LLM Inference over Long Sequences
by: Acharya, Shantanu, et al.
Published: (2024)

Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
by: Ma, Xuezhe, et al.
Published: (2024)

The Role of Graph Topology in the Performance of Biomedical Knowledge Graph Completion Models
by: Cattaneo, Alberto, et al.
Published: (2024)

NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention
by: Zhang, Tianyi, et al.
Published: (2024)

Reducing the Cost of Quantum Chemical Data By Backpropagating Through Density Functional Theory
by: Mathiasen, Alexander, et al.
Published: (2024)

Scout Before You Attend: Sketch-and-Walk Sparse Attention for Efficient LLM Inference
by: Le, Hoang Anh Duy, et al.
Published: (2026)

1-Bit Wonder: Improving QAT Performance in the Low-Bit Regime through K-Means Quantization
by: Maskey, Sohir, et al.
Published: (2026)

PackInfer: Compute- and I/O-Efficient Attention for Batched LLM Inference
by: Ning, Rui, et al.
Published: (2026)

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
by: Ye, Zihao, et al.
Published: (2025)

IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference
by: Zhong, Wanli, et al.
Published: (2025)

ScoutAttention: Efficient KV Cache Offloading via Layer-Ahead CPU Pre-computation for LLM Inference
by: Zhang, Qiuyang, et al.
Published: (2026)

SparAMX: Accelerating Compressed LLMs Token Generation on AMX-powered CPUs
by: AbouElhamayed, Ahmed F., et al.
Published: (2025)

CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference
by: Song, Chuxu, et al.
Published: (2026)

Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference
by: Qiu, Quantong, et al.
Published: (2026)

SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning
by: Liao, Huanxuan, et al.
Published: (2025)

HGCA: Hybrid GPU-CPU Attention for Long Context LLM Inference
by: Deng, Weishu, et al.
Published: (2025)

AttenMIA: LLM Membership Inference Attack through Attention Signals
by: Zaree, Pedram, et al.
Published: (2026)

LLM in a flash: Efficient Large Language Model Inference with Limited Memory
by: Alizadeh, Keivan, et al.
Published: (2023)

Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights
by: Parashar, Shubham, et al.
Published: (2025)

AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache
by: Song, Dinghong, et al.
Published: (2025)

Bandwidth-Efficient Adaptive Mixture-of-Experts via Low-Rank Compensation
by: Liu, Zhenyu, et al.
Published: (2025)

SkillGen: Verified Inference-Time Agent Skill Synthesis
by: Ma, Yuchen, et al.
Published: (2026)

Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
by: Rodionov, Gleb, et al.
Published: (2025)

Building Better Datasets: Seven Recommendations for Responsible Design from Dataset Creators
by: Orr, Will, et al.
Published: (2024)

Meta-Attention: Bayesian Per-Token Routing for Efficient Transformer Inference
by: Ferrari, Alan
Published: (2026)

MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference
by: Rhee, Myunghyun, et al.
Published: (2025)

WebLLM: A High-Performance In-Browser LLM Inference Engine
by: Ruan, Charlie F., et al.
Published: (2024)

SparVAR: Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration
by: Li, Zekun, et al.
Published: (2026)