:: Library Catalog

Bibliographic Details
Main Authors:	Zhong, Wanli; Feng, Haibo; Zhou, Zirui; Peng, Hanyang; Yu, Shiqi
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2511.21513

Similar Items

I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models
by: Hu, Xing, et al.
Published: (2024)

Modality-Aware Zero-Shot Pruning and Sparse Attention for Efficient Multimodal Edge Inference
by: Sui, Yueyuan, et al.
Published: (2026)

Unsupervised Multi-Attention Meta Transformer for Rotating Machinery Fault Diagnosis
by: Wang, Hanyang, et al.
Published: (2025)

AttentionEngine: A Versatile Framework for Efficient Attention Mechanisms on Diverse Hardware Platforms
by: Chen, Feiyang, et al.
Published: (2025)

Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference
by: Qiu, Quantong, et al.
Published: (2026)

SparQ Attention: Bandwidth-Efficient LLM Inference
by: Ribar, Luka, et al.
Published: (2023)

Spatial Conformal Inference through Localized Quantile Regression
by: Jiang, Hanyang, et al.
Published: (2024)

ScoutAttention: Efficient KV Cache Offloading via Layer-Ahead CPU Pre-computation for LLM Inference
by: Zhang, Qiuyang, et al.
Published: (2026)

ENA: Efficient N-dimensional Attention
by: Zhong, Yibo
Published: (2025)

CHAI: Clustered Head Attention for Efficient LLM Inference
by: Agarwal, Saurabh, et al.
Published: (2024)

Efficient Low Rank Attention for Long-Context Inference in Large Language Models
by: Li, Tenghui, et al.
Published: (2025)

FlashMLA-ETAP: Efficient Transpose Attention Pipeline for Accelerating MLA Inference on NVIDIA H20 GPUs
by: Dege, Pengcuo, et al.
Published: (2025)

Hermes: Memory-Efficient Pipeline Inference for Large Models on Edge Devices
by: Han, Xueyuan, et al.
Published: (2024)

Taming the Exponential: A Fast Softmax Surrogate for Integer-Native Edge Inference
by: Danopoulos, Dimitrios, et al.
Published: (2026)

Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction
by: Willette, Jeffrey, et al.
Published: (2025)

CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference
by: Song, Chuxu, et al.
Published: (2026)

AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache
by: Song, Dinghong, et al.
Published: (2025)

Meta-Attention: Bayesian Per-Token Routing for Efficient Transformer Inference
by: Ferrari, Alan
Published: (2026)

FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
by: Lai, Xunhao, et al.
Published: (2025)

INT-FlashAttention: Enabling Flash Attention for INT8 Quantization
by: Chen, Shimao, et al.
Published: (2024)

Kernelized Edge Attention: Addressing Semantic Attention Blurring in Temporal Graph Neural Networks
by: Waghmare, Govind, et al.
Published: (2026)

NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention
by: Zhang, Tianyi, et al.
Published: (2024)

Spatial-Temporal Attention Model for Traffic State Estimation with Sparse Internet of Vehicles
by: Xue, Jianzhe, et al.
Published: (2024)

SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration
by: Zhang, Jintao, et al.
Published: (2024)

HGCA: Hybrid GPU-CPU Attention for Long Context LLM Inference
by: Deng, Weishu, et al.
Published: (2025)

Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers
by: Norgren, Victor
Published: (2026)

DistrAttention: An Efficient and Flexible Self-Attention Mechanism on Modern GPUs
by: Jin, Haolin, et al.
Published: (2025)

Federated Attention: A Distributed Paradigm for Collaborative LLM Inference over Edge Networks
by: Deng, Xiumei, et al.
Published: (2025)

Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving
by: Zhou, Qihui, et al.
Published: (2025)

HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference
by: Gong, Ping, et al.
Published: (2025)

Time-Aware Attention for Enhanced Electronic Health Records Modeling
by: Yu, Junhan, et al.
Published: (2025)

ASAP: Attention-Shift-Aware Pruning for Efficient LVLM Inference
by: Pathak, Surendra, et al.
Published: (2026)

Star Attention: Efficient LLM Inference over Long Sequences
by: Acharya, Shantanu, et al.
Published: (2024)

Symphony-MoE: Harmonizing Disparate Pre-trained Models into a Coherent Mixture-of-Experts
by: Wang, Qi, et al.
Published: (2025)

Edge Attention Module for Object Classification
by: Roy, Santanu, et al.
Published: (2025)

Conv-Basis: A New Paradigm for Efficient Attention Inference and Gradient Computation in Transformers
by: Liang, Yingyu, et al.
Published: (2024)

Linear Attention Sequence Parallelism
by: Sun, Weigao, et al.
Published: (2024)

HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism
by: Zhang, Geng, et al.
Published: (2025)

SSVEP-BiMA: Bifocal Masking Attention Leveraging Native and Symmetric-Antisymmetric Components for Robust SSVEP Decoding
by: Liu, Yuxin, et al.
Published: (2025)

TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill and Decode Inference
by: Tang, Xiaojuan, et al.
Published: (2025)