| Main Authors: | Zhong, Wanli, Feng, Haibo, Zhou, Zirui, Peng, Hanyang, Yu, Shiqi |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2511.21513 |
Similar Items
I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models
by: Hu, Xing, et al.
Published: (2024)
Modality-Aware Zero-Shot Pruning and Sparse Attention for Efficient Multimodal Edge Inference
by: Sui, Yueyuan, et al.
Published: (2026)
Unsupervised Multi-Attention Meta Transformer for Rotating Machinery Fault Diagnosis
by: Wang, Hanyang, et al.
Published: (2025)
AttentionEngine: A Versatile Framework for Efficient Attention Mechanisms on Diverse Hardware Platforms
by: Chen, Feiyang, et al.
Published: (2025)
Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference
by: Qiu, Quantong, et al.
Published: (2026)
SparQ Attention: Bandwidth-Efficient LLM Inference
by: Ribar, Luka, et al.
Published: (2023)
Spatial Conformal Inference through Localized Quantile Regression
by: Jiang, Hanyang, et al.
Published: (2024)
ScoutAttention: Efficient KV Cache Offloading via Layer-Ahead CPU Pre-computation for LLM Inference
by: Zhang, Qiuyang, et al.
Published: (2026)
ENA: Efficient N-dimensional Attention
by: Zhong, Yibo
Published: (2025)
CHAI: Clustered Head Attention for Efficient LLM Inference
by: Agarwal, Saurabh, et al.
Published: (2024)
Efficient Low Rank Attention for Long-Context Inference in Large Language Models
by: Li, Tenghui, et al.
Published: (2025)
FlashMLA-ETAP: Efficient Transpose Attention Pipeline for Accelerating MLA Inference on NVIDIA H20 GPUs
by: Dege, Pengcuo, et al.
Published: (2025)
Hermes: Memory-Efficient Pipeline Inference for Large Models on Edge Devices
by: Han, Xueyuan, et al.
Published: (2024)
Taming the Exponential: A Fast Softmax Surrogate for Integer-Native Edge Inference
by: Danopoulos, Dimitrios, et al.
Published: (2026)
Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction
by: Willette, Jeffrey, et al.
Published: (2025)
CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference
by: Song, Chuxu, et al.
Published: (2026)
AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache
by: Song, Dinghong, et al.
Published: (2025)
Meta-Attention: Bayesian Per-Token Routing for Efficient Transformer Inference
by: Ferrari, Alan
Published: (2026)
FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
by: Lai, Xunhao, et al.
Published: (2025)
INT-FlashAttention: Enabling Flash Attention for INT8 Quantization
by: Chen, Shimao, et al.
Published: (2024)
Kernelized Edge Attention: Addressing Semantic Attention Blurring in Temporal Graph Neural Networks
by: Waghmare, Govind, et al.
Published: (2026)
NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention
by: Zhang, Tianyi, et al.
Published: (2024)
Spatial-Temporal Attention Model for Traffic State Estimation with Sparse Internet of Vehicles
by: Xue, Jianzhe, et al.
Published: (2024)
SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration
by: Zhang, Jintao, et al.
Published: (2024)
HGCA: Hybrid GPU-CPU Attention for Long Context LLM Inference
by: Deng, Weishu, et al.
Published: (2025)
Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers
by: Norgren, Victor
Published: (2026)
DistrAttention: An Efficient and Flexible Self-Attention Mechanism on Modern GPUs
by: Jin, Haolin, et al.
Published: (2025)
Federated Attention: A Distributed Paradigm for Collaborative LLM Inference over Edge Networks
by: Deng, Xiumei, et al.
Published: (2025)
Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving
by: Zhou, Qihui, et al.
Published: (2025)
HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference
by: Gong, Ping, et al.
Published: (2025)
Time-Aware Attention for Enhanced Electronic Health Records Modeling
by: Yu, Junhan, et al.
Published: (2025)
ASAP: Attention-Shift-Aware Pruning for Efficient LVLM Inference
by: Pathak, Surendra, et al.
Published: (2026)
Star Attention: Efficient LLM Inference over Long Sequences
by: Acharya, Shantanu, et al.
Published: (2024)
Symphony-MoE: Harmonizing Disparate Pre-trained Models into a Coherent Mixture-of-Experts
by: Wang, Qi, et al.
Published: (2025)
Edge Attention Module for Object Classification
by: Roy, Santanu, et al.
Published: (2025)
Conv-Basis: A New Paradigm for Efficient Attention Inference and Gradient Computation in Transformers
by: Liang, Yingyu, et al.
Published: (2024)
Linear Attention Sequence Parallelism
by: Sun, Weigao, et al.
Published: (2024)
HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism
by: Zhang, Geng, et al.
Published: (2025)
SSVEP-BiMA: Bifocal Masking Attention Leveraging Native and Symmetric-Antisymmetric Components for Robust SSVEP Decoding
by: Liu, Yuxin, et al.
Published: (2025)
TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill and Decode Inference
by: Tang, Xiaojuan, et al.
Published: (2025)