| Main Authors: | Cai, Zhengge; Hou, Haowen |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.16686 |
Similar Items
X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression
by: Li, Guihong, et al.
Published: (2025)
SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining
by: Zhang, Yifan, et al.
Published: (2026)
EmbeddingRWKV: State-Centric Retrieval with Reusable States
by: Hou, Haowen, et al.
Published: (2026)
MHA2MLA-VLM: Enabling DeepSeek's Economical Multi-Head Latent Attention across Vision-Language Models
by: Fan, Xiaoran, et al.
Published: (2026)
In-context KV-Cache Eviction for LLMs via Attention-Gate
by: Zeng, Zihao, et al.
Published: (2024)
Interleaved Latent Visual Reasoning with Selective Perceptual Modeling
by: Dong, Shuai, et al.
Published: (2025)
Explicit Multi-head Attention for Inter-head Interaction in Large Language Models
by: Peng, Runyu, et al.
Published: (2026)
Gated Slot Attention for Efficient Linear-Time Sequence Modeling
by: Zhang, Yu, et al.
Published: (2024)
Advancing Sentiment Analysis: A Novel LSTM Framework with Multi-head Attention
by: Yi, Jingyuan, et al.
Published: (2025)
Pair-In, Pair-Out: Latent Multi-Token Prediction for Efficient LLMs
by: Tan, Wenhui, et al.
Published: (2026)
Gated Linear Attention Transformers with Hardware-Efficient Training
by: Yang, Songlin, et al.
Published: (2023)
OpenGuardrails: A Configurable, Unified, and Scalable Guardrails Platform for Large Language Models
by: Wang, Thomas, et al.
Published: (2025)
VisualRWKV-HD and UHD: Advancing High-Resolution Processing for Visual Language Models
by: Li, Zihang, et al.
Published: (2024)
Gated Tree Cross-Attention for Checkpoint-Compatible Syntax Injection in Decoder-Only LLMs
by: Gao, Xinyu, et al.
Published: (2026)
MARRO: Multi-headed Attention for Rhetorical Role Labeling in Legal Documents
by: Bambroo, Purbid, et al.
Published: (2025)
Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space
by: Figliolia, Tomas, et al.
Published: (2025)
Efficient Ternary Weight Embedding Model: Bridging Scalability and Performance
by: Chen, Jiayi, et al.
Published: (2024)
GTA: Grouped-head latenT Attention
by: Sun, Luoyang, et al.
Published: (2025)
Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models
by: Merrick, Luke, et al.
Published: (2024)
Optimizing Native Sparse Attention with Latent Attention and Local Global Alternating Strategies
by: Hu, Yuxuan, et al.
Published: (2025)
CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding
by: Huo, Jiahao, et al.
Published: (2026)
Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs
by: Ji, Tao, et al.
Published: (2025)
Towards Better Multi-head Attention via Channel-wise Sample Permutation
by: Yuan, Shen, et al.
Published: (2024)
Latent Multi-Head Attention for Small Language Models
by: Mehta, Sushant, et al.
Published: (2025)
Do Multilingual LLMs have specialized language heads?
by: Naufil, Muhammad
Published: (2026)
FlashMLA-ETAP: Efficient Transpose Attention Pipeline for Accelerating MLA Inference on NVIDIA H20 GPUs
by: Dege, Pengcuo, et al.
Published: (2025)
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
by: De, Soham, et al.
Published: (2024)
ReGLA: Refining Gated Linear Attention
by: Lu, Peng, et al.
Published: (2025)
Fast-MIA: Efficient and Scalable Membership Inference for LLMs
by: Takahashi, Hiromu, et al.
Published: (2025)
Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference
by: Qiu, Quantong, et al.
Published: (2026)
Answer-Centric or Reasoning-Driven? Uncovering the Latent Memory Anchor in LLMs
by: Wu, Yang, et al.
Published: (2025)
SR-KI: Scalable and Real-Time Knowledge Integration into LLMs via Supervised Attention
by: Yu, Bohan, et al.
Published: (2025)
GAProtoNet: A Multi-head Graph Attention-based Prototypical Network for Interpretable Text Classification
by: Wen, Ximing, et al.
Published: (2024)
Uncertainty-Aware Attention Heads: Efficient Unsupervised Uncertainty Quantification for LLMs
by: Vazhentsev, Artem, et al.
Published: (2025)
Attn-GS: Attention-Guided Context Compression for Efficient Personalized LLMs
by: Zeng, Shenglai, et al.
Published: (2026)
Scalable Multi-phase Word Embedding Using Conjunctive Propositional Clauses
by: Kadhim, Ahmed K., et al.
Published: (2025)
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
by: Qiu, Zihan, et al.
Published: (2025)
SAGE: A Novelty Gate for Efficient Memory Evolution in Agentic LLMs
by: Wang, Sijia, et al.
Published: (2026)
Efficient Latent Semantic Clustering for Scaling Test-Time Computation of LLMs
by: Lee, Sungjae, et al.
Published: (2025)
Memorization and Knowledge Injection in Gated LLMs
by: Pan, Xu, et al.
Published: (2025)