:: Library Catalog

Bibliographic Details
Main Authors:	Cai, Zhengge; Hou, Haowen
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2509.16686

Similar Items

X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression
by: Li, Guihong, et al.
Published: (2025)

SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining
by: Zhang, Yifan, et al.
Published: (2026)

EmbeddingRWKV: State-Centric Retrieval with Reusable States
by: Hou, Haowen, et al.
Published: (2026)

MHA2MLA-VLM: Enabling DeepSeek's Economical Multi-Head Latent Attention across Vision-Language Models
by: Fan, Xiaoran, et al.
Published: (2026)

In-context KV-Cache Eviction for LLMs via Attention-Gate
by: Zeng, Zihao, et al.
Published: (2024)

Interleaved Latent Visual Reasoning with Selective Perceptual Modeling
by: Dong, Shuai, et al.
Published: (2025)

Explicit Multi-head Attention for Inter-head Interaction in Large Language Models
by: Peng, Runyu, et al.
Published: (2026)

Gated Slot Attention for Efficient Linear-Time Sequence Modeling
by: Zhang, Yu, et al.
Published: (2024)

Advancing Sentiment Analysis: A Novel LSTM Framework with Multi-head Attention
by: Yi, Jingyuan, et al.
Published: (2025)

Pair-In, Pair-Out: Latent Multi-Token Prediction for Efficient LLMs
by: Tan, Wenhui, et al.
Published: (2026)

Gated Linear Attention Transformers with Hardware-Efficient Training
by: Yang, Songlin, et al.
Published: (2023)

OpenGuardrails: A Configurable, Unified, and Scalable Guardrails Platform for Large Language Models
by: Wang, Thomas, et al.
Published: (2025)

VisualRWKV-HD and UHD: Advancing High-Resolution Processing for Visual Language Models
by: Li, Zihang, et al.
Published: (2024)

Gated Tree Cross-Attention for Checkpoint-Compatible Syntax Injection in Decoder-Only LLMs
by: Gao, Xinyu, et al.
Published: (2026)

MARRO: Multi-headed Attention for Rhetorical Role Labeling in Legal Documents
by: Bambroo, Purbid, et al.
Published: (2025)

Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space
by: Figliolia, Tomas, et al.
Published: (2025)

Efficient Ternary Weight Embedding Model: Bridging Scalability and Performance
by: Chen, Jiayi, et al.
Published: (2024)

GTA: Grouped-head latenT Attention
by: Sun, Luoyang, et al.
Published: (2025)

Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models
by: Merrick, Luke, et al.
Published: (2024)

Optimizing Native Sparse Attention with Latent Attention and Local Global Alternating Strategies
by: Hu, Yuxuan, et al.
Published: (2025)

CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding
by: Huo, Jiahao, et al.
Published: (2026)

Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs
by: Ji, Tao, et al.
Published: (2025)

Towards Better Multi-head Attention via Channel-wise Sample Permutation
by: Yuan, Shen, et al.
Published: (2024)

Latent Multi-Head Attention for Small Language Models
by: Mehta, Sushant, et al.
Published: (2025)

Do Multilingual LLMs have specialized language heads?
by: Naufil, Muhammad
Published: (2026)

FlashMLA-ETAP: Efficient Transpose Attention Pipeline for Accelerating MLA Inference on NVIDIA H20 GPUs
by: Dege, Pengcuo, et al.
Published: (2025)

Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
by: De, Soham, et al.
Published: (2024)

ReGLA: Refining Gated Linear Attention
by: Lu, Peng, et al.
Published: (2025)

Fast-MIA: Efficient and Scalable Membership Inference for LLMs
by: Takahashi, Hiromu, et al.
Published: (2025)

Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference
by: Qiu, Quantong, et al.
Published: (2026)

Answer-Centric or Reasoning-Driven? Uncovering the Latent Memory Anchor in LLMs
by: Wu, Yang, et al.
Published: (2025)

SR-KI: Scalable and Real-Time Knowledge Integration into LLMs via Supervised Attention
by: Yu, Bohan, et al.
Published: (2025)

GAProtoNet: A Multi-head Graph Attention-based Prototypical Network for Interpretable Text Classification
by: Wen, Ximing, et al.
Published: (2024)

Uncertainty-Aware Attention Heads: Efficient Unsupervised Uncertainty Quantification for LLMs
by: Vazhentsev, Artem, et al.
Published: (2025)

Attn-GS: Attention-Guided Context Compression for Efficient Personalized LLMs
by: Zeng, Shenglai, et al.
Published: (2026)

Scalable Multi-phase Word Embedding Using Conjunctive Propositional Clauses
by: Kadhim, Ahmed K., et al.
Published: (2025)

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
by: Qiu, Zihan, et al.
Published: (2025)

SAGE: A Novelty Gate for Efficient Memory Evolution in Agentic LLMs
by: Wang, Sijia, et al.
Published: (2026)

Efficient Latent Semantic Clustering for Scaling Test-Time Computation of LLMs
by: Lee, Sungjae, et al.
Published: (2025)

Memorization and Knowledge Injection in Gated LLMs
by: Pan, Xu, et al.
Published: (2025)