| Main Authors: | Figliolia, Tomas, Alonso, Nicholas, Iyer, Rishi, Anthony, Quentin, Millidge, Beren |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2510.04476 |
Similar Items
Toward Conversational Agents with Context and Time Sensitive Long-term Memory
by: Alonso, Nick, et al.
Published: (2024)
Online Vector Quantized Attention
by: Alonso, Nick, et al.
Published: (2026)
Zyda-2: a 5 Trillion Token High-Quality Dataset
by: Tokpanov, Yury, et al.
Published: (2024)
BlackMamba: Mixture of Experts for State-Space Models
by: Anthony, Quentin, et al.
Published: (2024)
Hybrid Associative Memories
by: Lufkin, Leon, et al.
Published: (2026)
ZAYA1-8B Technical Report
by: Washbourne, Robert, et al.
Published: (2026)
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
by: Shyam, Vasudev, et al.
Published: (2024)
Zyda: A 1.3T Dataset for Open Language Modeling
by: Tokpanov, Yury, et al.
Published: (2024)
Zamba: A Compact 7B SSM Hybrid Model
by: Glorioso, Paolo, et al.
Published: (2024)
Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design
by: Anthony, Quentin, et al.
Published: (2025)
The Zamba2 Suite: Technical Report
by: Glorioso, Paolo, et al.
Published: (2024)
Eigen Attention: Attention in Low-Rank Space for KV Cache Compression
by: Saxena, Utkarsh, et al.
Published: (2024)
LatentLLM: Attention-Aware Joint Tensor Compression
by: Koike-Akino, Toshiaki, et al.
Published: (2025)
PIS: Linking Importance Sampling and Attention Mechanisms for Efficient Prompt Compression
by: Chen, Lizhe, et al.
Published: (2025)
Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification
by: Yun, Jungmin, et al.
Published: (2024)
Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution
by: Devoto, Alessio, et al.
Published: (2025)
Sentinel: Decoding Context Utilization via Attention Probing for Efficient LLM Context Compression
by: Zhang, Yong, et al.
Published: (2025)
Compressing Sequences in the Latent Embedding Space: $K$-Token Merging for Large Language Models
by: Xu, Zihao, et al.
Published: (2026)
When Compression Meets Model Compression: Memory-Efficient Double Compression for Large Language Models
by: Wang, Weilan, et al.
Published: (2025)
Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models
by: Deniz, Omer Faruk, et al.
Published: (2026)
Compressible Softmax-Attended Language under Incompressible Attention
by: Lee, Wonsuk
Published: (2026)
HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs
by: Yang, Dongquan, et al.
Published: (2025)
Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression
by: Hong, Junyuan, et al.
Published: (2024)
Probing the Limits of Compressive Memory: A Study of Infini-Attention in Small-Scale Pretraining
by: Huang, Ruizhe, et al.
Published: (2025)
SurfaceLogicKV: Surface and Logic Attention Behaviors are All You Need for Robust KV Cache Compression
by: Li, Mengjie, et al.
Published: (2025)
Projected Compression: Trainable Projection for Efficient Transformer Compression
by: Stefaniak, Maciej, et al.
Published: (2025)
Depth-Recurrent Attention Mixtures: Giving Latent Reasoning the Attention it Deserves
by: Knupp, Jonas, et al.
Published: (2026)
Latent Multi-Head Attention for Small Language Models
by: Mehta, Sushant, et al.
Published: (2025)
LAWCAT: Efficient Distillation from Quadratic to Linear Attention with Convolution across Tokens for Long Context Modeling
by: Liu, Zeyu, et al.
Published: (2025)
LoRA-Squeeze: Simple and Effective Post-Tuning and In-Tuning Compression of LoRA Modules
by: Vulić, Ivan, et al.
Published: (2026)
ZAYA1-VL-8B Technical Report
by: Shapourian, Hassan, et al.
Published: (2026)
EFPC: Towards Efficient and Flexible Prompt Compression
by: Cao, Yun-Hao, et al.
Published: (2025)
SWAN: Sparse Winnowed Attention for Reduced Inference Memory via Decompression-Free KV-Cache Compression
by: S, Santhosh G, et al.
Published: (2025)
SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention
by: Yankun, Hong, et al.
Published: (2025)
ConvD: Attention Enhanced Dynamic Convolutional Embeddings for Knowledge Graph Completion
by: Guo, Wenbin, et al.
Published: (2023)
Taipan: Efficient and Expressive State Space Language Models with Selective Attention
by: Van Nguyen, Chien, et al.
Published: (2024)
RuPLaR : Efficient Latent Compression of LLM Reasoning Chains with Rule-Based Priors From Multi-Step to One-Step
by: Luo, Xiaocheng, et al.
Published: (2026)
Mixture-of-PageRanks: Replacing Long-Context with Real-Time, Sparse GraphRAG
by: Alonso, Nicholas, et al.
Published: (2024)
Efficiently Dispatching Flash Attention For Partially Filled Attention Masks
by: Sharma, Agniv, et al.
Published: (2024)
Dynamic Compressing Prompts for Efficient Inference of Large Language Models
by: Hu, Jinwu, et al.
Published: (2025)