:: Library Catalog

Bibliographic Details
Main Authors:	Figliolia, Tomas; Alonso, Nicholas; Iyer, Rishi; Anthony, Quentin; Millidge, Beren
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2510.04476

Similar Items

Toward Conversational Agents with Context and Time Sensitive Long-term Memory
by: Alonso, Nick, et al.
Published: (2024)

Online Vector Quantized Attention
by: Alonso, Nick, et al.
Published: (2026)

Zyda-2: a 5 Trillion Token High-Quality Dataset
by: Tokpanov, Yury, et al.
Published: (2024)

BlackMamba: Mixture of Experts for State-Space Models
by: Anthony, Quentin, et al.
Published: (2024)

Hybrid Associative Memories
by: Lufkin, Leon, et al.
Published: (2026)

ZAYA1-8B Technical Report
by: Washbourne, Robert, et al.
Published: (2026)

Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
by: Shyam, Vasudev, et al.
Published: (2024)

Zyda: A 1.3T Dataset for Open Language Modeling
by: Tokpanov, Yury, et al.
Published: (2024)

Zamba: A Compact 7B SSM Hybrid Model
by: Glorioso, Paolo, et al.
Published: (2024)

Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design
by: Anthony, Quentin, et al.
Published: (2025)

The Zamba2 Suite: Technical Report
by: Glorioso, Paolo, et al.
Published: (2024)

Eigen Attention: Attention in Low-Rank Space for KV Cache Compression
by: Saxena, Utkarsh, et al.
Published: (2024)

LatentLLM: Attention-Aware Joint Tensor Compression
by: Koike-Akino, Toshiaki, et al.
Published: (2025)

PIS: Linking Importance Sampling and Attention Mechanisms for Efficient Prompt Compression
by: Chen, Lizhe, et al.
Published: (2025)

Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification
by: Yun, Jungmin, et al.
Published: (2024)

Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution
by: Devoto, Alessio, et al.
Published: (2025)

Sentinel: Decoding Context Utilization via Attention Probing for Efficient LLM Context Compression
by: Zhang, Yong, et al.
Published: (2025)

Compressing Sequences in the Latent Embedding Space: $K$-Token Merging for Large Language Models
by: Xu, Zihao, et al.
Published: (2026)

When Compression Meets Model Compression: Memory-Efficient Double Compression for Large Language Models
by: Wang, Weilan, et al.
Published: (2025)

Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models
by: Deniz, Omer Faruk, et al.
Published: (2026)

Compressible Softmax-Attended Language under Incompressible Attention
by: Lee, Wonsuk
Published: (2026)

HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs
by: Yang, Dongquan, et al.
Published: (2025)

Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression
by: Hong, Junyuan, et al.
Published: (2024)

Probing the Limits of Compressive Memory: A Study of Infini-Attention in Small-Scale Pretraining
by: Huang, Ruizhe, et al.
Published: (2025)

SurfaceLogicKV: Surface and Logic Attention Behaviors are All You Need for Robust KV Cache Compression
by: Li, Mengjie, et al.
Published: (2025)

Projected Compression: Trainable Projection for Efficient Transformer Compression
by: Stefaniak, Maciej, et al.
Published: (2025)

Depth-Recurrent Attention Mixtures: Giving Latent Reasoning the Attention it Deserves
by: Knupp, Jonas, et al.
Published: (2026)

Latent Multi-Head Attention for Small Language Models
by: Mehta, Sushant, et al.
Published: (2025)

LAWCAT: Efficient Distillation from Quadratic to Linear Attention with Convolution across Tokens for Long Context Modeling
by: Liu, Zeyu, et al.
Published: (2025)

LoRA-Squeeze: Simple and Effective Post-Tuning and In-Tuning Compression of LoRA Modules
by: Vulić, Ivan, et al.
Published: (2026)

ZAYA1-VL-8B Technical Report
by: Shapourian, Hassan, et al.
Published: (2026)

EFPC: Towards Efficient and Flexible Prompt Compression
by: Cao, Yun-Hao, et al.
Published: (2025)

SWAN: Sparse Winnowed Attention for Reduced Inference Memory via Decompression-Free KV-Cache Compression
by: S, Santhosh G, et al.
Published: (2025)

SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention
by: Yankun, Hong, et al.
Published: (2025)

ConvD: Attention Enhanced Dynamic Convolutional Embeddings for Knowledge Graph Completion
by: Guo, Wenbin, et al.
Published: (2023)

Taipan: Efficient and Expressive State Space Language Models with Selective Attention
by: Van Nguyen, Chien, et al.
Published: (2024)

RuPLaR: Efficient Latent Compression of LLM Reasoning Chains with Rule-Based Priors From Multi-Step to One-Step
by: Luo, Xiaocheng, et al.
Published: (2026)

Mixture-of-PageRanks: Replacing Long-Context with Real-Time, Sparse GraphRAG
by: Alonso, Nicholas, et al.
Published: (2024)

Efficiently Dispatching Flash Attention For Partially Filled Attention Masks
by: Sharma, Agniv, et al.
Published: (2024)

Dynamic Compressing Prompts for Efficient Inference of Large Language Models
by: Hu, Jinwu, et al.
Published: (2025)