:: Library Catalog

Bibliographic Details
Main Authors:	Wang, Yumeng; Xiao, Zhenyang
Format:	Preprint
Published:	2024
Subjects:	Machine Learning; Computation and Language
Online Access:	https://arxiv.org/abs/2401.09486

Similar Items

Llamazip: Leveraging LLaMA for Lossless Text Compression and Training Dataset Detection
by: Dréano, Sören, et al.
Published: (2025)

Lossless Token Sequence Compression via Meta-Tokens
by: Harvill, John, et al.
Published: (2025)

Trellis: Learning to Compress Key-Value Memory in Attention Models
by: Karami, Mahdi, et al.
Published: (2025)

SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
by: Zhu, Qianchao, et al.
Published: (2024)

Lossless Compression of Large Language Model-Generated Text via Next-Token Prediction
by: Mao, Yu, et al.
Published: (2025)

KeepKV: Achieving Periodic Lossless KV Cache Compression for Efficient LLM Inference
by: Tian, Yuxuan, et al.
Published: (2025)

LoCoCo: Dropping In Convolutions for Long Context Compression
by: Cai, Ruisi, et al.
Published: (2024)

FineZip: Pushing the Limits of Large Language Models for Practical Lossless Text Compression
by: Mittu, Fazal, et al.
Published: (2024)

GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
by: Kang, Hao, et al.
Published: (2024)

Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation
by: Gui, Lujun, et al.
Published: (2024)

RAM-Net: Expressive Linear Attention with Selectively Addressable Memory
by: Xiao, Kaicheng, et al.
Published: (2026)

PMKLC: Parallel Multi-Knowledge Learning-based Lossless Compression for Large-Scale Genomics Database
by: Sun, Hui, et al.
Published: (2025)

Probing the Limits of Compressive Memory: A Study of Infini-Attention in Small-Scale Pretraining
by: Huang, Ruizhe, et al.
Published: (2025)

JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention
by: Tian, Yuandong, et al.
Published: (2023)

Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting
by: Liu, Fangcheng, et al.
Published: (2024)

LoLA: Low-Rank Linear Attention With Sparse Caching
by: McDermott, Luke, et al.
Published: (2025)

S$^3$-Attention: Attention-Aligned Endogenous Retrieval for Memory-Bounded Long-Context Inference
by: Ma, Qingsen, et al.
Published: (2026)

AttentionPredictor: Temporal Patterns Matter for KV Cache Compression
by: Yang, Qingyue, et al.
Published: (2025)

SWAN: Sparse Winnowed Attention for Reduced Inference Memory via Decompression-Free KV-Cache Compression
by: S, Santhosh G, et al.
Published: (2025)

Compressed Context Memory For Online Language Model Interaction
by: Kim, Jang-Hyun, et al.
Published: (2023)

RazorAttention: Efficient KV Cache Compression Through Retrieval Heads
by: Tang, Hanlin, et al.
Published: (2024)

LogLLaMA: Transformer-based log anomaly detection with LLaMA
by: Yang, Zhuoyi, et al.
Published: (2025)

ECHO-LLaMA: Efficient Caching for High-Performance LLaMA Training
by: Dialameh, Maryam, et al.
Published: (2025)

BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models
by: Lin, Feng, et al.
Published: (2024)

CompAct: Compressed Activations for Memory-Efficient LLM Training
by: Shamshoum, Yara, et al.
Published: (2024)

HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs
by: Yang, Dongquan, et al.
Published: (2025)

LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy
by: Zhang, Rongzhi, et al.
Published: (2024)

Sparse Memory Finetuning as a Low-Forgetting Alternative to LoRA and Full Finetuning
by: Gupta, Prakhar, et al.
Published: (2026)

LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models
by: Xi, Haocheng, et al.
Published: (2026)

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
by: Sun, Hanshi, et al.
Published: (2024)

Divide & Bind Your Attention for Improved Generative Semantic Nursing
by: Li, Yumeng, et al.
Published: (2023)

Eigen Attention: Attention in Low-Rank Space for KV Cache Compression
by: Saxena, Utkarsh, et al.
Published: (2024)

Lossless Vocabulary Reduction for Auto-Regressive Language Models
by: Chijiwa, Daiki, et al.
Published: (2025)

Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding
by: Ou, Jie, et al.
Published: (2024)

HyperAdaLoRA: Accelerating LoRA Rank Allocation During Training via Hypernetworks without Sacrificing Performance
by: Zhang, Hao, et al.
Published: (2025)

TensorLLM: Tensorising Multi-Head Attention for Enhanced Reasoning and Compression in LLMs
by: Gu, Yuxuan, et al.
Published: (2025)

Trust Functions: Near-Lossless Weak-to-Strong Generalization by Learning When to Trust the Weak Teacher
by: Uzunoglu, Arda, et al.
Published: (2026)

LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification
by: Yang, Penghui, et al.
Published: (2025)

Proximity to Losslessly Compressible Parameters
by: Farrugia-Roberts, Matthew
Published: (2023)

2-Tier SimCSE: Elevating BERT for Robust Sentence Embeddings
by: Wang, Yumeng, et al.
Published: (2025)