
Bibliographic Details
Main Authors:	Luo, Cheng; Cai, Zefan; Sun, Hanshi; Xiao, Jinqi; Yuan, Bo; Xiao, Wen; Hu, Junjie; Zhao, Jiawei; Chen, Beidi; Anandkumar, Anima
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2502.12574

Similar Items

MOM: Memory-Efficient Offloaded Mini-Sequence Inference for Long Context Language Models
by: Zhang, Junyang, et al.
Published: (2025)

Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training
by: Luo, Cheng, et al.
Published: (2024)

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
by: Zhao, Jiawei, et al.
Published: (2024)

R-KV: Redundancy-aware KV Cache Compression for Reasoning Models
by: Cai, Zefan, et al.
Published: (2025)

InRank: Incremental Low-Rank Learning
by: Zhao, Jiawei, et al.
Published: (2023)

EcoSpa: Efficient Transformer Training with Coupled Sparsity
by: Xiao, Jinqi, et al.
Published: (2025)

Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning
by: Fu, Yu, et al.
Published: (2024)

Delta Attention Residuals
by: Luo, Cheng, et al.
Published: (2026)

TensorGRaD: Tensor Gradient Robust Decomposition for Memory-Efficient Neural Operator Training
by: Loeschcke, Sebastian, et al.
Published: (2025)

FlexInfer: Breaking Memory Constraint via Flexible and Efficient Offloading for On-Device LLM Inference
by: Du, Hongchao, et al.
Published: (2025)

PerAda: Parameter-Efficient Federated Learning Personalization with Generalization Guarantees
by: Xie, Chulin, et al.
Published: (2023)

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
by: Sun, Hanshi, et al.
Published: (2024)

COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection
by: Xiao, Jinqi, et al.
Published: (2024)

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
by: Xiao, Guangxuan, et al.
Published: (2024)

MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models
by: Zhao, Haozhe, et al.
Published: (2025)

InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference
by: Pan, Xiurui, et al.
Published: (2024)

CHAI: Clustered Head Attention for Efficient LLM Inference
by: Agarwal, Saurabh, et al.
Published: (2024)

Diffusion State-Guided Projected Gradient for Inverse Problems
by: Zirvi, Rayhan, et al.
Published: (2024)

Lean Copilot: Large Language Models as Copilots for Theorem Proving in Lean
by: Song, Peiyang, et al.
Published: (2024)

Calibrated Uncertainty Quantification for Operator Learning via Conformal Prediction
by: Ma, Ziqi, et al.
Published: (2024)

Mechanistic Interpretability with Sparse Autoencoder Neural Operators
by: Tolooshams, Bahareh, et al.
Published: (2025)

Fourier Neural Operators Explained: A Practical Perspective
by: Duruisseaux, Valentin, et al.
Published: (2025)

Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads on Consumer-Grade Devices
by: Huang, Yuxiang, et al.
Published: (2024)

COMCAT: Towards Efficient Compression and Customization of Attention-Based Vision Models
by: Xiao, Jinqi, et al.
Published: (2023)

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
by: Sun, Hanshi, et al.
Published: (2024)

Incremental Spatial and Spectral Learning of Neural Operators for Solving Large-Scale PDEs
by: George, Robert Joseph, et al.
Published: (2022)

TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model
by: Yang, Cheng, et al.
Published: (2025)

Automatic BLAS Offloading on Unified Memory Architecture: A Study on NVIDIA Grace-Hopper
by: Li, Junjie, et al.
Published: (2024)

Solving Poisson Equations using Neural Walk-on-Spheres
by: Nam, Hong Chul, et al.
Published: (2024)

Prismer: A Vision-Language Model with Multi-Task Experts
by: Liu, Shikun, et al.
Published: (2023)

Improving Vision Transformers by Overlapping Heads in Multi-Head Self-Attention
by: Zhang, Tianxiao, et al.
Published: (2024)

Inferring Functionality of Attention Heads from their Parameters
by: Elhelo, Amit, et al.
Published: (2024)

FlashHead: Efficient Drop-In Replacement for the Classification Head in Language Model Inference
by: Tranheden, Wilhelm, et al.
Published: (2026)

MoH: Multi-Head Attention as Mixture-of-Head Attention
by: Jin, Peng, et al.
Published: (2024)

MemSifter: Offloading LLM Memory Retrieval via Outcome-Driven Proxy Reasoning
by: Tan, Jiejun, et al.
Published: (2026)

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
by: Cai, Tianle, et al.
Published: (2024)

Geometric Operator Learning with Optimal Transport
by: Li, Xinyi, et al.
Published: (2025)

Fast Training of Diffusion Models with Masked Transformers
by: Zheng, Hongkai, et al.
Published: (2023)

ChatGPT Based Data Augmentation for Improved Parameter-Efficient Debiasing of LLMs
by: Han, Pengrui, et al.
Published: (2024)

T-Stitch: Accelerating Sampling in Pre-Trained Diffusion Models with Trajectory Stitching
by: Pan, Zizheng, et al.
Published: (2024)