| Main Authors: | Luo, Cheng, Cai, Zefan, Sun, Hanshi, Xiao, Jinqi, Yuan, Bo, Xiao, Wen, Hu, Junjie, Zhao, Jiawei, Chen, Beidi, Anandkumar, Anima |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.12574 |
Similar Items
MOM: Memory-Efficient Offloaded Mini-Sequence Inference for Long Context Language Models
by: Zhang, Junyang, et al.
Published: (2025)
Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training
by: Luo, Cheng, et al.
Published: (2024)
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
by: Zhao, Jiawei, et al.
Published: (2024)
R-KV: Redundancy-aware KV Cache Compression for Reasoning Models
by: Cai, Zefan, et al.
Published: (2025)
InRank: Incremental Low-Rank Learning
by: Zhao, Jiawei, et al.
Published: (2023)
EcoSpa: Efficient Transformer Training with Coupled Sparsity
by: Xiao, Jinqi, et al.
Published: (2025)
Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning
by: Fu, Yu, et al.
Published: (2024)
Delta Attention Residuals
by: Luo, Cheng, et al.
Published: (2026)
TensorGRaD: Tensor Gradient Robust Decomposition for Memory-Efficient Neural Operator Training
by: Loeschcke, Sebastian, et al.
Published: (2025)
FlexInfer: Breaking Memory Constraint via Flexible and Efficient Offloading for On-Device LLM Inference
by: Du, Hongchao, et al.
Published: (2025)
PerAda: Parameter-Efficient Federated Learning Personalization with Generalization Guarantees
by: Xie, Chulin, et al.
Published: (2023)
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
by: Sun, Hanshi, et al.
Published: (2024)
COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection
by: Xiao, Jinqi, et al.
Published: (2024)
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
by: Xiao, Guangxuan, et al.
Published: (2024)
MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models
by: Zhao, Haozhe, et al.
Published: (2025)
InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference
by: Pan, Xiurui, et al.
Published: (2024)
CHAI: Clustered Head Attention for Efficient LLM Inference
by: Agarwal, Saurabh, et al.
Published: (2024)
Diffusion State-Guided Projected Gradient for Inverse Problems
by: Zirvi, Rayhan, et al.
Published: (2024)
Lean Copilot: Large Language Models as Copilots for Theorem Proving in Lean
by: Song, Peiyang, et al.
Published: (2024)
Calibrated Uncertainty Quantification for Operator Learning via Conformal Prediction
by: Ma, Ziqi, et al.
Published: (2024)
Mechanistic Interpretability with Sparse Autoencoder Neural Operators
by: Tolooshams, Bahareh, et al.
Published: (2025)
Fourier Neural Operators Explained: A Practical Perspective
by: Duruisseaux, Valentin, et al.
Published: (2025)
Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads on Consumer-Grade Devices
by: Huang, Yuxiang, et al.
Published: (2024)
COMCAT: Towards Efficient Compression and Customization of Attention-Based Vision Models
by: Xiao, Jinqi, et al.
Published: (2023)
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
by: Sun, Hanshi, et al.
Published: (2024)
Incremental Spatial and Spectral Learning of Neural Operators for Solving Large-Scale PDEs
by: George, Robert Joseph, et al.
Published: (2022)
TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model
by: Yang, Cheng, et al.
Published: (2025)
Automatic BLAS Offloading on Unified Memory Architecture: A Study on NVIDIA Grace-Hopper
by: Li, Junjie, et al.
Published: (2024)
Solving Poisson Equations using Neural Walk-on-Spheres
by: Nam, Hong Chul, et al.
Published: (2024)
Prismer: A Vision-Language Model with Multi-Task Experts
by: Liu, Shikun, et al.
Published: (2023)
Improving Vision Transformers by Overlapping Heads in Multi-Head Self-Attention
by: Zhang, Tianxiao, et al.
Published: (2024)
Inferring Functionality of Attention Heads from their Parameters
by: Elhelo, Amit, et al.
Published: (2024)
FlashHead: Efficient Drop-In Replacement for the Classification Head in Language Model Inference
by: Tranheden, Wilhelm, et al.
Published: (2026)
MoH: Multi-Head Attention as Mixture-of-Head Attention
by: Jin, Peng, et al.
Published: (2024)
MemSifter: Offloading LLM Memory Retrieval via Outcome-Driven Proxy Reasoning
by: Tan, Jiejun, et al.
Published: (2026)
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
by: Cai, Tianle, et al.
Published: (2024)
Geometric Operator Learning with Optimal Transport
by: Li, Xinyi, et al.
Published: (2025)
Fast Training of Diffusion Models with Masked Transformers
by: Zheng, Hongkai, et al.
Published: (2023)
ChatGPT Based Data Augmentation for Improved Parameter-Efficient Debiasing of LLMs
by: Han, Pengrui, et al.
Published: (2024)
T-Stitch: Accelerating Sampling in Pre-Trained Diffusion Models with Trajectory Stitching
by: Pan, Zizheng, et al.
Published: (2024)