| Main Authors: | Dong, Harry; Johnson, Tyler; Cho, Minsik; Soroush, Emad |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2411.07942 |
Similar Items
Communication Compression for Tensor Parallel LLM Inference
by: Hansen-Palmus, Jan, et al.
Published: (2024)
SPD: Sync-Point Drop for Efficient Tensor Parallelism of Large Language Models
by: Kim, Han-Byul, et al.
Published: (2025)
MoEs Are Stronger than You Think: Hyper-Parallel Inference Scaling with RoE
by: Zibakhsh, Soheil, et al.
Published: (2025)
LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
by: Fu, Qichen, et al.
Published: (2024)
LLM in a flash: Efficient Large Language Model Inference with Limited Memory
by: Alizadeh, Keivan, et al.
Published: (2023)
SpecMD: A Comprehensive Study On Speculative Expert Prefetching
by: Hoang, Duc, et al.
Published: (2026)
ICQuant: Index Coding enables Low-bit LLM Quantization
by: Li, Xinlin, et al.
Published: (2025)
Recursive Language Models Meet Uncertainty: The Surprising Effectiveness of Self-Reflective Program Search for Long Context
by: Alizadeh, Keivan, et al.
Published: (2026)
KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation
by: Cho, Minsik, et al.
Published: (2024)
TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill and Decode Inference
by: Tang, Xiaojuan, et al.
Published: (2025)
SALE : Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling
by: Ji, Xiaodong, et al.
Published: (2025)
Scalable LLM Reasoning Acceleration with Low-rank Distillation
by: Dong, Harry, et al.
Published: (2025)
any4: Learned 4-bit Numeric Representation for LLMs
by: Elhoushi, Mostafa, et al.
Published: (2025)
Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference
by: Dong, Harry, et al.
Published: (2024)
SageBwd: A Trainable Low-bit Attention
by: Zhang, Jintao, et al.
Published: (2026)
Towards Interpretable Deep Reinforcement Learning Models via Inverse Reinforcement Learning
by: Xie, Sean, et al.
Published: (2022)
Generalized Parallel Scaling with Interdependent Generations
by: Dong, Harry, et al.
Published: (2025)
Low-bit Model Quantization for Deep Neural Networks: A Survey
by: Liu, Kai, et al.
Published: (2025)
MoE-PHDS: One MoE checkpoint for flexible runtime sparsity
by: Hannah, Lauren A., et al.
Published: (2025)
ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms
by: Xu, Bingxin, et al.
Published: (2025)
TIDE: Every Layer Knows the Token Beneath the Context
by: Jaiswal, Ajay, et al.
Published: (2026)
Towards LLM-guided Efficient and Interpretable Multi-linear Tensor Network Rank Selection
by: Iacovides, Giorgos, et al.
Published: (2024)
D$^2$Quant: Accurate Low-bit Post-Training Weight Quantization for LLMs
by: Yan, Xianglong, et al.
Published: (2026)
Prompt-prompted Adaptive Structured Pruning for Efficient LLM Generation
by: Dong, Harry, et al.
Published: (2024)
Token-Efficient RL for LLM Reasoning
by: Lee, Alan, et al.
Published: (2025)
INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
by: Chen, Mengzhao, et al.
Published: (2025)
Effective Generative AI: The Human-Algorithm Centaur
by: Saghafian, Soroush, et al.
Published: (2024)
Personalized Student Knowledge Modeling for Future Learning Resource Prediction
by: Hashemifar, Soroush, et al.
Published: (2025)
Hyperparameter Trajectory Inference with Conditional Lagrangian Optimal Transport
by: Amad, Harry, et al.
Published: (2026)
ParetoQ: Improving Scaling Laws in Extremely Low-bit LLM Quantization
by: Liu, Zechun, et al.
Published: (2025)
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
by: Armandpour, Mohammadreza, et al.
Published: (2026)
Robust Conformal Prediction with a Single Binary Certificate
by: Zargarbashi, Soroush H., et al.
Published: (2025)
A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms
by: Gong, Ruihao, et al.
Published: (2024)
Low-Rank Tensor Decompositions for the Theory of Neural Networks
by: Borsoi, Ricardo, et al.
Published: (2025)
Low Tensor-Rank Adaptation of Kolmogorov--Arnold Networks
by: Gao, Yihang, et al.
Published: (2025)
WebLLM: A High-Performance In-Browser LLM Inference Engine
by: Ruan, Charlie F., et al.
Published: (2024)
Entropy Meets Importance: A Unified Head Importance-Entropy Score for Stable and Efficient Transformer Pruning
by: Choi, Minsik, et al.
Published: (2025)
Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization
by: Samragh, Mohammad, et al.
Published: (2024)
Matmul or No Matmul in the Era of 1-bit LLMs
by: Malekar, Jinendra, et al.
Published: (2024)
4bit-Quantization in Vector-Embedding for RAG
by: Jeong, Taehee
Published: (2025)