:: Library Catalog

Bibliographic Details
Main Authors:	Dong, Harry; Johnson, Tyler; Cho, Minsik; Soroush, Emad
Format:	Preprint
Published:	2024
Subjects:	Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2411.07942

Similar Items

Communication Compression for Tensor Parallel LLM Inference
by: Hansen-Palmus, Jan, et al.
Published: (2024)

SPD: Sync-Point Drop for Efficient Tensor Parallelism of Large Language Models
by: Kim, Han-Byul, et al.
Published: (2025)

MoEs Are Stronger than You Think: Hyper-Parallel Inference Scaling with RoE
by: Zibakhsh, Soheil, et al.
Published: (2025)

LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
by: Fu, Qichen, et al.
Published: (2024)

LLM in a flash: Efficient Large Language Model Inference with Limited Memory
by: Alizadeh, Keivan, et al.
Published: (2023)

SpecMD: A Comprehensive Study On Speculative Expert Prefetching
by: Hoang, Duc, et al.
Published: (2026)

ICQuant: Index Coding enables Low-bit LLM Quantization
by: Li, Xinlin, et al.
Published: (2025)

Recursive Language Models Meet Uncertainty: The Surprising Effectiveness of Self-Reflective Program Search for Long Context
by: Alizadeh, Keivan, et al.
Published: (2026)

KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation
by: Cho, Minsik, et al.
Published: (2024)

TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill and Decode Inference
by: Tang, Xiaojuan, et al.
Published: (2025)

SALE: Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling
by: Ji, Xiaodong, et al.
Published: (2025)

Scalable LLM Reasoning Acceleration with Low-rank Distillation
by: Dong, Harry, et al.
Published: (2025)

any4: Learned 4-bit Numeric Representation for LLMs
by: Elhoushi, Mostafa, et al.
Published: (2025)

Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference
by: Dong, Harry, et al.
Published: (2024)

SageBwd: A Trainable Low-bit Attention
by: Zhang, Jintao, et al.
Published: (2026)

Towards Interpretable Deep Reinforcement Learning Models via Inverse Reinforcement Learning
by: Xie, Sean, et al.
Published: (2022)

Generalized Parallel Scaling with Interdependent Generations
by: Dong, Harry, et al.
Published: (2025)

Low-bit Model Quantization for Deep Neural Networks: A Survey
by: Liu, Kai, et al.
Published: (2025)

MoE-PHDS: One MoE checkpoint for flexible runtime sparsity
by: Hannah, Lauren A., et al.
Published: (2025)

ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms
by: Xu, Bingxin, et al.
Published: (2025)

TIDE: Every Layer Knows the Token Beneath the Context
by: Jaiswal, Ajay, et al.
Published: (2026)

Towards LLM-guided Efficient and Interpretable Multi-linear Tensor Network Rank Selection
by: Iacovides, Giorgos, et al.
Published: (2024)

D$^2$Quant: Accurate Low-bit Post-Training Weight Quantization for LLMs
by: Yan, Xianglong, et al.
Published: (2026)

Prompt-prompted Adaptive Structured Pruning for Efficient LLM Generation
by: Dong, Harry, et al.
Published: (2024)

Token-Efficient RL for LLM Reasoning
by: Lee, Alan, et al.
Published: (2025)

INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
by: Chen, Mengzhao, et al.
Published: (2025)

Effective Generative AI: The Human-Algorithm Centaur
by: Saghafian, Soroush, et al.
Published: (2024)

Personalized Student Knowledge Modeling for Future Learning Resource Prediction
by: Hashemifar, Soroush, et al.
Published: (2025)

Hyperparameter Trajectory Inference with Conditional Lagrangian Optimal Transport
by: Amad, Harry, et al.
Published: (2026)

ParetoQ: Improving Scaling Laws in Extremely Low-bit LLM Quantization
by: Liu, Zechun, et al.
Published: (2025)

Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
by: Armandpour, Mohammadreza, et al.
Published: (2026)

Robust Conformal Prediction with a Single Binary Certificate
by: Zargarbashi, Soroush H., et al.
Published: (2025)

A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms
by: Gong, Ruihao, et al.
Published: (2024)

Low-Rank Tensor Decompositions for the Theory of Neural Networks
by: Borsoi, Ricardo, et al.
Published: (2025)

Low Tensor-Rank Adaptation of Kolmogorov-Arnold Networks
by: Gao, Yihang, et al.
Published: (2025)

WebLLM: A High-Performance In-Browser LLM Inference Engine
by: Ruan, Charlie F., et al.
Published: (2024)

Entropy Meets Importance: A Unified Head Importance-Entropy Score for Stable and Efficient Transformer Pruning
by: Choi, Minsik, et al.
Published: (2025)

Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization
by: Samragh, Mohammad, et al.
Published: (2024)

Matmul or No Matmul in the Era of 1-bit LLMs
by: Malekar, Jinendra, et al.
Published: (2024)

4bit-Quantization in Vector-Embedding for RAG
by: Jeong, Taehee
Published: (2025)