| Main Authors: | Zhang, Tianyi; Shrivastava, Anshumali |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2407.10032 |
Similar Items
KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization
by: Zhang, Tianyi, et al.
Published: (2024)
Learning Scalable Structural Representations for Link Prediction with Bloom Signatures
by: Zhang, Tianyi, et al.
Published: (2023)
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
by: Xiao, Guangxuan, et al.
Published: (2022)
GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance
by: Kim, Jinuk, et al.
Published: (2025)
To Compress or Not? Pushing the Frontier of Lossless GenAI Model Weights Compression with Exponent Concentration
by: Yang, Zeyu, et al.
Published: (2025)
Sketch to Adapt: Fine-Tunable Sketches for Efficient LLM Adaptation
by: Zhang, Tianyi, et al.
Published: (2024)
MergeQuant: Accurate 4-bit Static Quantization of Large Language Models by Channel-wise Calibration
by: Wang, Jinguang, et al.
Published: (2025)
CoVE: Compressed Vocabulary Expansion Makes Better LLM-based Recommender Systems
by: Zhang, Haochen, et al.
Published: (2025)
NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention
by: Zhang, Tianyi, et al.
Published: (2024)
OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
by: Shao, Wenqi, et al.
Published: (2023)
RepQuant: Towards Accurate Post-Training Quantization of Large Transformer Models via Scale Reparameterization
by: Li, Zhikai, et al.
Published: (2024)
AffineQuant: Affine Transformation Quantization for Large Language Models
by: Ma, Yuexiao, et al.
Published: (2024)
NanoQuant: Efficient Sub-1-Bit Quantization of Large Language Models
by: Chong, Hyochan, et al.
Published: (2026)
LLMEasyQuant: Scalable Quantization for Parallel and Distributed LLM Inference
by: Liu, Dong, et al.
Published: (2024)
70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float (DFloat11)
by: Zhang, Tianyi, et al.
Published: (2025)
Superintelligent Retrieval Agent: The Next Frontier of Information Retrieval
by: Yang, Zeyu, et al.
Published: (2026)
PrefixQuant: Eliminating Outliers by Prefixed Tokens for Large Language Models Quantization
by: Chen, Mengzhao, et al.
Published: (2024)
Breaking the Frozen Subspace: Importance Sampling for Low-Rank Optimization in LLM Pretraining
by: Zhang, Haochen, et al.
Published: (2025)
D$^2$Quant: Accurate Low-bit Post-Training Weight Quantization for LLMs
by: Yan, Xianglong, et al.
Published: (2026)
CrossQuant: A Post-Training Quantization Method with Smaller Quantization Kernel for Precise Large Language Model Compression
by: Liu, Wenyuan, et al.
Published: (2024)
RACE Attention: A Strictly Linear-Time Attention Layer for Training on Outrageously Large Contexts
by: Joshi, Sahil, et al.
Published: (2025)
QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals
by: Zhang, Nan, et al.
Published: (2026)
Quant-dLLM: Post-Training Extreme Low-Bit Quantization for Diffusion Large Language Models
by: Zhang, Tianao, et al.
Published: (2025)
Grid Games: The Power of Multiple Grids for Quantizing Large Language Models
by: Egiazarian, Vage, et al.
Published: (2026)
QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
by: Zhang, Jingxuan, et al.
Published: (2026)
FlattenQuant: Breaking Through the Inference Compute-bound for Large Language Models with Per-tensor Quantization
by: Zhang, Yi, et al.
Published: (2024)
OstQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting
by: Hu, Xing, et al.
Published: (2025)
Scout Before You Attend: Sketch-and-Walk Sparse Attention for Efficient LLM Inference
by: Le, Hoang Anh Duy, et al.
Published: (2026)
Obstacle-aware Gaussian Process Regression
by: Shrivastava, Gaurav
Published: (2024)
FrameQuant: Flexible Low-Bit Quantization for Transformers
by: Adepu, Harshavardhan, et al.
Published: (2024)
FlatQuant: Flatness Matters for LLM Quantization
by: Sun, Yuxuan, et al.
Published: (2024)
Empowering Distributed Training with Sparsity-driven Data Synchronization
by: Wang, Zhuang, et al.
Published: (2023)
REFRAG: Rethinking RAG based Decoding
by: Lin, Xiaoqiang, et al.
Published: (2025)
pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training
by: Zhang, Wenzheng, et al.
Published: (2026)
Q-resafe: Assessing Safety Risks and Quantization-aware Safety Patching for Quantized Large Language Models
by: Chen, Kejia, et al.
Published: (2025)
BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models
by: Chen, Junyu, et al.
Published: (2026)
PolarQuant: Quantizing KV Caches with Polar Transformation
by: Han, Insu, et al.
Published: (2025)
Accurate and Efficient Fine-Tuning of Quantized Large Language Models Through Optimal Balance
by: Shen, Ao, et al.
Published: (2024)
QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models
by: Liu, Jing, et al.
Published: (2023)
QuantKAN: A Unified Quantization Framework for Kolmogorov Arnold Networks
by: Fuad, Kazi Ahmed Asif, et al.
Published: (2025)