| Main Authors: | Zhang, Pengle, Wei, Jia, Zhang, Jintao, Zhu, Jun, Chen, Jianfei |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2503.08040 |
Similar Items
SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration
by: Zhang, Jintao, et al.
Published: (2024)
SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization
by: Zhang, Jintao, et al.
Published: (2024)
Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization
by: Xi, Haocheng, et al.
Published: (2024)
SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training
by: Zhang, Jintao, et al.
Published: (2025)
TetraJet-v2: Accurate NVFP4 Training for Large Language Models with Oscillation Suppression and Outlier Control
by: Chen, Yuxiang, et al.
Published: (2025)
SageAttention2++: A More Efficient Implementation of SageAttention2
by: Zhang, Jintao, et al.
Published: (2025)
SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference
by: Zhang, Jintao, et al.
Published: (2025)
Deterministic Differentiable Structured Pruning for Large Language Models
by: Huang, Weiyu, et al.
Published: (2026)
INT-FlashAttention: Enabling Flash Attention for INT8 Quantization
by: Chen, Shimao, et al.
Published: (2024)
FF-INT8: Efficient Forward-Forward DNN Training on Edge Devices with INT8 Precision
by: Ma, Jingxiao, et al.
Published: (2025)
KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels
by: Wang, Han, et al.
Published: (2026)
Safer Policy Compliance with Dynamic Epistemic Fallback
by: Imperial, Joseph Marvin, et al.
Published: (2026)
Identifying Sensitive Weights via Post-quantization Integral
by: Hu, Yuezhou, et al.
Published: (2025)
Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference
by: Yi, Ke, et al.
Published: (2024)
CAST: Continuous and Differentiable Semi-Structured Sparsity-Aware Training for Large Language Models
by: Huang, Weiyu, et al.
Published: (2025)
Oscillation-Reduced MXFP4 Training for Vision Transformers
by: Chen, Yuxiang, et al.
Published: (2025)
SageBwd: A Trainable Low-bit Attention
by: Zhang, Jintao, et al.
Published: (2026)
FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design
by: Zhang, Hao, et al.
Published: (2025)
DeltaDock: A Unified Framework for Accurate, Efficient, and Physically Reliable Molecular Docking
by: Yan, Jiaxian, et al.
Published: (2024)
ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing
by: Wang, Ziteng, et al.
Published: (2024)
S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training
by: Hu, Yuezhou, et al.
Published: (2024)
Efficient Backpropagation with Variance-Controlled Adaptive Sampling
by: Wang, Ziteng, et al.
Published: (2024)
SpargeAttention2: Trainable Sparse Attention via Hybrid Top-k+Top-p Masking and Distillation Fine-Tuning
by: Zhang, Jintao, et al.
Published: (2026)
GPU-Accelerated INT8 Quantization for KV Cache Compression in Large Language Models
by: Taneja, Maanas, et al.
Published: (2026)
COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training
by: Xi, Haocheng, et al.
Published: (2024)
InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory
by: Xiao, Chaojun, et al.
Published: (2024)
TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times
by: Zhang, Jintao, et al.
Published: (2025)
Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency
by: Zheng, Kaiwen, et al.
Published: (2025)
Improved Techniques for Maximum Likelihood Estimation for Diffusion ODEs
by: Zheng, Kaiwen, et al.
Published: (2023)
SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving
by: Jia, Jinda, et al.
Published: (2026)
FP=xINT:Representing Neural Networks via Low-Bit Series Basis Functions
by: Zhang, Boyang, et al.
Published: (2024)
MOSS: Efficient and Accurate FP8 LLM Training with Microscaling and Automatic Scaling
by: Zhang, Yu, et al.
Published: (2025)
NTP-INT: Network Traffic Prediction-Driven In-band Network Telemetry for High-load Switches
by: Zhang, Penghui, et al.
Published: (2025)
Exploring Dynamic Properties of Backdoor Training Through Information Bottleneck
by: Liu, Xinyu, et al.
Published: (2025)
FireQ: Fast INT4-FP8 Kernel and RoPE-aware Quantization for LLM Inference Acceleration
by: Baek, Daehyeon, et al.
Published: (2025)
SLA2: Sparse-Linear Attention with Learnable Routing and QAT
by: Zhang, Jintao, et al.
Published: (2026)
INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
by: Chen, Mengzhao, et al.
Published: (2025)
Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients
by: Zhang, Zhenyu, et al.
Published: (2024)
Efficient Training of Large-Scale AI Models Through Federated Mixture-of-Experts: A System-Level Approach
by: Chen, Xiaobing, et al.
Published: (2025)
Efficient Hyperparameter Tuning via Trajectory Invariance Principle
by: Li, Bingrui, et al.
Published: (2025)