:: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Pengle; Wei, Jia; Zhang, Jintao; Zhu, Jun; Chen, Jianfei
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2503.08040

Similar Items

SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration
by: Zhang, Jintao, et al.
Published: (2024)

SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization
by: Zhang, Jintao, et al.
Published: (2024)

Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization
by: Xi, Haocheng, et al.
Published: (2024)

SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training
by: Zhang, Jintao, et al.
Published: (2025)

TetraJet-v2: Accurate NVFP4 Training for Large Language Models with Oscillation Suppression and Outlier Control
by: Chen, Yuxiang, et al.
Published: (2025)

SageAttention2++: A More Efficient Implementation of SageAttention2
by: Zhang, Jintao, et al.
Published: (2025)

SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference
by: Zhang, Jintao, et al.
Published: (2025)

Deterministic Differentiable Structured Pruning for Large Language Models
by: Huang, Weiyu, et al.
Published: (2026)

INT-FlashAttention: Enabling Flash Attention for INT8 Quantization
by: Chen, Shimao, et al.
Published: (2024)

FF-INT8: Efficient Forward-Forward DNN Training on Edge Devices with INT8 Precision
by: Ma, Jingxiao, et al.
Published: (2025)

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels
by: Wang, Han, et al.
Published: (2026)

Safer Policy Compliance with Dynamic Epistemic Fallback
by: Imperial, Joseph Marvin, et al.
Published: (2026)

Identifying Sensitive Weights via Post-quantization Integral
by: Hu, Yuezhou, et al.
Published: (2025)

Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference
by: Yi, Ke, et al.
Published: (2024)

CAST: Continuous and Differentiable Semi-Structured Sparsity-Aware Training for Large Language Models
by: Huang, Weiyu, et al.
Published: (2025)

Oscillation-Reduced MXFP4 Training for Vision Transformers
by: Chen, Yuxiang, et al.
Published: (2025)

SageBwd: A Trainable Low-bit Attention
by: Zhang, Jintao, et al.
Published: (2026)

FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design
by: Zhang, Hao, et al.
Published: (2025)

DeltaDock: A Unified Framework for Accurate, Efficient, and Physically Reliable Molecular Docking
by: Yan, Jiaxian, et al.
Published: (2024)

ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing
by: Wang, Ziteng, et al.
Published: (2024)

S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training
by: Hu, Yuezhou, et al.
Published: (2024)

Efficient Backpropagation with Variance-Controlled Adaptive Sampling
by: Wang, Ziteng, et al.
Published: (2024)

SpargeAttention2: Trainable Sparse Attention via Hybrid Top-k+Top-p Masking and Distillation Fine-Tuning
by: Zhang, Jintao, et al.
Published: (2026)

GPU-Accelerated INT8 Quantization for KV Cache Compression in Large Language Models
by: Taneja, Maanas, et al.
Published: (2026)

COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training
by: Xi, Haocheng, et al.
Published: (2024)

InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory
by: Xiao, Chaojun, et al.
Published: (2024)

TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times
by: Zhang, Jintao, et al.
Published: (2025)

Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency
by: Zheng, Kaiwen, et al.
Published: (2025)

Improved Techniques for Maximum Likelihood Estimation for Diffusion ODEs
by: Zheng, Kaiwen, et al.
Published: (2023)

SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving
by: Jia, Jinda, et al.
Published: (2026)

FP=xINT:Representing Neural Networks via Low-Bit Series Basis Functions
by: Zhang, Boyang, et al.
Published: (2024)

MOSS: Efficient and Accurate FP8 LLM Training with Microscaling and Automatic Scaling
by: Zhang, Yu, et al.
Published: (2025)

NTP-INT: Network Traffic Prediction-Driven In-band Network Telemetry for High-load Switches
by: Zhang, Penghui, et al.
Published: (2025)

Exploring Dynamic Properties of Backdoor Training Through Information Bottleneck
by: Liu, Xinyu, et al.
Published: (2025)

FireQ: Fast INT4-FP8 Kernel and RoPE-aware Quantization for LLM Inference Acceleration
by: Baek, Daehyeon, et al.
Published: (2025)

SLA2: Sparse-Linear Attention with Learnable Routing and QAT
by: Zhang, Jintao, et al.
Published: (2026)

INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
by: Chen, Mengzhao, et al.
Published: (2025)

Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients
by: Zhang, Zhenyu, et al.
Published: (2024)

Efficient Training of Large-Scale AI Models Through Federated Mixture-of-Experts: A System-Level Approach
by: Chen, Xiaobing, et al.
Published: (2025)

Efficient Hyperparameter Tuning via Trajectory Invariance Principle
by: Li, Bingrui, et al.
Published: (2025)