Staff View: DASH: Faster Shampoo via Batched Block Preconditioning and Efficient Inverse-Root Solvers :: Library Catalog

Bibliographic Details
Main Authors:	Modoranu, Ionut-Vlad; Zmushko, Philip; Schultheis, Erik; Safaryan, Mher; Alistarh, Dan
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2602.02016

_version_	1866911416383963136
author	Modoranu, Ionut-Vlad; Zmushko, Philip; Schultheis, Erik; Safaryan, Mher; Alistarh, Dan
author_facet	Modoranu, Ionut-Vlad; Zmushko, Philip; Schultheis, Erik; Safaryan, Mher; Alistarh, Dan
contents	Shampoo is one of the leading approximate second-order optimizers: a variant of it has won the MLCommons AlgoPerf competition, and it has been shown to produce models with lower activation outliers that are easier to compress. Yet, applying Shampoo currently comes at the cost of significant computational slowdown, due to its expensive internal operations. In this paper, we take a significant step to address this shortcoming by proposing DASH (for Distributed Accelerated SHampoo), a faster implementation of Distributed Shampoo based on two main new techniques: First, we show that preconditioner blocks can be stacked into 3D tensors to significantly improve GPU utilization; second, we introduce the Newton-DB iteration and the Chebyshev polynomial approximations as novel and faster approaches for computing the inverse matrix roots required by Shampoo. Along with these algorithmic contributions, we provide a first in-depth analysis of how matrix scaling critically affects Shampoo convergence. On the practical side, our GPU-aware implementation achieves up to 4.83× faster optimizer steps compared to the well-optimized Distributed Shampoo, while Newton-DB attains the lowest validation perplexity per iteration among all tested methods. Our code is available at https://github.com/IST-DASLab/DASH.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_02016
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	DASH: Faster Shampoo via Batched Block Preconditioning and Efficient Inverse-Root Solvers Modoranu, Ionut-Vlad Zmushko, Philip Schultheis, Erik Safaryan, Mher Alistarh, Dan Machine Learning Shampoo is one of the leading approximate second-order optimizers: a variant of it has won the MLCommons AlgoPerf competition, and it has been shown to produce models with lower activation outliers that are easier to compress. Yet, applying Shampoo currently comes at the cost of significant computational slowdown, due to its expensive internal operations. In this paper, we take a significant step to address this shortcoming by proposing DASH (for Distributed Accelerated SHampoo), a faster implementation of Distributed Shampoo based on two main new techniques: First, we show that preconditioner blocks can be stacked into 3D tensors to significantly improve GPU utilization; second, we introduce the Newton-DB iteration and the Chebyshev polynomial approximations as novel and faster approaches for computing the inverse matrix roots required by Shampoo. Along with these algorithmic contributions, we provide a first in-depth analysis of how matrix scaling critically affects Shampoo convergence. On the practical side, our GPU-aware implementation achieves up to 4.83× faster optimizer steps compared to the well-optimized Distributed Shampoo, while Newton-DB attains the lowest validation perplexity per iteration among all tested methods. Our code is available at https://github.com/IST-DASLab/DASH.
title	DASH: Faster Shampoo via Batched Block Preconditioning and Efficient Inverse-Root Solvers
topic	Machine Learning
url	https://arxiv.org/abs/2602.02016

Similar Items