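Staff View: BurTorch: Revisiting Training from First Principles by Coupling Autodiff, Math Optimization, and Systems :: Library Catalog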

Bibliographic Details
Main Authors:	Burlachenko, Konstantin; Richtárik, Peter
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Mathematical Software; 65Y10; I.2.6; I.2.8; C.1.3; C.5.3; G.4; C.3
Online Access:	https://arxiv.org/abs/2503.13795

_version_	1866910880244957184
author	Burlachenko, Konstantin Richtárik, Peter
author_facet	Burlachenko, Konstantin Richtárik, Peter
contents	In this work, we introduce BurTorch, a compact high-performance framework designed to optimize Deep Learning (DL) training on single-node workstations through an exceptionally efficient CPU-based backpropagation (Rumelhart et al., 1986; Linnainmaa, 1970) implementation. Although modern DL frameworks rely on compiler-like optimizations internally, BurTorch takes a different path. It adopts a minimalist design and demonstrates that, in these circumstances, classical compiled programming languages can play a significant role in DL research. By eliminating the overhead of large frameworks and making efficient implementation choices, BurTorch achieves orders-of-magnitude improvements in performance and memory efficiency when computing $\nabla f(x)$ on a CPU. BurTorch features a compact codebase designed to achieve two key goals simultaneously. First, it provides a user experience similar to script-based programming environments. Second, it dramatically minimizes runtime overheads. In large DL frameworks, the primary source of memory overhead for relatively small computation graphs $f(x)$ is due to feature-heavy implementations. We benchmarked BurTorch against widely used DL frameworks in their execution modes: JAX (Bradbury et al., 2018), PyTorch (Paszke et al., 2019), TensorFlow (Abadi et al., 2016); and several standalone libraries: Autograd (Maclaurin et al., 2015), Micrograd (Karpathy, 2020), Apple MLX (Hannun et al., 2023). For small compute graphs, BurTorch outperforms best-practice solutions by up to $\times 2000$ in runtime and reduces memory consumption by up to $\times 3500$. For a miniaturized GPT-3 model (Brown et al., 2020), BurTorch achieves up to a $\times 20$ speedup and reduces memory by up to $\times 80$ compared to PyTorch.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_13795
institution	arXiv
publishDate	2025
record_format	arxiv
title	BurTorch: Revisiting Training from First Principles by Coupling Autodiff, Math Optimization, and Systems
topic	Machine Learning Mathematical Software 65Y10 I.2.6; I.2.8; C.1.3; C.5.3; G.4; C.3
url	https://arxiv.org/abs/2503.13795
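
Note: the abstract centers on computing $\nabla f(x)$ by backpropagation over small compute graphs. As a rough illustration of that operation only, here is a minimal scalar reverse-mode sketch in the style of Micrograd, one of the libraries the abstract lists as a baseline. It is not BurTorch code and does not reflect its API or its optimizations; the class name `Value`, its fields, and the example function are hypothetical and chosen for illustration.

```python
import math


class Value:
    """Scalar node in a compute graph (hypothetical, Micrograd-style)."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0            # accumulates d(output)/d(this node)
        self.parents = parents     # nodes this value was computed from
        self.backward_rule = None  # pushes self.grad into the parents

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))

        def rule():
            self.grad += out.grad
            other.grad += out.grad

        out.backward_rule = rule
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))

        def rule():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out.backward_rule = rule
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))

        def rule():
            self.grad += (1.0 - t * t) * out.grad

        out.backward_rule = rule
        return out

    def backward(self):
        # Order nodes so that parents come before the nodes built from them.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk from the output back to the inputs, applying the chain rule.
        for node in reversed(order):
            if node.backward_rule is not None:
                node.backward_rule()


# Example function (an assumption, not from the paper): f(x, y) = tanh(x*y + x)
x, y = Value(0.5), Value(-2.0)
f = (x * y + x).tanh()
f.backward()
print(f.data, x.grad, y.grad)
```

Every node in this sketch is a heap-allocated Python object with a closure attached. That per-node interpreter overhead is the kind of cost the abstract says a compact compiled implementation can avoid on small graphs.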