Staff View: Majority Kernels: An Approach to Leverage Big Model Dynamics for Efficient Small Model Training

Bibliographic Details
Main Authors:	Mazzawi, Hanna; Awasthi, Pranjal; Gonzalvo, Xavi; Ramalingam, Srikumar
Format:	Preprint
Published:	2024
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2402.05033

_version_	1866916489742778368
author	Mazzawi, Hanna; Awasthi, Pranjal; Gonzalvo, Xavi; Ramalingam, Srikumar
author_facet	Mazzawi, Hanna; Awasthi, Pranjal; Gonzalvo, Xavi; Ramalingam, Srikumar
contents	Recent breakthroughs and successful deployments of large language and vision models in constrained environments predominantly follow a two-phase approach. First, large models are trained to achieve peak performance; then a model-shrinking method is applied to meet hardware constraints. Methods such as distillation, compression, or quantization leverage the highly performant large models to induce smaller performant ones. Formally, this can be seen as the problem of identifying an optimal model of size $n$ from a larger model of size $k \cdot n$, where $k > 1$ is the overparameterization factor. This paper explores the hypothesis that a single training run can simultaneously train a larger model for performance and derive a smaller model for deployment. Our contribution is an effective architectural change, namely Majority Kernels, that is compatible with the main standard architectures such as multi-layer perceptrons (MLPs), residual networks (ResNets), and Transformers. We demonstrate that applying our technique modifies the training dynamics, resulting in performance gains across architectures and tasks while keeping inference performance consistent. Furthermore, our approach adds minimal overhead to the training cost (wall-clock time). The proposed approach shows strong performance on a wide variety of datasets and models, even outperforming strong baselines such as distilled ensembles and combinatorial optimization methods based on submodular optimization.
format	Preprint
id	arxiv_https___arxiv_org_abs_2402_05033
institution	arXiv
publishDate	2024
record_format	arxiv
title	Majority Kernels: An Approach to Leverage Big Model Dynamics for Efficient Small Model Training
topic	Machine Learning
url	https://arxiv.org/abs/2402.05033

Similar Items