Bibliographic Details
Main Authors: Mazzawi, Hanna; Awasthi, Pranjal; Gonzalvo, Xavi; Ramalingam, Srikumar
Format: Preprint
Published: 2024
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2402.05033
author Mazzawi, Hanna
Awasthi, Pranjal
Gonzalvo, Xavi
Ramalingam, Srikumar
contents Recent breakthroughs in, and successful deployments of, large language and vision models in constrained environments predominantly follow a two-phase approach. First, large models are trained to achieve peak performance; then a model-shrinking method is applied to meet hardware constraints. Methods such as distillation, compression, or quantization leverage the highly performant large models to induce smaller performant ones. Formally, this can be seen as the problem of identifying an optimal model of size $n$ from a larger model of size $k \cdot n$, where $k > 1$ is the overparameterization factor. This paper explores the hypothesis that a single training run can simultaneously train a larger model for performance and derive a smaller model for deployment. Our contribution is an effective architectural change, namely Majority Kernels, that is compatible with the main standard architectures such as multi-layer perceptrons (MLPs), residual networks (ResNets), and Transformers. We demonstrate that applying our technique can modify the training dynamics, resulting in performance gains across architectures and tasks while keeping inference performance consistent. Furthermore, our approach adds minimal overhead to the training cost (wall-clock time). The proposed approach shows strong performance on a wide variety of datasets and models, even outperforming strong baselines such as distilled ensembles and combinatorial optimization methods based on submodular optimization.
format Preprint
id arxiv_https___arxiv_org_abs_2402_05033
institution arXiv
publishDate 2024
record_format arxiv
title Majority Kernels: An Approach to Leverage Big Model Dynamics for Efficient Small Model Training
topic Machine Learning
url https://arxiv.org/abs/2402.05033