MARC View :: Library Catalog

Bibliographic Details
Main Authors:	Jawahar, Ganesh; Yang, Haichuan; Xiong, Yunyang; Liu, Zechun; Wang, Dilin; Sun, Fei; Li, Meng; Pappu, Aasish; Oguz, Barlas; Abdul-Mageed, Muhammad; Lakshmanan, Laks V. S.; Krishnamoorthi, Raghuraman; Chandra, Vikas
Format:	Preprint
Published:	2023
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2306.04845

_version_	1866916350015832064
author	Jawahar, Ganesh; Yang, Haichuan; Xiong, Yunyang; Liu, Zechun; Wang, Dilin; Sun, Fei; Li, Meng; Pappu, Aasish; Oguz, Barlas; Abdul-Mageed, Muhammad; Lakshmanan, Laks V. S.; Krishnamoorthi, Raghuraman; Chandra, Vikas
author_facet	Jawahar, Ganesh; Yang, Haichuan; Xiong, Yunyang; Liu, Zechun; Wang, Dilin; Sun, Fei; Li, Meng; Pappu, Aasish; Oguz, Barlas; Abdul-Mageed, Muhammad; Lakshmanan, Laks V. S.; Krishnamoorthi, Raghuraman; Chandra, Vikas
contents	Weight-sharing supernets are crucial for performance estimation in cutting-edge neural architecture search (NAS) frameworks. Despite their ability to generate diverse subnetworks without retraining, the quality of these subnetworks is not guaranteed due to weight sharing. In NLP tasks like machine translation and pre-trained language modeling, there is a significant performance gap between supernet and training from scratch for the same model architecture, necessitating retraining post optimal architecture identification. This study introduces a solution called mixture-of-supernets, a generalized supernet formulation leveraging mixture-of-experts (MoE) to enhance supernet model expressiveness with minimal training overhead. Unlike conventional supernets, this method employs an architecture-based routing mechanism, enabling indirect sharing of model weights among subnetworks. This customization of weights for specific architectures, learned through gradient descent, minimizes retraining time, significantly enhancing training efficiency in NLP. The proposed method attains state-of-the-art (SoTA) performance in NAS for fast machine translation models, exhibiting a superior latency-BLEU tradeoff compared to HAT, the SoTA NAS framework for machine translation. Furthermore, it excels in NAS for building memory-efficient task-agnostic BERT models, surpassing NAS-BERT and AutoDistil across various model sizes. The code can be found at: https://github.com/UBC-NLP/MoS.
format	Preprint
id	arxiv_https___arxiv_org_abs_2306_04845
institution	arXiv
publishDate	2023
record_format	arxiv
title	Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts
topic	Computation and Language
url	https://arxiv.org/abs/2306.04845
