Bibliographic Details
Main Author: Huang, Allen Hao
Format: Preprint
Published: 2024
Subjects: Neural and Evolutionary Computing; Machine Learning
Online Access: https://arxiv.org/abs/2405.20768
_version_ 1866909213980098560
author Huang, Allen Hao
author_facet Huang, Allen Hao
contents Activation functions are core components of all deep learning architectures. Currently, the most popular activation functions are smooth ReLU variants like GELU and SiLU. These are self-gated activation functions where the range of the gating function is between zero and one. In this paper, we explore the viability of using arctan as a gating mechanism. A self-gated activation function that uses arctan as its gating function has a monotonically increasing first derivative. To make this activation function competitive, it is necessary to introduce a trainable parameter for every MLP block to expand the range of the gating function beyond zero and one. We find that this technique also improves existing self-gated activation functions. We conduct an empirical evaluation of Expanded ArcTan Linear Unit (xATLU), Expanded GELU (xGELU), and Expanded SiLU (xSiLU) and show that they outperform existing activation functions within a transformer architecture. Additionally, expanded gating ranges show promising results in improving first-order Gated Linear Units (GLU).
format Preprint
id arxiv_https___arxiv_org_abs_2405_20768
institution arXiv
publishDate 2024
record_format arxiv
title Expanded Gating Ranges Improve Activation Functions
topic Neural and Evolutionary Computing
Machine Learning
url https://arxiv.org/abs/2405.20768