Bibliographic Details
Main Authors: Liu, Bo; Wang, Rui; Wu, Lemeng; Feng, Yihao; Stone, Peter; Liu, Qiang
Format: Preprint
Published: 2024
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2407.14207
author Liu, Bo
Wang, Rui
Wu, Lemeng
Feng, Yihao
Stone, Peter
Liu, Qiang
contents Modern large language models are built on sequence modeling via next-token prediction. While the Transformer remains the dominant architecture for sequence modeling, its quadratic decoding complexity in sequence length poses a major limitation. State-space models (SSMs) present a competitive alternative, offering linear decoding efficiency while maintaining parallelism during training. However, most existing SSMs rely on linear recurrence designs that appear somewhat ad hoc. In this work, we explore SSM design through the lens of online learning, conceptualizing SSMs as meta-modules for specific online learning problems. This approach links SSM design to formulating precise online learning objectives, with state transition rules derived from solving these objectives. Based on this insight, we introduce a novel deep SSM architecture, Longhorn, whose update resembles the closed-form solution for solving the online associative recall problem. Our experimental results show that Longhorn outperforms state-of-the-art SSMs, including the Mamba model, on standard sequence modeling benchmarks, language modeling, and vision tasks. Specifically, Longhorn achieves a 1.8x improvement in sample efficiency compared to Mamba, and can extrapolate over contexts that are up to 16x longer during inference.
format Preprint
id arxiv_https___arxiv_org_abs_2407_14207
institution arXiv
publishDate 2024
record_format arxiv
title Longhorn: State Space Models are Amortized Online Learners
topic Machine Learning
url https://arxiv.org/abs/2407.14207