Bibliographic Details
Main Authors: Szewczyk, Konrad; Fernández, Daniel Gallo; Townsend, James
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2510.02401
Abstract:
  • Learning to generate audio waveforms directly in an autoregressive manner is challenging because the raw sequences are long and contain important structure on many different timescales. Traditional approaches based on recurrent neural networks, as well as causal convolutions and self-attention, have had only limited success on this task. However, recent work has shown that deep state space models, also referred to as linear RNNs, can be highly efficient in this context. In this work, we push the boundaries of linear RNNs applied to raw audio modeling, investigating the effects of different architectural choices and using context-parallelism to enable training on sequences up to one minute (1M tokens) in length. We present a model, HarmonicRNN, which attains state-of-the-art log-likelihoods and perceptual metrics on small-scale datasets.