Staff View: ULTRAS -- Unified Learning of Transformer Representations for Audio and Speech Signals :: Library Catalog

Bibliographic Details
Main Authors:	E, Ameenudeen P; Narayanan, Charumathi; Ganapathy, Sriram
Format:	Preprint
Published:	2026
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2604.06702

_version_	1866917391285354496
author	E, Ameenudeen P; Narayanan, Charumathi; Ganapathy, Sriram
author_facet	E, Ameenudeen P; Narayanan, Charumathi; Ganapathy, Sriram
contents	Self-supervised learning (SSL) has driven impressive advances in speech processing by adopting time-domain prediction objectives, while audio representation learning frameworks operate on time-frequency spectrograms. Models optimized for one paradigm struggle to transfer to the other, highlighting the need for a joint framework. We propose Unified Learning of Transformer Representations for Audio and Speech (ULTRAS), where the masking and predictive modeling is performed over long patches of the data. The model, based on the transformer architecture, encodes spectral-patches of log-mel spectrogram features. The predictive modeling of masked segments is performed on spectral and temporal targets using a combined loss-function, forcing the representations to encode time and frequency traits. Experiments are performed on a variety of speech and audio tasks, where we illustrate that the ULTRAS framework achieves improved performance over other established baselines.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_06702
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	ULTRAS -- Unified Learning of Transformer Representations for Audio and Speech Signals E, Ameenudeen P Narayanan, Charumathi Ganapathy, Sriram Audio and Speech Processing Self-supervised learning (SSL) has driven impressive advances in speech processing by adopting time-domain prediction objectives, while audio representation learning frameworks operate on time-frequency spectrograms. Models optimized for one paradigm struggle to transfer to the other, highlighting the need for a joint framework. We propose Unified Learning of Transformer Representations for Audio and Speech (ULTRAS), where the masking and predictive modeling is performed over long patches of the data. The model, based on the transformer architecture, encodes spectral-patches of log-mel spectrogram features. The predictive modeling of masked segments is performed on spectral and temporal targets using a combined loss-function, forcing the representations to encode time and frequency traits. Experiments are performed on a variety of speech and audio tasks, where we illustrate that the ULTRAS framework achieves improved performance over other established baselines.
title	ULTRAS -- Unified Learning of Transformer Representations for Audio and Speech Signals
topic	Audio and Speech Processing
url	https://arxiv.org/abs/2604.06702

Similar Items