Bibliographic Details
Main Authors: Sadok, Samir; Lathuilière, Stéphane; Alameda-Pineda, Xavier
Format: Preprint
Published: 2026
Online Access: https://arxiv.org/abs/2601.19399
Table of Contents:
  • Recent speech modeling relies on explicit attributes such as pitch, content, and speaker identity, but these alone cannot capture the full richness of natural speech. We introduce RT-MAE, a novel masked autoencoder framework that augments supervised attribute-based modeling with unsupervised residual trainable tokens, designed to encode the information not explained by the explicit labeled factors (e.g., timbre variations, noise, and emotion). Experiments show that RT-MAE improves reconstruction quality, preserving content and speaker similarity while enhancing expressivity. We further demonstrate its applicability to speech enhancement, removing noise at inference while maintaining controllability and naturalness.