Bibliographic Details
Main Authors: Male, Prabash Reddy; Ray, Swayambhu Nath; Arsikere, Harish; Jaiswal, Akshat; Swarup, Prakhar; Sen, Prantik; Chakrabarty, Debmalya; Girish, K V Vijay; Bhave, Nikhil; Weber, Frederick; Bhattacharya, Sambuddha; Garimella, Sri
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2505.19774
Table of Contents:
  • Recent advancements in speech encoders have drawn attention due to their integration with Large Language Models for various speech tasks. While most research has focused on either causal or full-context speech encoders, there has been limited exploration of how to handle both streaming and non-streaming applications effectively while achieving state-of-the-art performance. We introduce DuRep, a Dual-mode Speech Representation learning setup, which enables a single speech encoder to function efficiently in both offline and online modes across downstream tasks, without additional parameters or mode-specific adjustments. DuRep-200M, our 200M-parameter dual-mode encoder, achieves improvements of 12% and 11.6% in streaming and non-streaming modes, respectively, over baseline encoders on Multilingual ASR. Scaling this approach to 2B parameters, DuRep-2B sets new performance benchmarks across ASR and non-ASR tasks. Our analysis reveals interesting trade-offs between acoustic and semantic information across encoder layers.