:: Library Catalog

Bibliographic Details
Main Authors:	Sankar, Sanjana; Lenglet, Martin; Bailly, Gerard; Beautemps, Denis; Hueber, Thomas
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2501.04799

Similar Items

Scaling Speech-Text Pre-training with Synthetic Interleaved Data
by: Zeng, Aohan, et al.
Published: (2024)

Is Self-Supervised Learning Enough to Fill in the Gap? A Study on Speech Inpainting
by: Asaad, Ihab, et al.
Published: (2024)

Generative Pre-training for Speech with Flow Matching
by: Liu, Alexander H., et al.
Published: (2023)

Speech-FT: Merging Pre-trained And Fine-Tuned Speech Representation Models For Cross-Task Generalization
by: Lin, Tzu-Quan, et al.
Published: (2025)

Mark My Words: A Robust Multilingual Model for Punctuation in Text and Speech Transcripts
by: Pulipaka, Sidharth, et al.
Published: (2025)

Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer
by: Zhu, Yongxin, et al.
Published: (2024)

Bridging Speech and Text: Enhancing ASR with Pinyin-to-Character Pre-training in LLMs
by: Yuhang, Yang, et al.
Published: (2024)

Robust Audiovisual Speech Recognition Models with Mixture-of-Experts
by: Wu, Yihan, et al.
Published: (2024)

Listening or Reading? Evaluating Speech Awareness in Chain-of-Thought Speech-to-Text Translation
by: Romero-Díaz, Jacobo, et al.
Published: (2025)

Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting?
by: Pareras, Oriol, et al.
Published: (2025)

UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation
by: Liu, Alexander H., et al.
Published: (2025)

A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification
by: Calbucura, Nicolas, et al.
Published: (2025)

Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody?
by: Tsiamas, Ioannis, et al.
Published: (2024)

Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition
by: Ma, Ziyang, et al.
Published: (2023)

MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance
by: Zhao, Xingjian, et al.
Published: (2025)

RASMALAI: Resources for Adaptive Speech Modeling in Indian Languages with Accents and Intonations
by: Sankar, Ashwin, et al.
Published: (2025)

SpeechAlign: a Framework for Speech Translation Alignment Evaluation
by: Alastruey, Belen, et al.
Published: (2023)

Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition
by: Cornell, Samuele, et al.
Published: (2024)

An Empirical Analysis of Discrete Unit Representations in Speech Language Modeling Pre-training
by: Labrak, Yanis, et al.
Published: (2025)

On Leveraging Encoder-only Pre-trained Language Models for Effective Keyphrase Generation
by: Wu, Di, et al.
Published: (2024)

InSerter: Speech Instruction Following with Unsupervised Interleaved Pre-training
by: Wang, Dingdong, et al.
Published: (2025)

STTATTS: Unified Speech-To-Text And Text-To-Speech Model
by: Toyin, Hawau Olamide, et al.
Published: (2024)

Leveraging Unit Language Guidance to Advance Speech Modeling in Textless Speech-to-Speech Translation
by: Zhang, Yuhao, et al.
Published: (2025)

Text-to-Code Generation with Modality-relative Pre-training
by: Christopoulou, Fenia, et al.
Published: (2024)

Continuous Speech Tokenizer in Text To Speech
by: Li, Yixing, et al.
Published: (2024)

LLM-to-Speech: A Synthetic Data Pipeline for Training Dialectal Text-to-Speech Models
by: Khamis, Ahmed Khaled, et al.
Published: (2026)

Speech Recognition Rescoring with Large Speech-Text Foundation Models
by: Shivakumar, Prashanth Gurunath, et al.
Published: (2024)

Attentive Merging of Hidden Embeddings from Pre-trained Speech Model for Anti-spoofing Detection
by: Pan, Zihan, et al.
Published: (2024)

Chunk Based Speech Pre-training with High Resolution Finite Scalar Quantization
by: Tang, Yun, et al.
Published: (2025)

BLSP-KD: Bootstrapping Language-Speech Pre-training via Knowledge Distillation
by: Wang, Chen, et al.
Published: (2024)

MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations
by: Yadav, Hemant, et al.
Published: (2024)

Speech-Copilot: Leveraging Large Language Models for Speech Processing via Task Decomposition, Modularization, and Program Generation
by: Kuan, Chun-Yi, et al.
Published: (2024)

A multilingual training strategy for low resource Text to Speech
by: Amalas, Asma, et al.
Published: (2024)

On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition
by: Rossenbach, Nick, et al.
Published: (2024)

LatentSpeech: Latent Diffusion for Text-To-Speech Generation
by: Lou, Haowei, et al.
Published: (2024)

Streaming Speech-to-Text Translation with a SpeechLLM
by: Parcollet, Titouan, et al.
Published: (2026)

Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs
by: Futami, Hayato, et al.
Published: (2025)

Streaming Speech-to-Confusion Network Speech Recognition
by: Filimonov, Denis, et al.
Published: (2023)

Revisiting Interpolation Augmentation for Speech-to-Text Generation
by: Xu, Chen, et al.
Published: (2024)

Enhancing Generalization of Speech Large Language Models with Multi-Task Behavior Imitation and Speech-Text Interleaving
by: Xie, Jingran, et al.
Published: (2025)