:: Library Catalog

Bibliographic Details
Main Authors:	Guo, Jinxi; Moritz, Niko; Ma, Yingyi; Seide, Frank; Wu, Chunyang; Mahadeokar, Jay; Kalinli, Ozlem; Fuegen, Christian; Seltzer, Mike
Format:	Preprint
Published:	2024
Subjects:	Audio and Speech Processing; Artificial Intelligence; Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2404.01716

Similar Items

Transducer-Llama: Integrating LLMs into Streamable Transducer-based Speech Recognition
by: Deng, Keqi, et al.
Published: (2024)

Efficient Streaming LLM for Speech Recognition
by: Jia, Junteng, et al.
Published: (2024)

Can Speech LLMs Think while Listening?
by: Shih, Yi-Jen, et al.
Published: (2025)

Faster Speech-LLaMA Inference with Multi-token Prediction
by: Raj, Desh, et al.
Published: (2024)

Effective Text Adaptation for LLM-based ASR through Soft Prompt Fine-Tuning
by: Ma, Yingyi, et al.
Published: (2024)

CJST: CTC Compressor based Joint Speech and Text Training for Decoder-Only ASR
by: Zhou, Wei, et al.
Published: (2024)

AGADIR: Towards Array-Geometry Agnostic Directional Speech Recognition
by: Lin, Ju, et al.
Published: (2024)

Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model
by: Xie, Jiamin, et al.
Published: (2023)

M-BEST-RQ: A Multi-Channel Speech Foundation Model for Smart Glasses
by: Yang, Yufeng, et al.
Published: (2024)

Transcribing and Translating, Fast and Slow: Joint Speech Translation and Recognition
by: Moritz, Niko, et al.
Published: (2024)

Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech
by: Kang, Wonjune, et al.
Published: (2024)

MELD: Mel-Spectrogram-Based Speech Language Modeling with Discrete Latent Variables
by: Yeh, Sung-Lin, et al.
Published: (2026)

Towards scalable efficient on-device ASR with transfer learning
by: Pandey, Laxmi, et al.
Published: (2024)

Token-Weighted RNN-T for Learning from Flawed Data
by: Keren, Gil, et al.
Published: (2024)

Towards measuring fairness in speech recognition: Fair-Speech dataset
by: Veliche, Irina-Elena, et al.
Published: (2024)

Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens
by: Zhao, Jinzheng, et al.
Published: (2024)

Speech ReaLLM -- Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time
by: Seide, Frank, et al.
Published: (2024)

Conversational Speech Naturalness Predictor
by: Xu, Anfeng, et al.
Published: (2026)

Navigating the Minefield of MT Beam Search in Cascaded Streaming Speech Translation
by: Rabatin, Rastislav, et al.
Published: (2024)

Directional Source Separation for Robust Speech Recognition on Smart Glasses
by: Feng, Tiantian, et al.
Published: (2023)

Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation
by: Shen, Maohao, et al.
Published: (2024)

Towards audio language modeling -- an overview
by: Wu, Haibin, et al.
Published: (2024)

Single-channel speech enhancement by using psychoacoustical model inspired fusion framework
by: Samui, Suman
Published: (2022)

AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs
by: Fathullah, Yassir, et al.
Published: (2023)

SLM-S2ST: A multimodal language model for direct speech-to-speech translation
by: Hu, Yuxuan, et al.
Published: (2025)

Speaker anonymization using neural audio codec language models
by: Panariello, Michele, et al.
Published: (2023)

PROCTER: PROnunciation-aware ConTextual adaptER for personalized speech recognition in neural transducers
by: Pandey, Rahul, et al.
Published: (2023)

Progressive unsupervised domain adaptation for ASR using ensemble models and multi-stage training
by: Ahmad, Rehan, et al.
Published: (2024)

Selecting N-lowest scores for training MOS prediction models
by: Kondo, Yuto, et al.
Published: (2025)

Self-consistent context aware conformer transducer for speech recognition
by: Kolokolov, Konstantin, et al.
Published: (2024)

Building English ASR model with regional language support
by: Agrawal, Purvi, et al.
Published: (2025)

Phoneme-based speech recognition driven by large language models and sampling marginalization
by: Ma, Te, et al.
Published: (2025)

Bridging the gap between training and inference in LM-based TTS models
by: Zhang, Ruonan, et al.
Published: (2025)

ParaCLAP -- Towards a general language-audio model for computational paralinguistic tasks
by: Jing, Xin, et al.
Published: (2024)

Encoding of lexical tone in self-supervised models of spoken language
by: Shen, Gaofei, et al.
Published: (2024)

How to train your ears: Auditory-model emulation for large-dynamic-range inputs and mild-to-severe hearing losses
by: Leer, Peter, et al.
Published: (2024)

Mellow: a small audio language model for reasoning
by: Deshmukh, Soham, et al.
Published: (2025)

Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training
by: Li, Yiming, et al.
Published: (2024)

A Domain Adaptation Framework for Speech Recognition Systems with Only Synthetic data
by: Tran, Minh, et al.
Published: (2025)

Word-wise intonation model for cross-language TTS systems
by: Tomilov, A. A., et al.
Published: (2024)