| Main Authors: | Guo, Jinxi, Moritz, Niko, Ma, Yingyi, Seide, Frank, Wu, Chunyang, Mahadeokar, Jay, Kalinli, Ozlem, Fuegen, Christian, Seltzer, Mike |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2404.01716 |
Similar Items
Transducer-Llama: Integrating LLMs into Streamable Transducer-based Speech Recognition
by: Deng, Keqi, et al.
Published: (2024)
Efficient Streaming LLM for Speech Recognition
by: Jia, Junteng, et al.
Published: (2024)
Can Speech LLMs Think while Listening?
by: Shih, Yi-Jen, et al.
Published: (2025)
Faster Speech-LLaMA Inference with Multi-token Prediction
by: Raj, Desh, et al.
Published: (2024)
Effective Text Adaptation for LLM-based ASR through Soft Prompt Fine-Tuning
by: Ma, Yingyi, et al.
Published: (2024)
CJST: CTC Compressor based Joint Speech and Text Training for Decoder-Only ASR
by: Zhou, Wei, et al.
Published: (2024)
AGADIR: Towards Array-Geometry Agnostic Directional Speech Recognition
by: Lin, Ju, et al.
Published: (2024)
Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model
by: Xie, Jiamin, et al.
Published: (2023)
M-BEST-RQ: A Multi-Channel Speech Foundation Model for Smart Glasses
by: Yang, Yufeng, et al.
Published: (2024)
Transcribing and Translating, Fast and Slow: Joint Speech Translation and Recognition
by: Moritz, Niko, et al.
Published: (2024)
Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech
by: Kang, Wonjune, et al.
Published: (2024)
MELD: Mel-Spectrogram-Based Speech Language Modeling with Discrete Latent Variables
by: Yeh, Sung-Lin, et al.
Published: (2026)
Towards scalable efficient on-device ASR with transfer learning
by: Pandey, Laxmi, et al.
Published: (2024)
Token-Weighted RNN-T for Learning from Flawed Data
by: Keren, Gil, et al.
Published: (2024)
Towards measuring fairness in speech recognition: Fair-Speech dataset
by: Veliche, Irina-Elena, et al.
Published: (2024)
Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens
by: Zhao, Jinzheng, et al.
Published: (2024)
Speech ReaLLM -- Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time
by: Seide, Frank, et al.
Published: (2024)
Conversational Speech Naturalness Predictor
by: Xu, Anfeng, et al.
Published: (2026)
Navigating the Minefield of MT Beam Search in Cascaded Streaming Speech Translation
by: Rabatin, Rastislav, et al.
Published: (2024)
Directional Source Separation for Robust Speech Recognition on Smart Glasses
by: Feng, Tiantian, et al.
Published: (2023)
Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation
by: Shen, Maohao, et al.
Published: (2024)
Towards audio language modeling -- an overview
by: Wu, Haibin, et al.
Published: (2024)
Single-channel speech enhancement by using psychoacoustical model inspired fusion framework
by: Samui, Suman
Published: (2022)
AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs
by: Fathullah, Yassir, et al.
Published: (2023)
SLM-S2ST: A multimodal language model for direct speech-to-speech translation
by: Hu, Yuxuan, et al.
Published: (2025)
Speaker anonymization using neural audio codec language models
by: Panariello, Michele, et al.
Published: (2023)
PROCTER: PROnunciation-aware ConTextual adaptER for personalized speech recognition in neural transducers
by: Pandey, Rahul, et al.
Published: (2023)
Progressive unsupervised domain adaptation for ASR using ensemble models and multi-stage training
by: Ahmad, Rehan, et al.
Published: (2024)
Selecting N-lowest scores for training MOS prediction models
by: Kondo, Yuto, et al.
Published: (2025)
Self-consistent context aware conformer transducer for speech recognition
by: Kolokolov, Konstantin, et al.
Published: (2024)
Building English ASR model with regional language support
by: Agrawal, Purvi, et al.
Published: (2025)
Phoneme-based speech recognition driven by large language models and sampling marginalization
by: Ma, Te, et al.
Published: (2025)
Bridging the gap between training and inference in LM-based TTS models
by: Zhang, Ruonan, et al.
Published: (2025)
ParaCLAP -- Towards a general language-audio model for computational paralinguistic tasks
by: Jing, Xin, et al.
Published: (2024)
Encoding of lexical tone in self-supervised models of spoken language
by: Shen, Gaofei, et al.
Published: (2024)
How to train your ears: Auditory-model emulation for large-dynamic-range inputs and mild-to-severe hearing losses
by: Leer, Peter, et al.
Published: (2024)
Mellow: a small audio language model for reasoning
by: Deshmukh, Soham, et al.
Published: (2025)
Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training
by: Li, Yiming, et al.
Published: (2024)
A Domain Adaptation Framework for Speech Recognition Systems with Only Synthetic data
by: Tran, Minh, et al.
Published: (2025)
Word-wise intonation model for cross-language TTS systems
by: A., Tomilov A., et al.
Published: (2024)