| Main Authors: | Wang, Liming, Ni, Junrui, Chang, Kai-Wei, Bhati, Saurabhchand, Harwath, David, Hasegawa-Johnson, Mark, Glass, James R. |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2510.03639 |
Similar Items
Towards Audio Token Compression in Large Audio Language Models
by: Bhati, Saurabhchand, et al.
Published: (2025)
SyllableLM: Learning Coarse Semantic Units for Speech Language Models
by: Baade, Alan, et al.
Published: (2024)
Towards Unsupervised Speech Recognition Without Pronunciation Models
by: Ni, Junrui, et al.
Published: (2024)
Recognizing Dementia from Neuropsychological Tests with State Space Models
by: Wang, Liming, et al.
Published: (2025)
USAD: Universal Speech and Audio Representation via Distillation
by: Chang, Heng-Jui, et al.
Published: (2025)
State-Space Large Audio Language Models
by: Bhati, Saurabhchand, et al.
Published: (2024)
findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding
by: Martínez, Héctor Javier Vázquez
Published: (2026)
TICL+: A Case Study On Speech In-Context Learning for Children's Speech Recognition
by: Zheng, Haolong, et al.
Published: (2025)
Self-supervised Speech Models for Word-Level Stuttered Speech Detection
by: Shih, Yi-Jen, et al.
Published: (2024)
Unifying Model and Layer Fusion for Speech Foundation Models
by: Shih, Yi-Jen, et al.
Published: (2025)
TICL: Text-Embedding KNN For Speech In-Context Learning Unlocks Speech Recognition Abilities of Large Multimodal Models
by: Zheng, Haolong, et al.
Published: (2025)
ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining
by: Diwan, Anuj, et al.
Published: (2026)
MetaSICL: Adapting Auditory LLM via Meta Speech In-Context Learning
by: Zheng, Haolong, et al.
Published: (2026)
Scaling Rich Style-Prompted Text-to-Speech Datasets
by: Diwan, Anuj, et al.
Published: (2025)
UITron-Speech: Towards Automated GUI Agents Based on Speech Instructions
by: Han, Wenkang, et al.
Published: (2025)
Song Form-aware Full-Song Text-to-Lyrics Generation with Multi-Level Granularity Syllable Count Control
by: Chae, Yunkee, et al.
Published: (2024)
Beyond Single-Shot: Multi-step Tool Retrieval via Query Planning
by: Fang, Wei, et al.
Published: (2026)
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
by: Peng, Puyuan, et al.
Published: (2024)
Towards Robust Speech Recognition for Jamaican Patois Music Transcription
by: Madden, Jordan, et al.
Published: (2025)
TiCo: Time-Controllable Spoken Dialogue Model
by: Chang, Kai-Wei, et al.
Published: (2026)
SpeechJudge: Towards Human-Level Judgment for Speech Naturalness
by: Zhang, Xueyao, et al.
Published: (2025)
Syllable-level lyrics generation from melody exploiting character-level language model
by: Zhang, Zhe, et al.
Published: (2023)
ConfPO: Exploiting Policy Model Confidence for Critical Token Selection in Preference Optimization
by: Yoon, Hee Suk, et al.
Published: (2025)
Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition
by: Wang, Peng, et al.
Published: (2026)
Incorporating Error Level Noise Embedding for Improving LLM-Assisted Robustness in Persian Speech Recognition
by: Rahmani, Zahra, et al.
Published: (2025)
Error-preserving Automatic Speech Recognition of Young English Learners' Language
by: Michot, Janick, et al.
Published: (2024)
Something from Nothing: Data Augmentation for Robust Severity Level Estimation of Dysarthric Speech
by: Bae, Jaesung, et al.
Published: (2026)
ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models
by: Qian, Kaizhi, et al.
Published: (2025)
Killkan: The Automatic Speech Recognition Dataset for Kichwa with Morphosyntactic Information
by: Taguchi, Chihiro, et al.
Published: (2024)
Exploring In-Context Learning of Textless Speech Language Model for Speech Classification Tasks
by: Hsu, Ming-Hao, et al.
Published: (2023)
Breeze Taigi: Benchmarks and Models for Taiwanese Hokkien Speech Recognition and Synthesis
by: Lan, Yu-Siang, et al.
Published: (2026)
PLAY2PROMPT: Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play
by: Fang, Wei, et al.
Published: (2025)
Automatic Speech Recognition for Documenting Endangered Languages: Case Study of Ikema Miyakoan
by: Taguchi, Chihiro, et al.
Published: (2026)
Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn't
by: Taguchi, Chihiro, et al.
Published: (2024)
Improved Cross-Lingual Transfer Learning For Automatic Speech Translation
by: Khurana, Sameer, et al.
Published: (2023)
Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition
by: Jain, Yash, et al.
Published: (2024)
PACR: Progressively Ascending Confidence Reward for LLM Reasoning
by: Yoon, Eunseop, et al.
Published: (2025)
Improving Speech Recognition Error Prediction for Modern and Off-the-shelf Speech Recognizers
by: Serai, Prashant, et al.
Published: (2024)
ASKD-Whisper: Adaptive Self-knowledge Distillation for Efficient and Low-Latency Automatic Speech Recognition
by: Lee, Junseok, et al.
Published: (2026)
Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs
by: Tseng, Wei-Cheng, et al.
Published: (2025)