| Main Authors: | Kang, Yongqi, Zhao, Yong |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2510.02320 |
Similar Items
Combining TF-GridNet and Mixture Encoder for Continuous Speech Separation for Meeting Transcription
by: Vieting, Peter, et al.
Published: (2023)
Coupling Speech Encoders with Downstream Text Models
by: Chelba, Ciprian, et al.
Published: (2024)
Aligning Spoken Dialogue Models from User Interactions
by: Wu, Anne, et al.
Published: (2025)
Conversational Rubert for Detecting Competitive Interruptions in ASR-Transcribed Dialogues
by: Galimzianov, Dmitrii, et al.
Published: (2024)
WavChat: A Survey of Spoken Dialogue Models
by: Ji, Shengpeng, et al.
Published: (2024)
Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents
by: Veluri, Bandhav, et al.
Published: (2024)
Task Oriented Dialogue as a Catalyst for Self-Supervised Automatic Speech Recognition
by: Chan, David M., et al.
Published: (2024)
Lightweight Zero-shot Text-to-Speech with Mixture of Adapters
by: Fujita, Kenichi, et al.
Published: (2024)
Style Mixture of Experts for Expressive Text-To-Speech Synthesis
by: Jawaid, Ahad, et al.
Published: (2024)
MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders
by: Zhang, Wenyu, et al.
Published: (2024)
A Joint Spectro-Temporal Relational Thinking Based Acoustic Modeling Framework
by: Nan, Zheng, et al.
Published: (2024)
Transformer-based Model for ASR N-Best Rescoring and Rewriting
by: Kang, Iwen E., et al.
Published: (2024)
CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition
by: Zhou, Jiaming, et al.
Published: (2025)
Error Analysis in a Modular Meeting Transcription System
by: Vieting, Peter, et al.
Published: (2025)
An Analysis of Linear Complexity Attention Substitutes with BEST-RQ
by: Whetten, Ryan, et al.
Published: (2024)
Align With Purpose: Optimize Desired Properties in CTC Models with a General Plug-and-Play Framework
by: Segev, Eliya, et al.
Published: (2023)
RUMAA: Repeat-Aware Unified Music Audio Analysis for Score-Performance Alignment, Transcription, and Mistake Detection
by: Chang, Sungkyun, et al.
Published: (2025)
TESU-LLM: Training Speech-LLMs Without Speech via Unified Encoder Alignment
by: Kim, Taesoo, et al.
Published: (2025)
PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems
by: Mitsui, Kentaro, et al.
Published: (2024)
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension
by: Yang, Qian, et al.
Published: (2024)
Audio Dialogues: Dialogues dataset for audio and music understanding
by: Goel, Arushi, et al.
Published: (2024)
DeepDialogue: A Multi-Turn Emotionally-Rich Spoken Dialogue Dataset
by: Koudounas, Alkis, et al.
Published: (2025)
Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like
by: Kanda, Naoyuki, et al.
Published: (2024)
Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
by: Gu, Zijin, et al.
Published: (2025)
Lyrics Transcription for Humans: A Readability-Aware Benchmark
by: Cífka, Ondřej, et al.
Published: (2024)
TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics
by: Lin, Yi-Cheng, et al.
Published: (2025)
The State Of TTS: A Case Study with Human Fooling Rates
by: Varadhan, Praveen Srinivasa, et al.
Published: (2025)
Speech Robust Bench: A Robustness Benchmark For Speech Recognition
by: Shah, Muhammad A., et al.
Published: (2024)
A multilingual training strategy for low resource Text to Speech
by: Amalas, Asma, et al.
Published: (2024)
SELMA: A Speech-Enabled Language Model for Virtual Assistant Interactions
by: Wagner, Dominik, et al.
Published: (2025)
MelHuBERT: A simplified HuBERT on Mel spectrograms
by: Lin, Tzu-Quan, et al.
Published: (2022)
ProMode: A Speech Prosody Model Conditioned on Acoustic and Textual Inputs
by: Eren, Eray, et al.
Published: (2025)
DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models
by: Chang, Heng-Jui, et al.
Published: (2024)
Sortformer: A Novel Approach for Permutation-Resolved Speaker Supervision in Speech-to-Text Systems
by: Park, Taejin, et al.
Published: (2024)
Remastering Divide and Remaster: A Cinematic Audio Source Separation Dataset with Multilingual Support
by: Watcharasupat, Karn N., et al.
Published: (2024)
Prosodic ABX: A Language-Agnostic Method for Measuring Prosodic Contrast in Speech Representations
by: Sun, Haitong, et al.
Published: (2026)
A low latency attention module for streaming self-supervised speech representation learning
by: Ma, Jianbo, et al.
Published: (2023)
A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation
by: Liu, Alexander H., et al.
Published: (2024)
A light-weight and efficient punctuation and word casing prediction model for on-device streaming ASR
by: You, Jian, et al.
Published: (2024)
Sonos Voice Control Bias Assessment Dataset: A Methodology for Demographic Bias Assessment in Voice Assistants
by: Sekkat, Chloé, et al.
Published: (2024)