:: Library Catalog

Bibliographic Details
Main Authors:	Kang, Yongqi; Zhao, Yong
Format:	Preprint
Published:	2025
Subjects:	Audio and Speech Processing; Computation and Language; Machine Learning; Sound
Online Access:	https://arxiv.org/abs/2510.02320

Similar Items

Combining TF-GridNet and Mixture Encoder for Continuous Speech Separation for Meeting Transcription
by: Vieting, Peter, et al.
Published: (2023)

Coupling Speech Encoders with Downstream Text Models
by: Chelba, Ciprian, et al.
Published: (2024)

Aligning Spoken Dialogue Models from User Interactions
by: Wu, Anne, et al.
Published: (2025)

Conversational Rubert for Detecting Competitive Interruptions in ASR-Transcribed Dialogues
by: Galimzianov, Dmitrii, et al.
Published: (2024)

WavChat: A Survey of Spoken Dialogue Models
by: Ji, Shengpeng, et al.
Published: (2024)

Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents
by: Veluri, Bandhav, et al.
Published: (2024)

Task Oriented Dialogue as a Catalyst for Self-Supervised Automatic Speech Recognition
by: Chan, David M., et al.
Published: (2024)

Lightweight Zero-shot Text-to-Speech with Mixture of Adapters
by: Fujita, Kenichi, et al.
Published: (2024)

Style Mixture of Experts for Expressive Text-To-Speech Synthesis
by: Jawaid, Ahad, et al.
Published: (2024)

MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders
by: Zhang, Wenyu, et al.
Published: (2024)

A Joint Spectro-Temporal Relational Thinking Based Acoustic Modeling Framework
by: Nan, Zheng, et al.
Published: (2024)

Transformer-based Model for ASR N-Best Rescoring and Rewriting
by: Kang, Iwen E., et al.
Published: (2024)

CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition
by: Zhou, Jiaming, et al.
Published: (2025)

Error Analysis in a Modular Meeting Transcription System
by: Vieting, Peter, et al.
Published: (2025)

An Analysis of Linear Complexity Attention Substitutes with BEST-RQ
by: Whetten, Ryan, et al.
Published: (2024)

Align With Purpose: Optimize Desired Properties in CTC Models with a General Plug-and-Play Framework
by: Segev, Eliya, et al.
Published: (2023)

RUMAA: Repeat-Aware Unified Music Audio Analysis for Score-Performance Alignment, Transcription, and Mistake Detection
by: Chang, Sungkyun, et al.
Published: (2025)

TESU-LLM: Training Speech-LLMs Without Speech via Unified Encoder Alignment
by: Kim, Taesoo, et al.
Published: (2025)

PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems
by: Mitsui, Kentaro, et al.
Published: (2024)

AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension
by: Yang, Qian, et al.
Published: (2024)

Audio Dialogues: Dialogues dataset for audio and music understanding
by: Goel, Arushi, et al.
Published: (2024)

DeepDialogue: A Multi-Turn Emotionally-Rich Spoken Dialogue Dataset
by: Koudounas, Alkis, et al.
Published: (2025)

Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like
by: Kanda, Naoyuki, et al.
Published: (2024)

Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
by: Gu, Zijin, et al.
Published: (2025)

Lyrics Transcription for Humans: A Readability-Aware Benchmark
by: Cífka, Ondřej, et al.
Published: (2024)

TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics
by: Lin, Yi-Cheng, et al.
Published: (2025)

The State Of TTS: A Case Study with Human Fooling Rates
by: Varadhan, Praveen Srinivasa, et al.
Published: (2025)

Speech Robust Bench: A Robustness Benchmark For Speech Recognition
by: Shah, Muhammad A., et al.
Published: (2024)

A multilingual training strategy for low resource Text to Speech
by: Amalas, Asma, et al.
Published: (2024)

SELMA: A Speech-Enabled Language Model for Virtual Assistant Interactions
by: Wagner, Dominik, et al.
Published: (2025)

MelHuBERT: A simplified HuBERT on Mel spectrograms
by: Lin, Tzu-Quan, et al.
Published: (2022)

ProMode: A Speech Prosody Model Conditioned on Acoustic and Textual Inputs
by: Eren, Eray, et al.
Published: (2025)

DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models
by: Chang, Heng-Jui, et al.
Published: (2024)

Sortformer: A Novel Approach for Permutation-Resolved Speaker Supervision in Speech-to-Text Systems
by: Park, Taejin, et al.
Published: (2024)

Remastering Divide and Remaster: A Cinematic Audio Source Separation Dataset with Multilingual Support
by: Watcharasupat, Karn N., et al.
Published: (2024)

Prosodic ABX: A Language-Agnostic Method for Measuring Prosodic Contrast in Speech Representations
by: Sun, Haitong, et al.
Published: (2026)

A low latency attention module for streaming self-supervised speech representation learning
by: Ma, Jianbo, et al.
Published: (2023)

A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation
by: Liu, Alexander H., et al.
Published: (2024)

A light-weight and efficient punctuation and word casing prediction model for on-device streaming ASR
by: You, Jian, et al.
Published: (2024)

Sonos Voice Control Bias Assessment Dataset: A Methodology for Demographic Bias Assessment in Voice Assistants
by: Sekkat, Chloé, et al.
Published: (2024)