:: Library Catalog

Bibliographic Details
Main Authors:	Li, Zehan; Yang, Yan; Li, Xueqing; Kang, Jian; Zhang, Xiao-Lei; Li, Jie
Format:	Preprint
Published:	2025
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2509.01900

Similar Items

Bridging the Gap between Continuous and Informative Discrete Representations by Random Product Quantization
by: Li, Xueqing, et al.
Published: (2025)

EmoSSLSphere: Multilingual Emotional Speech Synthesis with Spherical Vectors and Discrete Speech Tokens
by: Park, Joonyong, et al.
Published: (2025)

Phone-purity Guided Discrete Tokens for Dysarthric Speech Recognition
by: Wang, Huimeng, et al.
Published: (2025)

Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models
by: Li, Longhao, et al.
Published: (2026)

SELM: Speech Enhancement Using Discrete Tokens and Language Models
by: Wang, Ziqian, et al.
Published: (2023)

Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis
by: Xu, Tianyi, et al.
Published: (2025)

BoSS: Beyond-Semantic Speech
by: Wang, Qing, et al.
Published: (2025)

Fusion of Discrete Representations and Self-Augmented Representations for Multilingual Automatic Speech Recognition
by: Wang, Shih-heng, et al.
Published: (2024)

GOAT-TTS: Expressive and Realistic Speech Generation via A Dual-Branch LLM
by: Song, Yaodong, et al.
Published: (2025)

Exploring SSL Discrete Tokens for Multilingual ASR
by: Cui, Mingyu, et al.
Published: (2024)

Children's Speech Recognition through Discrete Token Enhancement
by: Sukhadia, Vrunda N., et al.
Published: (2024)

Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study
by: Chen, Peikun, et al.
Published: (2024)

High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model
by: Lee, Joun Yeop, et al.
Published: (2024)

Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation
by: Kim, Minsu, et al.
Published: (2024)

Confidence-based Filtering for Speech Dataset Curation with Generative Speech Enhancement Using Discrete Tokens
by: Yamauchi, Kazuki, et al.
Published: (2026)

Discrete Diffusion for Generative Modeling of Text-Aligned Speech Tokens
by: Ku, Pin-Jui, et al.
Published: (2025)

A Composite Predictive-Generative Approach to Monaural Universal Speech Enhancement
by: Zhang, Jie, et al.
Published: (2025)

Acoustic BPE for Speech Generation with Discrete Tokens
by: Shen, Feiyu, et al.
Published: (2023)

Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty
by: Xue, Hongfei, et al.
Published: (2025)

Exploring Cross-Utterance Speech Contexts for Conformer-Transducer Speech Recognition Systems
by: Cui, Mingyu, et al.
Published: (2025)

Low Bitrate High-Quality RVQGAN-based Discrete Speech Tokenizer
by: Shechtman, Slava, et al.
Published: (2024)

Lessons Learnt: Revisit Key Training Strategies for Effective Speech Emotion Recognition in the Wild
by: Tzeng, Jing-Tong, et al.
Published: (2025)

Continuous Speech Tokenizer in Text To Speech
by: Li, Yixing, et al.
Published: (2024)

Benchmarking Large Pretrained Multilingual Models on Québec French Speech Recognition
by: Serrand, Coralie, et al.
Published: (2025)

SSHR: Leveraging Self-supervised Hierarchical Representations for Multilingual Automatic Speech Recognition
by: Xue, Hongfei, et al.
Published: (2023)

Enhancing Fully Formatted End-to-End Speech Recognition with Knowledge Distillation via Multi-Codebook Vector Quantization
by: You, Jian, et al.
Published: (2025)

Recent Advances in Discrete Speech Tokens: A Review
by: Guo, Yiwei, et al.
Published: (2025)

SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
by: Zhang, Xin, et al.
Published: (2023)

Prosody as Supervision: Bridging the Non-Verbal--Verbal for Multilingual Speech Emotion Recognition
by: Girish, et al.
Published: (2026)

Bridging the Modality Gap: Softly Discretizing Audio Representation for LLM-based Automatic Speech Recognition
by: Yang, Mu, et al.
Published: (2025)

Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation
by: Liu, Wenrui, et al.
Published: (2025)

Identifying and Calibrating Overconfidence in Noisy Speech Recognition
by: Huo, Mingyue, et al.
Published: (2025)

Speech Emotion Recognition with ASR Integration
by: Li, Yuanchao
Published: (2026)

Cross-lingual Embedding Clustering for Hierarchical Softmax in Low-Resource Multilingual Speech Recognition
by: Yang, Zhengdong, et al.
Published: (2025)

Speaker Contrastive Learning for Source Speaker Tracing
by: Wang, Qing, et al.
Published: (2024)

Accent Normalization Using Self-Supervised Discrete Tokens with Non-Parallel Data
by: Bai, Qibing, et al.
Published: (2025)

Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model
by: Wang, Siyang, et al.
Published: (2024)

Rare Word Recognition and Translation Without Fine-Tuning via Task Vector in Speech Models
by: Jing, Ruihao, et al.
Published: (2025)

S2ST-Omni: Hierarchical Language-Aware SpeechLLM Adaptation for Multilingual Speech-to-Speech Translation
by: Pan, Yu, et al.
Published: (2025)

Multi-Scale Temporal Transformer For Speech Emotion Recognition
by: Li, Zhipeng, et al.
Published: (2024)