:: Library Catalog

Bibliographic Details
Main Authors:	Han, Minglun; Bai, Ye; Shen, Chen; Huang, Youjia; Huang, Mingkun; Lin, Zehua; Dong, Linhao; Lu, Lu; Wang, Yuxuan
Format:	Preprint
Published:	2024
Subjects:	Audio and Speech Processing; Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2409.08680

Similar Items

Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition
by: Bai, Ye, et al.
Published: (2024)

BiRQ: Bi-Level Self-Labeling Random Quantization for Self-Supervised Speech Recognition
by: Jiang, Liuyuan, et al.
Published: (2025)

Self-Supervised Singing Voice Pre-Training towards Speech-to-Singing Conversion
by: Li, Ruiqi, et al.
Published: (2024)

OMAR-RQ: Open Music Audio Representation Model Trained with Multi-Feature Masked Token Prediction
by: Alonso-Jiménez, Pablo, et al.
Published: (2025)

NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks
by: Huang, He, et al.
Published: (2024)

Comparison of Self-Supervised Speech Pre-Training Methods on Flemish Dutch
by: Poncelet, Jakob, et al.
Published: (2021)

M-BEST-RQ: A Multi-Channel Speech Foundation Model for Smart Glasses
by: Yang, Yufeng, et al.
Published: (2024)

SA-SOT: Speaker-Aware Serialized Output Training for Multi-Talker ASR
by: Fan, Zhiyun, et al.
Published: (2024)

Exploring Prediction Targets in Masked Pre-Training for Speech Foundation Models
by: Chen, Li-Wei, et al.
Published: (2024)

Adaptive Federated Fine-Tuning of Self-Supervised Speech Representations
by: Guo, Xin, et al.
Published: (2026)

SA-WavLM: Speaker-Aware Self-Supervised Pre-training for Mixture Speech
by: Lin, Jingru, et al.
Published: (2024)

Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation
by: Liu, Wenrui, et al.
Published: (2025)

Comparing Self-Supervised Learning Models Pre-Trained on Human Speech and Animal Vocalizations for Bioacoustics Processing
by: Sarkar, Eklavya, et al.
Published: (2025)

Investigating Zero-Shot Generalizability on Mandarin-English Code-Switched ASR and Speech-to-text Translation of Recent Foundation Models with Self-Supervision and Weak Supervision
by: Yang, Chih-Kai, et al.
Published: (2023)

Next Tokens Denoising for Speech Synthesis
by: Liu, Yanqing, et al.
Published: (2025)

Multilingual Zero Resource Speech Recognition Base on Self-Supervise Pre-Trained Acoustic Models
by: Wang, Haoyu, et al.
Published: (2022)

Text-guided HuBERT: Self-Supervised Speech Pre-training via Generative Adversarial Networks
by: Ma, Duo, et al.
Published: (2024)

A Study of Data Selection Strategies for Pre-training Self-Supervised Speech Models
by: Whetten, Ryan, et al.
Published: (2026)

Low-latency Speech Enhancement via Speech Token Generation
by: Xue, Huaying, et al.
Published: (2023)

Accent Normalization Using Self-Supervised Discrete Tokens with Non-Parallel Data
by: Bai, Qibing, et al.
Published: (2025)

Ambisonizer: Neural Upmixing as Spherical Harmonics Generation
by: Zang, Yongyi, et al.
Published: (2024)

Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction
by: Yang, Shu-wen, et al.
Published: (2025)

Rethinking Mamba in Speech Processing by Self-Supervised Models
by: Zhang, Xiangyu, et al.
Published: (2024)

Self-Supervised Speech Quality Assessment (S3QA): Leveraging Speech Foundation Models for a Scalable Speech Quality Metric
by: Ogg, Mattson, et al.
Published: (2025)

Comparing Unsupervised and Supervised Semantic Speech Tokens: A Case Study of Child ASR
by: Shi, Mohan, et al.
Published: (2025)

Emotion-Coherent Speech Data Augmentation and Self-Supervised Contrastive Style Training for Enhancing Kids's Story Speech Synthesis
by: Chung, Raymond
Published: (2026)

Discrete Diffusion for Generative Modeling of Text-Aligned Speech Tokens
by: Ku, Pin-Jui, et al.
Published: (2025)

HYFuse: Aligning Heterogeneous Speech Pre-Trained Representations in Hyperbolic Space for Speech Emotion Recognition
by: Phukan, Orchid Chetia, et al.
Published: (2025)

SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
by: Zhang, Xin, et al.
Published: (2023)

Hybrid Pruning: In-Situ Compression of Self-Supervised Speech Models for Speaker Verification and Anti-Spoofing
by: Peng, Junyi, et al.
Published: (2025)

Analysis of Self-Supervised Speech Models on Children's Speech and Infant Vocalizations
by: Li, Jialu, et al.
Published: (2024)

Fast Word Error Rate Estimation Using Self-Supervised Representations for Speech and Text
by: Park, Chanho, et al.
Published: (2023)

Low-Resourced Speech Recognition for Iu Mien Language via Weakly-Supervised Phoneme-based Multilingual Pre-training
by: Dong, Lukuan, et al.
Published: (2024)

RepCodec: A Speech Representation Codec for Speech Tokenization
by: Huang, Zhichao, et al.
Published: (2023)

Speaker-Conditioned Phrase Break Prediction for Text-to-Speech with Phoneme-Level Pre-trained Language Model
by: Yang, Dong, et al.
Published: (2025)

Acoustic BPE for Speech Generation with Discrete Tokens
by: Shen, Feiyu, et al.
Published: (2023)

Objective Evaluation of Prosody and Intelligibility in Speech Synthesis via Conditional Prediction of Discrete Tokens
by: Ulgen, Ismail Rasim, et al.
Published: (2025)

Large Language Model Guided Decoding for Self-Supervised Speech Recognition
by: Cohen, Eyal, et al.
Published: (2025)

Multilingual Speech Recognition Using Discrete Tokens with a Two-step Training Strategy
by: Li, Zehan, et al.
Published: (2025)

Africa-Centric Self-Supervised Pre-Training for Multilingual Speech Representation in a Sub-Saharan Context
by: Caubrière, Antoine, et al.
Published: (2024)