| Main Authors: | Han, Minglun, Bai, Ye, Shen, Chen, Huang, Youjia, Huang, Mingkun, Lin, Zehua, Dong, Linhao, Lu, Lu, Wang, Yuxuan |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2409.08680 |
Similar Items
Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition
by: Bai, Ye, et al.
Published: (2024)
BiRQ: Bi-Level Self-Labeling Random Quantization for Self-Supervised Speech Recognition
by: Jiang, Liuyuan, et al.
Published: (2025)
Self-Supervised Singing Voice Pre-Training towards Speech-to-Singing Conversion
by: Li, Ruiqi, et al.
Published: (2024)
OMAR-RQ: Open Music Audio Representation Model Trained with Multi-Feature Masked Token Prediction
by: Alonso-Jiménez, Pablo, et al.
Published: (2025)
NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks
by: Huang, He, et al.
Published: (2024)
Comparison of Self-Supervised Speech Pre-Training Methods on Flemish Dutch
by: Poncelet, Jakob, et al.
Published: (2021)
M-BEST-RQ: A Multi-Channel Speech Foundation Model for Smart Glasses
by: Yang, Yufeng, et al.
Published: (2024)
SA-SOT: Speaker-Aware Serialized Output Training for Multi-Talker ASR
by: Fan, Zhiyun, et al.
Published: (2024)
Exploring Prediction Targets in Masked Pre-Training for Speech Foundation Models
by: Chen, Li-Wei, et al.
Published: (2024)
Adaptive Federated Fine-Tuning of Self-Supervised Speech Representations
by: Guo, Xin, et al.
Published: (2026)
SA-WavLM: Speaker-Aware Self-Supervised Pre-training for Mixture Speech
by: Lin, Jingru, et al.
Published: (2024)
Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation
by: Liu, Wenrui, et al.
Published: (2025)
Comparing Self-Supervised Learning Models Pre-Trained on Human Speech and Animal Vocalizations for Bioacoustics Processing
by: Sarkar, Eklavya, et al.
Published: (2025)
Investigating Zero-Shot Generalizability on Mandarin-English Code-Switched ASR and Speech-to-text Translation of Recent Foundation Models with Self-Supervision and Weak Supervision
by: Yang, Chih-Kai, et al.
Published: (2023)
Next Tokens Denoising for Speech Synthesis
by: Liu, Yanqing, et al.
Published: (2025)
Multilingual Zero Resource Speech Recognition Base on Self-Supervise Pre-Trained Acoustic Models
by: Wang, Haoyu, et al.
Published: (2022)
Text-guided HuBERT: Self-Supervised Speech Pre-training via Generative Adversarial Networks
by: Ma, Duo, et al.
Published: (2024)
A Study of Data Selection Strategies for Pre-training Self-Supervised Speech Models
by: Whetten, Ryan, et al.
Published: (2026)
Low-latency Speech Enhancement via Speech Token Generation
by: Xue, Huaying, et al.
Published: (2023)
Accent Normalization Using Self-Supervised Discrete Tokens with Non-Parallel Data
by: Bai, Qibing, et al.
Published: (2025)
Ambisonizer: Neural Upmixing as Spherical Harmonics Generation
by: Zang, Yongyi, et al.
Published: (2024)
Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction
by: Yang, Shu-wen, et al.
Published: (2025)
Rethinking Mamba in Speech Processing by Self-Supervised Models
by: Zhang, Xiangyu, et al.
Published: (2024)
Self-Supervised Speech Quality Assessment (S3QA): Leveraging Speech Foundation Models for a Scalable Speech Quality Metric
by: Ogg, Mattson, et al.
Published: (2025)
Comparing Unsupervised and Supervised Semantic Speech Tokens: A Case Study of Child ASR
by: Shi, Mohan, et al.
Published: (2025)
Emotion-Coherent Speech Data Augmentation and Self-Supervised Contrastive Style Training for Enhancing Kids's Story Speech Synthesis
by: Chung, Raymond
Published: (2026)
Discrete Diffusion for Generative Modeling of Text-Aligned Speech Tokens
by: Ku, Pin-Jui, et al.
Published: (2025)
HYFuse: Aligning Heterogeneous Speech Pre-Trained Representations in Hyperbolic Space for Speech Emotion Recognition
by: Phukan, Orchid Chetia, et al.
Published: (2025)
SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
by: Zhang, Xin, et al.
Published: (2023)
Hybrid Pruning: In-Situ Compression of Self-Supervised Speech Models for Speaker Verification and Anti-Spoofing
by: Peng, Junyi, et al.
Published: (2025)
Analysis of Self-Supervised Speech Models on Children's Speech and Infant Vocalizations
by: Li, Jialu, et al.
Published: (2024)
Fast Word Error Rate Estimation Using Self-Supervised Representations for Speech and Text
by: Park, Chanho, et al.
Published: (2023)
Low-Resourced Speech Recognition for Iu Mien Language via Weakly-Supervised Phoneme-based Multilingual Pre-training
by: Dong, Lukuan, et al.
Published: (2024)
RepCodec: A Speech Representation Codec for Speech Tokenization
by: Huang, Zhichao, et al.
Published: (2023)
Speaker-Conditioned Phrase Break Prediction for Text-to-Speech with Phoneme-Level Pre-trained Language Model
by: Yang, Dong, et al.
Published: (2025)
Acoustic BPE for Speech Generation with Discrete Tokens
by: Shen, Feiyu, et al.
Published: (2023)
Objective Evaluation of Prosody and Intelligibility in Speech Synthesis via Conditional Prediction of Discrete Tokens
by: Ulgen, Ismail Rasim, et al.
Published: (2025)
Large Language Model Guided Decoding for Self-Supervised Speech Recognition
by: Cohen, Eyal, et al.
Published: (2025)
Multilingual Speech Recognition Using Discrete Tokens with a Two-step Training Strategy
by: Li, Zehan, et al.
Published: (2025)
Africa-Centric Self-Supervised Pre-Training for Multilingual Speech Representation in a Sub-Saharan Context
by: Caubrière, Antoine, et al.
Published: (2024)