| Main Authors: | Wang, Mingxuan, Nakamura, Satoshi |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2510.06201 |
Similar Items
RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis
by: Xin, Detai, et al.
Published: (2024)
LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization
by: Jo, Daejin, et al.
Published: (2025)
Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction
by: Kim, Minchan, et al.
Published: (2024)
Scaling Speech Tokenizers with Diffusion Autoencoders
by: Wang, Yuancheng, et al.
Published: (2026)
dMel: Speech Tokenization made Simple
by: Bai, Richard He, et al.
Published: (2024)
Discrete Audio Tokens: More Than a Survey!
by: Mousavi, Pooneh, et al.
Published: (2025)
PAST: Phonetic-Acoustic Speech Tokenizer
by: Har-Tuv, Nadav, et al.
Published: (2025)
SpeechPrune: Context-aware Token Pruning for Speech Information Retrieval
by: Lin, Yueqian, et al.
Published: (2024)
DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models
by: Chang, Heng-Jui, et al.
Published: (2024)
DM-Codec: Distilling Multimodal Representations for Speech Tokenization
by: Ahasan, Md Mubtasim, et al.
Published: (2024)
NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction
by: Wang, Qichao, et al.
Published: (2025)
SeamlessExpressiveLM: Speech Language Model for Expressive Speech-to-Speech Translation with Chain-of-Thought
by: Gong, Hongyu, et al.
Published: (2024)
Discrete Speech Unit Extraction via Independent Component Analysis
by: Nakamura, Tomohiko, et al.
Published: (2025)
How Should We Extract Discrete Audio Tokens from Self-Supervised Models?
by: Mousavi, Pooneh, et al.
Published: (2024)
Unsupervised Speech Segmentation: A General Approach Using Speech Language Models
by: Elmakies, Avishai, et al.
Published: (2025)
STTATTS: Unified Speech-To-Text And Text-To-Speech Model
by: Toyin, Hawau Olamide, et al.
Published: (2024)
A Comparative Study of Discrete Speech Tokens for Semantic-Related Tasks with Large Language Models
by: Wang, Dingdong, et al.
Published: (2024)
VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation
by: Wang, Yuhao, et al.
Published: (2025)
Advancing Speech Understanding in Speech-Aware Language Models with GRPO
by: Elmakies, Avishai, et al.
Published: (2025)
TESU-LLM: Training Speech-LLMs Without Speech via Unified Encoder Alignment
by: Kim, Taesoo, et al.
Published: (2025)
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
by: Ju, Zeqian, et al.
Published: (2024)
DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation
by: Jia, Dongya, et al.
Published: (2025)
Token-Based Audio Inpainting via Discrete Diffusion
by: Dror, Tali, et al.
Published: (2025)
Rethinking Discrete Speech Representation Tokens for Accent Generation
by: Zhong, Jinzuomu, et al.
Published: (2026)
FlashSpeech: Efficient Zero-Shot Speech Synthesis
by: Ye, Zhen, et al.
Published: (2024)
Benchmarking Prosody Encoding in Discrete Speech Tokens
by: Onda, Kentaro, et al.
Published: (2025)
Autoregressive Diffusion Transformer for Text-to-Speech Synthesis
by: Liu, Zhijun, et al.
Published: (2024)
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
by: Peng, Puyuan, et al.
Published: (2024)
Large Language Models are Efficient Learners of Noise-Robust Speech Recognition
by: Hu, Yuchen, et al.
Published: (2024)
Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models
by: Hu, Yuchen, et al.
Published: (2024)
JEPA as a Neural Tokenizer: Learning Robust Speech Representations with Density Adaptive Attention
by: Ioannides, Georgios, et al.
Published: (2025)
High-Fidelity Speech Enhancement via Discrete Audio Tokens
by: Lanzendörfer, Luca A., et al.
Published: (2025)
SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
by: Zhang, Xin, et al.
Published: (2023)
Slamming: Training a Speech Language Model on One GPU in a Day
by: Maimon, Gallil, et al.
Published: (2025)
GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators
by: Hu, Yuchen, et al.
Published: (2024)
Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models
by: Hu, Yuchen, et al.
Published: (2024)
Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting
by: Yang, Chao-Han Huck, et al.
Published: (2023)
Aligner-Guided Training Paradigm: Advancing Text-to-Speech Models with Aligner Guided Duration
by: Lou, Haowei, et al.
Published: (2024)
Joint Fine-tuning and Conversion of Pretrained Speech and Language Models towards Linear Complexity
by: He, Mutian, et al.
Published: (2024)
On the Role of Speech Data in Reducing Toxicity Detection Bias
by: Bell, Samuel J., et al.
Published: (2024)