| Main Authors: | Xie, Zhifei, Ma, Ziyang, Liu, Zihang, Pang, Kaiyu, Li, Hongyu, Zhang, Jialin, Liao, Yue, Ye, Deheng, Miao, Chunyan, Yan, Shuicheng |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2508.15827 |
Similar Items
Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation
by: Xie, Zhifei, et al.
Published: (2026)
Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models
by: Xie, Zhifei, et al.
Published: (2025)
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
by: Xie, Zhifei, et al.
Published: (2024)
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
by: Xie, Zhifei, et al.
Published: (2024)
MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model
by: Gong, Jingyao
Published: (2026)
Anonymization, Not Elimination: Utility-Preserved Speech Anonymization
by: Xiao, Yunchong, et al.
Published: (2026)
Speech-Omni-Lite: Portable Speech Interfaces for Vision-Language Models
by: Tao, Dehua, et al.
Published: (2026)
Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens
by: Zhao, Jinzheng, et al.
Published: (2024)
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
by: Wang, Xinsheng, et al.
Published: (2025)
LongCat-Audio-Codec: An Audio Tokenizer and Detokenizer Solution Designed for Speech Large Language Models
by: Zhao, Xiaohan, et al.
Published: (2025)
Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation
by: Li, Hanzhao, et al.
Published: (2024)
Acoustic BPE for Speech Generation with Discrete Tokens
by: Shen, Feiyu, et al.
Published: (2023)
LTS-VoiceAgent: A Listen-Think-Speak Framework for Efficient Streaming Voice Interaction via Semantic Triggering and Incremental Reasoning
by: Zou, Wenhao, et al.
Published: (2026)
Continuous Speech Tokenizer in Text To Speech
by: Li, Yixing, et al.
Published: (2024)
S2ST-Omni: Hierarchical Language-Aware SpeechLLM Adaptation for Multilingual Speech-to-Speech Translation
by: Pan, Yu, et al.
Published: (2025)
Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis
by: Wang, Tianrui, et al.
Published: (2025)
SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training
by: Chen, Wenxi, et al.
Published: (2024)
EmoOmni: Bridging Emotional Understanding and Expression in Omni-Modal LLMs
by: Tian, Wenjie, et al.
Published: (2026)
Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation
by: Guo, Haohan, et al.
Published: (2024)
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder
by: Zhang, Bowen, et al.
Published: (2025)
HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling
by: Wang, Chunhui, et al.
Published: (2024)
Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation
by: Liu, Wenrui, et al.
Published: (2025)
OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
by: Zhu, Han, et al.
Published: (2026)
Phone-purity Guided Discrete Tokens for Dysarthric Speech Recognition
by: Wang, Huimeng, et al.
Published: (2025)
Interleaved Speech-Text Language Models for Simple Streaming Text-to-Speech Synthesis
by: Yang, Yifan, et al.
Published: (2024)
Mobile Recording Device Recognition Based Cross-Scale and Multi-Level Representation Learning
by: Zeng, Chunyan, et al.
Published: (2024)
OmniCodec: Low Frame Rate Universal Audio Codec with Semantic-Acoustic Disentanglement
by: Hu, Jingbin, et al.
Published: (2026)
Token-Level Logits Matter: A Closer Look at Speech Foundation Models for Ambiguous Emotion Recognition
by: Halim, Jule Valendo, et al.
Published: (2025)
EmoSSLSphere: Multilingual Emotional Speech Synthesis with Spherical Vectors and Discrete Speech Tokens
by: Park, Joonyong, et al.
Published: (2025)
Who is Speaking or Who is Depressed? A Controlled Study of Speaker Leakage in Speech-Based Depression Detection
by: Yeh, Hsiang-Chen, et al.
Published: (2026)
JIS: A Speech Corpus of Japanese Idol Speakers with Various Speaking Styles
by: Kondo, Yuto, et al.
Published: (2025)
NAST: Noise Aware Speech Tokenization for Speech Language Models
by: Messica, Shoval, et al.
Published: (2024)
SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
by: Zhang, Xin, et al.
Published: (2023)
Thinking in Directivity: Speech Large Language Model for Multi-Talker Directional Speech Recognition
by: Xie, Jiamin, et al.
Published: (2025)
FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications
by: Guo, Hao-Han, et al.
Published: (2024)
DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching
by: Xie, Hanke, et al.
Published: (2025)
Seeing the Context: Rich Visual Context-Aware Speech Recognition via Multimodal Reasoning
by: Tian, Wenjie, et al.
Published: (2026)
Discrete Diffusion for Generative Modeling of Text-Aligned Speech Tokens
by: Ku, Pin-Jui, et al.
Published: (2025)
Efficient Long Speech Sequence Modelling for Time-Domain Depression Level Estimation
by: Li, Shuanglin, et al.
Published: (2025)
Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling
by: Zheng, Qixi, et al.
Published: (2025)