:: Library Catalog

Bibliographic Details
Main Authors:	Xie, Zhifei; Ma, Ziyang; Liu, Zihang; Pang, Kaiyu; Li, Hongyu; Zhang, Jialin; Liao, Yue; Ye, Deheng; Miao, Chunyan; Yan, Shuicheng
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence; Machine Learning; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2508.15827

Similar Items

Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation
by: Xie, Zhifei, et al.
Published: (2026)

Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models
by: Xie, Zhifei, et al.
Published: (2025)

Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
by: Xie, Zhifei, et al.
Published: (2024)

Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
by: Xie, Zhifei, et al.
Published: (2024)

MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model
by: Gong, Jingyao
Published: (2026)

Anonymization, Not Elimination: Utility-Preserved Speech Anonymization
by: Xiao, Yunchong, et al.
Published: (2026)

Speech-Omni-Lite: Portable Speech Interfaces for Vision-Language Models
by: Tao, Dehua, et al.
Published: (2026)

Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens
by: Zhao, Jinzheng, et al.
Published: (2024)

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
by: Wang, Xinsheng, et al.
Published: (2025)

LongCat-Audio-Codec: An Audio Tokenizer and Detokenizer Solution Designed for Speech Large Language Models
by: Zhao, Xiaohan, et al.
Published: (2025)

Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation
by: Li, Hanzhao, et al.
Published: (2024)

Acoustic BPE for Speech Generation with Discrete Tokens
by: Shen, Feiyu, et al.
Published: (2023)

LTS-VoiceAgent: A Listen-Think-Speak Framework for Efficient Streaming Voice Interaction via Semantic Triggering and Incremental Reasoning
by: Zou, Wenhao, et al.
Published: (2026)

Continuous Speech Tokenizer in Text To Speech
by: Li, Yixing, et al.
Published: (2024)

S2ST-Omni: Hierarchical Language-Aware SpeechLLM Adaptation for Multilingual Speech-to-Speech Translation
by: Pan, Yu, et al.
Published: (2025)

Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis
by: Wang, Tianrui, et al.
Published: (2025)

SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training
by: Chen, Wenxi, et al.
Published: (2024)

EmoOmni: Bridging Emotional Understanding and Expression in Omni-Modal LLMs
by: Tian, Wenjie, et al.
Published: (2026)

Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation
by: Guo, Haohan, et al.
Published: (2024)

MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder
by: Zhang, Bowen, et al.
Published: (2025)

HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling
by: Wang, Chunhui, et al.
Published: (2024)

Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation
by: Liu, Wenrui, et al.
Published: (2025)

OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
by: Zhu, Han, et al.
Published: (2026)

Phone-purity Guided Discrete Tokens for Dysarthric Speech Recognition
by: Wang, Huimeng, et al.
Published: (2025)

Interleaved Speech-Text Language Models for Simple Streaming Text-to-Speech Synthesis
by: Yang, Yifan, et al.
Published: (2024)

Mobile Recording Device Recognition Based Cross-Scale and Multi-Level Representation Learning
by: Zeng, Chunyan, et al.
Published: (2024)

OmniCodec: Low Frame Rate Universal Audio Codec with Semantic-Acoustic Disentanglement
by: Hu, Jingbin, et al.
Published: (2026)

Token-Level Logits Matter: A Closer Look at Speech Foundation Models for Ambiguous Emotion Recognition
by: Halim, Jule Valendo, et al.
Published: (2025)

EmoSSLSphere: Multilingual Emotional Speech Synthesis with Spherical Vectors and Discrete Speech Tokens
by: Park, Joonyong, et al.
Published: (2025)

Who is Speaking or Who is Depressed? A Controlled Study of Speaker Leakage in Speech-Based Depression Detection
by: Yeh, Hsiang-Chen, et al.
Published: (2026)

JIS: A Speech Corpus of Japanese Idol Speakers with Various Speaking Styles
by: Kondo, Yuto, et al.
Published: (2025)

NAST: Noise Aware Speech Tokenization for Speech Language Models
by: Messica, Shoval, et al.
Published: (2024)

SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
by: Zhang, Xin, et al.
Published: (2023)

Thinking in Directivity: Speech Large Language Model for Multi-Talker Directional Speech Recognition
by: Xie, Jiamin, et al.
Published: (2025)

FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications
by: Guo, Hao-Han, et al.
Published: (2024)

DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching
by: Xie, Hanke, et al.
Published: (2025)

Seeing the Context: Rich Visual Context-Aware Speech Recognition via Multimodal Reasoning
by: Tian, Wenjie, et al.
Published: (2026)

Discrete Diffusion for Generative Modeling of Text-Aligned Speech Tokens
by: Ku, Pin-Jui, et al.
Published: (2025)

Efficient Long Speech Sequence Modelling for Time-Domain Depression Level Estimation
by: Li, Shuanglin, et al.
Published: (2025)

Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling
by: Zheng, Qixi, et al.
Published: (2025)