:: Library Catalog

Bibliographic Details
Main Authors:	Cheng, Ziyang; Wang, Yuhao; Liu, Heyang; Wu, Ronghua; Gu, Qunshan; Wang, Yanfeng; Wang, Yu
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Sound
Online Access:	https://arxiv.org/abs/2602.08607

Similar Items

VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation
by: Wang, Yuhao, et al.
Published: (2025)

VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction
by: Wang, Yuhao, et al.
Published: (2025)

SOVA-Bench: Benchmarking the Speech Conversation Ability for LLM-based Voice Assistant
by: Hou, Yixuan, et al.
Published: (2025)

VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models
by: Liu, Heyang, et al.
Published: (2025)

VocalBench-zh: Decomposing and Benchmarking the Speech Conversational Abilities in Mandarin Context
by: Liu, Heyang, et al.
Published: (2025)

CS3-Bench: Evaluating and Enhancing Speech-to-Speech LLMs for Mandarin-English Code-Switching
by: Liu, Heyang, et al.
Published: (2025)

LaSR: Context-Aware Speech Recognition via Latent Reasoning
by: Liu, Heyang, et al.
Published: (2026)

Post-decoder Biasing for End-to-End Speech Recognition of Multi-turn Medical Interview
by: Liu, Heyang, et al.
Published: (2024)

NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations
by: Xue, Liumeng, et al.
Published: (2026)

VocalBench-DF: A Benchmark for Evaluating Speech LLM Robustness to Disfluency
by: Liu, Hongcheng, et al.
Published: (2025)

Towards an End-to-End Framework for Invasive Brain Signal Decoding with Large Language Models
by: Feng, Sheng, et al.
Published: (2024)

Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition
by: Wang, Yujin, et al.
Published: (2022)

Decoding Linguistic Representations of Human Brain
by: Wang, Yu, et al.
Published: (2024)

Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling
by: Zheng, Qixi, et al.
Published: (2025)

UniVocal: Unified Speech-Singing Code-Switching Synthesis
by: Shi, Yufei, et al.
Published: (2026)

StreamFlow: Streaming Flow Matching with Block-wise Guided Attention Mask for Speech Token Decoding
by: Guo, Dake, et al.
Published: (2025)

Ti-Audio: The First Multi-Dialectal End-to-End Speech LLM for Tibetan
by: Wang, Jialing, et al.
Published: (2026)

Analysis of Self-Supervised Speech Models on Children's Speech and Infant Vocalizations
by: Li, Jialu, et al.
Published: (2024)

LaDA-Band: Language Diffusion Models for Vocal-to-Accompaniment Generation
by: Wang, Qi, et al.
Published: (2026)

Extract and Diffuse: Latent Integration for Improved Diffusion-based Speech and Vocal Enhancement
by: Yang, Yudong, et al.
Published: (2024)

On the Distillation Loss Functions of Speech VAE for Unified Reconstruction, Understanding, and Generation
by: Cheng, Changhao, et al.
Published: (2026)

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
by: Wang, Xinsheng, et al.
Published: (2025)

NV-Bench: Benchmark of Nonverbal Vocalization Synthesis for Expressive Text-to-Speech Generation
by: Ni, Qinke, et al.
Published: (2026)

REST: Diffusion-based Real-time End-to-end Streaming Talking Head Generation via ID-Context Caching and Asynchronous Streaming Distillation
by: Wang, Haotian, et al.
Published: (2025)

LCM-SVC: Latent Diffusion Model Based Singing Voice Conversion with Inference Acceleration via Latent Consistency Distillation
by: Chen, Shihao, et al.
Published: (2024)

NVSpeech: An Integrated and Scalable Pipeline for Human-Like Speech Modeling with Paralinguistic Vocalizations
by: Liao, Huan, et al.
Published: (2025)

EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
by: Zhang, Yuhao, et al.
Published: (2025)

SpecASR: Accelerating LLM-based Automatic Speech Recognition via Speculative Decoding
by: Wei, Linye, et al.
Published: (2025)

FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation
by: Della Libera, Luca, et al.
Published: (2025)

Mel-RoFormer for Vocal Separation and Vocal Melody Transcription
by: Wang, Ju-Chiang, et al.
Published: (2024)

LLaDA-TTS: Unifying Speech Synthesis and Zero-Shot Editing via Masked Diffusion Modeling
by: Fan, Xiaoyu, et al.
Published: (2026)

Affectron: Emotional Speech Synthesis with Affective and Contextually Aligned Nonverbal Vocalizations
by: Cho, Deok-Hyeon, et al.
Published: (2026)

SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization
by: Chen, Wenxi, et al.
Published: (2025)

Selective Masking Adversarial Attack on Automatic Speech Recognition Systems
by: Fang, Zheng, et al.
Published: (2025)

MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation
by: Chen, Szu-Chi, et al.
Published: (2026)

Efficient Streaming LLM for Speech Recognition
by: Jia, Junteng, et al.
Published: (2024)

Unsupervised Multi-channel Speech Dereverberation via Diffusion
by: Wu, Yulun, et al.
Published: (2025)

Coding Speech through Vocal Tract Kinematics
by: Cho, Cheol Jun, et al.
Published: (2024)

From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench
by: Xu, Ke, et al.
Published: (2026)

AzeroS: Extending LLM to Speech with Self-Generated Instruction-Free Tuning
by: Shao, Yiwen, et al.
Published: (2025)