| Main Authors: | Cheng, Ziyang, Wang, Yuhao, Liu, Heyang, Wu, Ronghua, Gu, Qunshan, Wang, Yanfeng, Wang, Yu |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2602.08607 |
Similar Items
VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation
by: Wang, Yuhao, et al.
Published: (2025)
VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction
by: Wang, Yuhao, et al.
Published: (2025)
SOVA-Bench: Benchmarking the Speech Conversation Ability for LLM-based Voice Assistant
by: Hou, Yixuan, et al.
Published: (2025)
VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models
by: Liu, Heyang, et al.
Published: (2025)
VocalBench-zh: Decomposing and Benchmarking the Speech Conversational Abilities in Mandarin Context
by: Liu, Heyang, et al.
Published: (2025)
CS3-Bench: Evaluating and Enhancing Speech-to-Speech LLMs for Mandarin-English Code-Switching
by: Liu, Heyang, et al.
Published: (2025)
LaSR: Context-Aware Speech Recognition via Latent Reasoning
by: Liu, Heyang, et al.
Published: (2026)
Post-decoder Biasing for End-to-End Speech Recognition of Multi-turn Medical Interview
by: Liu, Heyang, et al.
Published: (2024)
NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations
by: Xue, Liumeng, et al.
Published: (2026)
VocalBench-DF: A Benchmark for Evaluating Speech LLM Robustness to Disfluency
by: Liu, Hongcheng, et al.
Published: (2025)
Towards an End-to-End Framework for Invasive Brain Signal Decoding with Large Language Models
by: Feng, Sheng, et al.
Published: (2024)
Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition
by: Wang, Yujin, et al.
Published: (2022)
Decoding Linguistic Representations of Human Brain
by: Wang, Yu, et al.
Published: (2024)
Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling
by: Zheng, Qixi, et al.
Published: (2025)
UniVocal: Unified Speech-Singing Code-Switching Synthesis
by: Shi, Yufei, et al.
Published: (2026)
StreamFlow: Streaming Flow Matching with Block-wise Guided Attention Mask for Speech Token Decoding
by: Guo, Dake, et al.
Published: (2025)
Ti-Audio: The First Multi-Dialectal End-to-End Speech LLM for Tibetan
by: Wang, Jialing, et al.
Published: (2026)
Analysis of Self-Supervised Speech Models on Children's Speech and Infant Vocalizations
by: Li, Jialu, et al.
Published: (2024)
LaDA-Band: Language Diffusion Models for Vocal-to-Accompaniment Generation
by: Wang, Qi, et al.
Published: (2026)
Extract and Diffuse: Latent Integration for Improved Diffusion-based Speech and Vocal Enhancement
by: Yang, Yudong, et al.
Published: (2024)
On the Distillation Loss Functions of Speech VAE for Unified Reconstruction, Understanding, and Generation
by: Cheng, Changhao, et al.
Published: (2026)
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
by: Wang, Xinsheng, et al.
Published: (2025)
NV-Bench: Benchmark of Nonverbal Vocalization Synthesis for Expressive Text-to-Speech Generation
by: Ni, Qinke, et al.
Published: (2026)
REST: Diffusion-based Real-time End-to-end Streaming Talking Head Generation via ID-Context Caching and Asynchronous Streaming Distillation
by: Wang, Haotian, et al.
Published: (2025)
LCM-SVC: Latent Diffusion Model Based Singing Voice Conversion with Inference Acceleration via Latent Consistency Distillation
by: Chen, Shihao, et al.
Published: (2024)
NVSpeech: An Integrated and Scalable Pipeline for Human-Like Speech Modeling with Paralinguistic Vocalizations
by: Liao, Huan, et al.
Published: (2025)
EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
by: Zhang, Yuhao, et al.
Published: (2025)
SpecASR: Accelerating LLM-based Automatic Speech Recognition via Speculative Decoding
by: Wei, Linye, et al.
Published: (2025)
FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation
by: Della Libera, Luca, et al.
Published: (2025)
Mel-RoFormer for Vocal Separation and Vocal Melody Transcription
by: Wang, Ju-Chiang, et al.
Published: (2024)
LLaDA-TTS: Unifying Speech Synthesis and Zero-Shot Editing via Masked Diffusion Modeling
by: Fan, Xiaoyu, et al.
Published: (2026)
Affectron: Emotional Speech Synthesis with Affective and Contextually Aligned Nonverbal Vocalizations
by: Cho, Deok-Hyeon, et al.
Published: (2026)
SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization
by: Chen, Wenxi, et al.
Published: (2025)
Selective Masking Adversarial Attack on Automatic Speech Recognition Systems
by: Fang, Zheng, et al.
Published: (2025)
MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation
by: Chen, Szu-Chi, et al.
Published: (2026)
Efficient Streaming LLM for Speech Recognition
by: Jia, Junteng, et al.
Published: (2024)
Unsupervised Multi-channel Speech Dereverberation via Diffusion
by: Wu, Yulun, et al.
Published: (2025)
Coding Speech through Vocal Tract Kinematics
by: Cho, Cheol Jun, et al.
Published: (2024)
From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench
by: Xu, Ke, et al.
Published: (2026)
AzeroS: Extending LLM to Speech with Self-Generated Instruction-Free Tuning
by: Shao, Yiwen, et al.
Published: (2025)