:: Library Catalog

Bibliographic Details
Main Authors:	Yang, Yang; Li, Yunpeng; Sung, George; Shih, Shao-Fu; Dooley, Craig; Centazzo, Alessio; Rajeswaran, Ramanan
Format:	Preprint
Published:	2025
Subjects:	Audio and Speech Processing; Machine Learning
Online Access:	https://arxiv.org/abs/2506.22362

Similar Items

StreamFlow: Streaming Flow Matching with Block-wise Guided Attention Mask for Speech Token Decoding
by: Guo, Dake, et al.
Published: (2025)

StreamVC: Real-Time Low-Latency Voice Conversion
by: Yang, Yang, et al.
Published: (2024)

Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens
by: Zhao, Jinzheng, et al.
Published: (2024)

Binaural Angular Separation Network
by: Yang, Yang, et al.
Published: (2024)

Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study
by: Chen, Peikun, et al.
Published: (2024)

DiffDSR: Dysarthric Speech Reconstruction Using Latent Diffusion Model
by: Chen, Xueyuan, et al.
Published: (2025)

Adapting Whisper for Streaming Speech Recognition via Two-Pass Decoding
by: Zhou, Haoran, et al.
Published: (2025)

Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation
by: Liu, Wenrui, et al.
Published: (2025)

DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models
by: Wang, Yuanyuan, et al.
Published: (2025)

Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder
by: Guo, Haohan, et al.
Published: (2024)

Discrete Diffusion for Generative Modeling of Text-Aligned Speech Tokens
by: Ku, Pin-Jui, et al.
Published: (2025)

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
by: Wang, Xinsheng, et al.
Published: (2025)

Diff-ETS: Learning a Diffusion Probabilistic Model for Electromyography-to-Speech Conversion
by: Ren, Zhao, et al.
Published: (2024)

Decoder-only Architecture for Streaming End-to-end Speech Recognition
by: Tsunoo, Emiru, et al.
Published: (2024)

Streaming Speech Recognition with Decoder-Only Large Language Models and Latency Optimization
by: Wan, Genshun, et al.
Published: (2026)

An Ultra-Low Latency, End-to-End Streaming Speech Synthesis Architecture via Block-Wise Generation and Depth-Wise Codec Decoding
by: Su, Tianhui, et al.
Published: (2026)

Diff-VS: Efficient Audio-Aware Diffusion U-Net for Vocals Separation
by: Yun-Ning, et al.
Published: (2026)

Interleaved Speech-Text Language Models for Simple Streaming Text-to-Speech Synthesis
by: Yang, Yifan, et al.
Published: (2024)

DiffRhythm 2: Efficient and High Fidelity Song Generation via Block Flow Matching
by: Jiang, Yuepeng, et al.
Published: (2025)

SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis
by: Guo, Haohan, et al.
Published: (2024)

Speech Synthesis From Continuous Features Using Per-Token Latent Diffusion
by: Turetzky, Arnon, et al.
Published: (2024)

Speech Loudness in Broadcasting and Streaming
by: Torcoli, Matteo, et al.
Published: (2024)

SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models
by: Yang, Dongchao, et al.
Published: (2024)

DiffSound: Differentiable Modal Sound Rendering and Inverse Rendering for Diverse Inference Tasks
by: Jin, Xutong, et al.
Published: (2024)

CodeSep: Low-Bitrate Codec-Driven Speech Separation with Base-Token Disentanglement and Auxiliary-Token Serial Prediction
by: Du, Hui-Peng, et al.
Published: (2026)

Chunkwise Aligners for Streaming Speech Recognition
by: Teo, Wen Shen, et al.
Published: (2026)

Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation
by: Shakeel, Muhammad, et al.
Published: (2024)

VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech
by: Du, Chenpeng, et al.
Published: (2024)

VoCodec: An Efficient Lightweight Low-Bitrate Speech Codec
by: Yang, Leyan, et al.
Published: (2026)

DiffCSS: Diverse and Expressive Conversational Speech Synthesis with Diffusion Models
by: Wu, Weihao, et al.
Published: (2025)

StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model
by: Guo, Shoutao, et al.
Published: (2025)

BEST-STD2.0: Balanced and Efficient Speech Tokenizer for Spoken Term Detection
by: Singh, Anup, et al.
Published: (2025)

Chunked Attention-based Encoder-Decoder Model for Streaming Speech Recognition
by: Zeineldeen, Mohammad, et al.
Published: (2023)

Efficient Speech Watermarking for Speech Synthesis via Progressive Knowledge Distillation
by: Cui, Yang, et al.
Published: (2025)

Multilingual Speech Recognition Using Discrete Tokens with a Two-step Training Strategy
by: Li, Zehan, et al.
Published: (2025)

Low-latency Speech Enhancement via Speech Token Generation
by: Xue, Huaying, et al.
Published: (2023)

Hierarchical Sparse Sound Field Reconstruction with Spherical and Linear Microphone Arrays
by: Xu, Shunxi, et al.
Published: (2025)

Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space
by: Ma, Zhengrui, et al.
Published: (2025)

DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation
by: Benita, Roi, et al.
Published: (2023)

StreamAAD: Decoding Spatial Auditory Attention with a Streaming Architecture
by: Qiu, Zelin, et al.
Published: (2024)