:: Library Catalog

Bibliographic Details
Main Authors:	Li, Zongyi; Hu, Shujie; Liu, Shujie; Zhou, Long; Choi, Jeongsoo; Meng, Lingwei; Guo, Xun; Li, Jinyu; Ling, Hefei; Wei, Furu
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2410.20502

Similar Items

Autoregressive Speech Synthesis without Vector Quantization
by: Meng, Lingwei, et al.
Published: (2024)

Boosting Large Language Model for Speech Synthesis: An Empirical Study
by: Hao, Hongkun, et al.
Published: (2023)

V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow
by: Choi, Jeongsoo, et al.
Published: (2024)

WavLLM: Towards Robust and Adaptive Speech Large Language Model
by: Hu, Shujie, et al.
Published: (2024)

VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment
by: Han, Bing, et al.
Published: (2024)

StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling
by: Wang, Hui, et al.
Published: (2025)

FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching
by: Wang, Hui, et al.
Published: (2025)

Advanced Long-Content Speech Recognition With Factorized Neural Transducer
by: Gong, Xun, et al.
Published: (2024)

VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers
by: Chen, Sanyuan, et al.
Published: (2024)

Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis
by: Yang, Yifan, et al.
Published: (2025)

WavMark: Watermarking for Audio Generation
by: Chen, Guangyu, et al.
Published: (2023)

Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling
by: Sun, Haiyang, et al.
Published: (2025)

EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning
by: Wang, Dingdong, et al.
Published: (2026)

A proof for the conjecture on superlinear problems with Ambrosetti-Rabinowitz condition
by: Li, Chong, et al.
Published: (2026)

Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions
by: Meng, Lingwei, et al.
Published: (2024)

Risk Factors of Thrombocytopenia After Cardiac Surgery with Cardiopulmonary Bypass
by: Shujie Yan
Published: (2023)

AlignFormer: Modality Matching Can Achieve Better Zero-shot Instruction-Following Speech-LLM
by: Fan, Ruchao, et al.
Published: (2024)

Next Block Prediction: Video Generation via Semi-Autoregressive Modeling
by: Ren, Shuhuai, et al.
Published: (2025)

DBDH: A Dual-Branch Dual-Head Neural Network for Invisible Embedded Regions Localization
by: Zhao, Chengxin, et al.
Published: (2024)

Representation Alignment Contrastive Regularization for Multi-Object Tracking
by: Liu, Zhonglin, et al.
Published: (2024)

Long-range hopping in a quasiperiodic potential weakens the non-Hermitian skin effect
by: Peng, Dechi, et al.
Published: (2024)

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
by: Huang, Xun, et al.
Published: (2025)

Interleaved Speech-Text Language Models for Simple Streaming Text-to-Speech Synthesis
by: Yang, Yifan, et al.
Published: (2024)

Towards One-bit ASR: Extremely Low-bit Conformer Quantization Using Co-training and Stochastic Precision
by: Li, Zhaoqing, et al.
Published: (2025)

Privacy-Preserving Community Detection for Locally Distributed Multiple Networks
by: Guo, Xiao, et al.
Published: (2023)

Where Do Flow Semantics Reside? A Protocol-Native Tabular Pretraining Paradigm for Encrypted Traffic Classification
by: Huang, Sizhe, et al.
Published: (2026)

Generalization and Risk Bounds for Recurrent Neural Networks
by: Cheng, Xuewei, et al.
Published: (2024)

A joint modeling approach to treatment effects estimation with unmeasured confounders
by: Lee, Namhwa, et al.
Published: (2024)

Transfer Learning for High Dimensional Robust Regression
by: Yuan, Xiaohui, et al.
Published: (2024)

Learning to Identify Conflicts in RPKI
by: Schulmann, Haya, et al.
Published: (2025)

TASER: Task-Aware Spectral Energy Refine for Backdoor Suppression in UAV Swarms Decentralized Federated Learning
by: Huang, Sizhe, et al.
Published: (2026)

AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation
by: Choi, Jeongsoo, et al.
Published: (2025)

LongCat-AudioDiT: High-Fidelity Diffusion Text-to-Speech in the Waveform Latent Space
by: Xin, Detai, et al.
Published: (2026)

Thermodynamic modes of a quasiperiodic mobility-edge system in a quantum Otto cycle
by: Zhou, Ao, et al.
Published: (2026)

DEER: Draft with Diffusion, Verify with Autoregressive Models
by: Cheng, Zicong, et al.
Published: (2025)

LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation
by: Song, Wenhui, et al.
Published: (2025)

Wigner distribution, Wigner entropy, and Anomalous Transport of a Generalized Aubry-André model
by: Lu, Feng, et al.
Published: (2025)

Semantic-VAE: Semantic-Alignment Latent Representation for Better Speech Synthesis
by: Niu, Zhikang, et al.
Published: (2025)

DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation
by: Ye, Bo, et al.
Published: (2026)

A$^2$RD: Agentic Autoregressive Diffusion for Long Video Consistency
by: Long, Do Xuan, et al.
Published: (2026)