| Main Authors: | Sung-Bin, Kim; Choi, Jeongsoo; Peng, Puyuan; Chung, Joon Son; Oh, Tae-Hyun; Harwath, David |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2504.02386 |
Similar Items
Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing
by: Choi, Jeongsoo, et al.
Published: (2025)
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
by: Peng, Puyuan, et al.
Published: (2024)
VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing
by: Zheng, Zhisheng, et al.
Published: (2025)
DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing
by: Sahipjohn, Neha, et al.
Published: (2024)
VoiceStar: Robust Zero-Shot Autoregressive TTS with Duration Control and Extrapolation
by: Peng, Puyuan, et al.
Published: (2025)
From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech
by: Kim, Ji-Hoon, et al.
Published: (2025)
PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing
by: Hong, Changi, et al.
Published: (2026)
Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding
by: Nguyen, Tan Dat, et al.
Published: (2024)
SyllableLM: Learning Coarse Semantic Units for Speech Language Models
by: Baade, Alan, et al.
Published: (2024)
LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition
by: Kwak, Doyeop, et al.
Published: (2026)
Probing the Robustness Properties of Neural Speech Codecs
by: Tseng, Wei-Cheng, et al.
Published: (2025)
Length Aware Speech Translation for Video Dubbing
by: Chadha, Harveen Singh, et al.
Published: (2025)
Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs
by: Tseng, Wei-Cheng, et al.
Published: (2025)
V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow
by: Choi, Jeongsoo, et al.
Published: (2024)
Towards Expressive Video Dubbing with Multiscale Multimodal Context Interaction
by: Zhao, Yuan, et al.
Published: (2024)
Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment
by: Choi, Jeongsoo, et al.
Published: (2025)
MCDubber: Multimodal Context-Aware Expressive Video Dubbing
by: Zhao, Yuan, et al.
Published: (2024)
EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing
by: Cong, Gaoxiang, et al.
Published: (2024)
DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis
by: Tian, Wenjie, et al.
Published: (2025)
SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model
by: Wang, Kaidi, et al.
Published: (2025)
BAT: Learning to Reason about Spatial Sounds with Large Language Models
by: Zheng, Zhisheng, et al.
Published: (2024)
Can Hierarchical Cross-Modal Fusion Predict Human Perception of AI Dubbed Content?
by: Dasare, Ashwini, et al.
Published: (2026)
ANIM-400K: A Large-Scale Dataset for Automated End-To-End Dubbing of Video
by: Cai, Kevin, et al.
Published: (2024)
Joint Multi-scale Cross-lingual Speaking Style Transfer with Bidirectional Attention Mechanism for Automatic Dubbing
by: Li, Jingbei, et al.
Published: (2023)
AdaptVC: High Quality Voice Conversion with Adaptive Learning
by: Kim, Jaehun, et al.
Published: (2025)
FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing
by: Cong, Gaoxiang, et al.
Published: (2025)
UNMIXX: Untangling Highly Correlated Singing Voices Mixtures
by: Jung, Jihoo, et al.
Published: (2026)
Aligning Sight and Sound: Advanced Sound Source Localization Through Audio-Visual Alignment
by: Senocak, Arda, et al.
Published: (2024)
CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech
by: Kim, Jaehyeon, et al.
Published: (2024)
Deep Dubbing: End-to-End Auto-Audiobook System with Text-to-Timbre and Context-Aware Instruct-TTS
by: Dai, Ziqi, et al.
Published: (2025)
Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing
by: Zhang, Zhedong, et al.
Published: (2025)
SCORE: Scaling audio generation using Standardized COmposite REwards
by: Jung, Jaemin, et al.
Published: (2025)
HILCodec: High-Fidelity and Lightweight Neural Audio Codec
by: Ahn, Sunghwan, et al.
Published: (2024)
Let There Be Sound: Reconstructing High Quality Speech from Silent Videos
by: Kim, Ji-Hoon, et al.
Published: (2023)
VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis
by: Jung, Jaemin, et al.
Published: (2024)
LP-CFM: Perceptual Invariance-Aware Conditional Flow Matching for Speech Modeling
by: Kwak, Doyeop, et al.
Published: (2025)
Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
by: Chen, Changan, et al.
Published: (2024)
Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment
by: Sung-Bin, Kim, et al.
Published: (2024)
VoxSim: A perceptual voice similarity dataset
by: Ahn, Junseok, et al.
Published: (2024)
EDNet: A Versatile Speech Enhancement Framework with Gating Mamba Mechanism and Phase Shift-Invariant Training
by: Kwak, Doyeop, et al.
Published: (2025)