Library Catalog

Bibliographic Details
Main Authors:	Sung-Bin, Kim; Choi, Jeongsoo; Peng, Puyuan; Chung, Joon Son; Oh, Tae-Hyun; Harwath, David
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2504.02386

Similar Items

Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing
by: Choi, Jeongsoo, et al.
Published: (2025)

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
by: Peng, Puyuan, et al.
Published: (2024)

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing
by: Zheng, Zhisheng, et al.
Published: (2025)

DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing
by: Sahipjohn, Neha, et al.
Published: (2024)

VoiceStar: Robust Zero-Shot Autoregressive TTS with Duration Control and Extrapolation
by: Peng, Puyuan, et al.
Published: (2025)

From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech
by: Kim, Ji-Hoon, et al.
Published: (2025)

PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing
by: Hong, Changi, et al.
Published: (2026)

Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding
by: Nguyen, Tan Dat, et al.
Published: (2024)

SyllableLM: Learning Coarse Semantic Units for Speech Language Models
by: Baade, Alan, et al.
Published: (2024)

LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition
by: Kwak, Doyeop, et al.
Published: (2026)

Probing the Robustness Properties of Neural Speech Codecs
by: Tseng, Wei-Cheng, et al.
Published: (2025)

Length Aware Speech Translation for Video Dubbing
by: Chadha, Harveen Singh, et al.
Published: (2025)

Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs
by: Tseng, Wei-Cheng, et al.
Published: (2025)

V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow
by: Choi, Jeongsoo, et al.
Published: (2024)

Towards Expressive Video Dubbing with Multiscale Multimodal Context Interaction
by: Zhao, Yuan, et al.
Published: (2024)

Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment
by: Choi, Jeongsoo, et al.
Published: (2025)

MCDubber: Multimodal Context-Aware Expressive Video Dubbing
by: Zhao, Yuan, et al.
Published: (2024)

EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing
by: Cong, Gaoxiang, et al.
Published: (2024)

DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis
by: Tian, Wenjie, et al.
Published: (2025)

SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model
by: Wang, Kaidi, et al.
Published: (2025)

BAT: Learning to Reason about Spatial Sounds with Large Language Models
by: Zheng, Zhisheng, et al.
Published: (2024)

Can Hierarchical Cross-Modal Fusion Predict Human Perception of AI Dubbed Content?
by: Dasare, Ashwini, et al.
Published: (2026)

ANIM-400K: A Large-Scale Dataset for Automated End-To-End Dubbing of Video
by: Cai, Kevin, et al.
Published: (2024)

Joint Multi-scale Cross-lingual Speaking Style Transfer with Bidirectional Attention Mechanism for Automatic Dubbing
by: Li, Jingbei, et al.
Published: (2023)

AdaptVC: High Quality Voice Conversion with Adaptive Learning
by: Kim, Jaehun, et al.
Published: (2025)

FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing
by: Cong, Gaoxiang, et al.
Published: (2025)

UNMIXX: Untangling Highly Correlated Singing Voices Mixtures
by: Jung, Jihoo, et al.
Published: (2026)

Aligning Sight and Sound: Advanced Sound Source Localization Through Audio-Visual Alignment
by: Senocak, Arda, et al.
Published: (2024)

CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech
by: Kim, Jaehyeon, et al.
Published: (2024)

Deep Dubbing: End-to-End Auto-Audiobook System with Text-to-Timbre and Context-Aware Instruct-TTS
by: Dai, Ziqi, et al.
Published: (2025)

Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing
by: Zhang, Zhedong, et al.
Published: (2025)

SCORE: Scaling audio generation using Standardized COmposite REwards
by: Jung, Jaemin, et al.
Published: (2025)

HILCodec: High-Fidelity and Lightweight Neural Audio Codec
by: Ahn, Sunghwan, et al.
Published: (2024)

Let There Be Sound: Reconstructing High Quality Speech from Silent Videos
by: Kim, Ji-Hoon, et al.
Published: (2023)

VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis
by: Jung, Jaemin, et al.
Published: (2024)

LP-CFM: Perceptual Invariance-Aware Conditional Flow Matching for Speech Modeling
by: Kwak, Doyeop, et al.
Published: (2025)

Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
by: Chen, Changan, et al.
Published: (2024)

Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment
by: Sung-Bin, Kim, et al.
Published: (2024)

VoxSim: A perceptual voice similarity dataset
by: Ahn, Junseok, et al.
Published: (2024)

EDNet: A Versatile Speech Enhancement Framework with Gating Mamba Mechanism and Phase Shift-Invariant Training
by: Kwak, Doyeop, et al.
Published: (2025)