Library Catalog

Bibliographic Details
Main Authors:	Yang, Kang; Liang, Yifan; Liu, Fangkun; Xie, Zhenping; Zheng, Chengshi
Format:	Preprint
Published:	2025
Subjects:	Sound; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2509.25670

Similar Items

NaturalL2S: End-to-End High-quality Multispeaker Lip-to-Speech Synthesis with Differential Digital Signal Processing
by: Liang, Yifan, et al.
Published: (2025)

Towards Accurate Lip-to-Speech Synthesis in-the-Wild
by: Hegde, Sindhu, et al.
Published: (2024)

SwinLip: An Efficient Visual Speech Encoder for Lip Reading Using Swin Transformer
by: Park, Young-Hu, et al.
Published: (2025)

SLD-L2S: Hierarchical Subspace Latent Diffusion for High-Fidelity Lip to Speech Synthesis
by: Liang, Yifan, et al.
Published: (2026)

Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention
by: Li, Kai, et al.
Published: (2025)

Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech Translation
by: Goncalves, Lucas, et al.
Published: (2024)

AISHELL6-whisper: A Chinese Mandarin Audio-visual Whisper Speech Dataset with Speech Recognition Baselines
by: Li, Cancan, et al.
Published: (2025)

Contrastive Decoupled Representation Learning and Regularization for Speech-Preserving Facial Expression Manipulation
by: Chen, Tianshui, et al.
Published: (2025)

Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading
by: Luo, Songtao, et al.
Published: (2023)

Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis
by: Zhang, Zeyi, et al.
Published: (2024)

LipSody: Lip-to-Speech Synthesis with Enhanced Prosody Consistency
by: Lee, Jaejun, et al.
Published: (2026)

VoxAging: Continuously Tracking Speaker Aging with a Large-Scale Longitudinal Dataset in English and Mandarin
by: Ai, Zhiqi, et al.
Published: (2025)

Face-StyleSpeech: Enhancing Zero-shot Speech Synthesis from Face Images with Improved Face-to-Speech Mapping
by: Kang, Minki, et al.
Published: (2023)

Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge
by: Kim, Minsu, et al.
Published: (2023)

Towards Unified Co-Speech Gesture Generation via Hierarchical Implicit Periodicity Learning
by: Guo, Xin, et al.
Published: (2025)

DiVISe: Direct Visual-Input Speech Synthesis Preserving Speaker Characteristics And Intelligibility
by: Liu, Yifan, et al.
Published: (2025)

An Audio-Visual Speech Separation Model Inspired by Cortico-Thalamo-Cortical Circuits
by: Li, Kai, et al.
Published: (2022)

Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation
by: Park, Se Jin, et al.
Published: (2023)

PAVAS: Physics-Aware Video-to-Audio Synthesis
by: Hyun-Bin, Oh, et al.
Published: (2025)

Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis
by: Liu, Qingyu, et al.
Published: (2025)

XM-ALIGN: Unified Cross-Modal Embedding Alignment for Face-Voice Association
by: Fang, Zhihua, et al.
Published: (2025)

Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction
by: Mu, Zhaoxi, et al.
Published: (2024)

Hierarchical Codec Diffusion for Video-to-Speech Generation
by: Ye, Jiaxin, et al.
Published: (2026)

Mechanisms of Multimodal Synchronization: Insights from Decoder-Based Video-Text-to-Speech Synthesis
by: Gupta, Akshita, et al.
Published: (2024)

GOMPSNR: Reflourish the Signal-to-Noise Ratio Metric for Audio Generation Tasks
by: Dai, Lingling, et al.
Published: (2026)

IMSE: Efficient U-Net-based Speech Enhancement using Inception Depthwise Convolution and Amplitude-Aware Linear Attention
by: Tang, Xinxin, et al.
Published: (2025)

EchoMask: Speech-Queried Attention-based Mask Modeling for Holistic Co-Speech Motion Generation
by: Zhang, Xiangyue, et al.
Published: (2025)

Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition
by: Haliassos, Alexandros, et al.
Published: (2026)

UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation
by: Wang, Jinting, et al.
Published: (2025)

Robust Audiovisual Speech Recognition Models with Mixture-of-Experts
by: Wu, Yihan, et al.
Published: (2024)

UMo: Unified Sparse Motion Modeling for Real-Time Co-Speech Avatars
by: Zhan, Xiaoyu, et al.
Published: (2026)

AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition
by: Liu, Zehua, et al.
Published: (2024)

TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis
by: Ton, Tri, et al.
Published: (2025)

MAG: Multi-Modal Aligned Autoregressive Co-Speech Gesture Generation without Vector Quantization
by: Liu, Binjie, et al.
Published: (2025)

DuoGesture: Neuro-Inspired and Biomechanically Informed Dual-Stream Co-Speech Gesture Generation
by: Paar, Ferdinand, et al.
Published: (2026)

Distillation-based Layer Dropping (DLD): Effective End-to-end Framework for Dynamic Speech Networks
by: Hannan, Abdul, et al.
Published: (2026)

Neural Vocoders as Speech Enhancers
by: Li, Andong, et al.
Published: (2025)

Leveraging Large Language Models in Visual Speech Recognition: Model Scaling, Context-Aware Decoding, and Iterative Polishing
by: Liu, Zehua, et al.
Published: (2025)

Speech Audio Generation from dynamic MRI via a Knowledge Enhanced Conditional Variational Autoencoder
by: Li, Yaxuan, et al.
Published: (2025)

Toward Fine-Grained Speech Inpainting Forensics: A Dataset, Method, and Metric for Multi-Region Tampering Localization
by: Vu, Tung, et al.
Published: (2026)