| Main Authors: | Yang, Kang, Liang, Yifan, Liu, Fangkun, Xie, Zhenping, Zheng, Chengshi |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.25670 |
Similar Items
NaturalL2S: End-to-End High-quality Multispeaker Lip-to-Speech Synthesis with Differential Digital Signal Processing
by: Liang, Yifan, et al.
Published: (2025)
Towards Accurate Lip-to-Speech Synthesis in-the-Wild
by: Hegde, Sindhu, et al.
Published: (2024)
SwinLip: An Efficient Visual Speech Encoder for Lip Reading Using Swin Transformer
by: Park, Young-Hu, et al.
Published: (2025)
SLD-L2S: Hierarchical Subspace Latent Diffusion for High-Fidelity Lip to Speech Synthesis
by: Liang, Yifan, et al.
Published: (2026)
Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention
by: Li, Kai, et al.
Published: (2025)
Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech Translation
by: Goncalves, Lucas, et al.
Published: (2024)
AISHELL6-whisper: A Chinese Mandarin Audio-visual Whisper Speech Dataset with Speech Recognition Baselines
by: Li, Cancan, et al.
Published: (2025)
Contrastive Decoupled Representation Learning and Regularization for Speech-Preserving Facial Expression Manipulation
by: Chen, Tianshui, et al.
Published: (2025)
Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading
by: Luo, Songtao, et al.
Published: (2023)
Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis
by: Zhang, Zeyi, et al.
Published: (2024)
LipSody: Lip-to-Speech Synthesis with Enhanced Prosody Consistency
by: Lee, Jaejun, et al.
Published: (2026)
VoxAging: Continuously Tracking Speaker Aging with a Large-Scale Longitudinal Dataset in English and Mandarin
by: Ai, Zhiqi, et al.
Published: (2025)
Face-StyleSpeech: Enhancing Zero-shot Speech Synthesis from Face Images with Improved Face-to-Speech Mapping
by: Kang, Minki, et al.
Published: (2023)
Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge
by: Kim, Minsu, et al.
Published: (2023)
Towards Unified Co-Speech Gesture Generation via Hierarchical Implicit Periodicity Learning
by: Guo, Xin, et al.
Published: (2025)
DiVISe: Direct Visual-Input Speech Synthesis Preserving Speaker Characteristics And Intelligibility
by: Liu, Yifan, et al.
Published: (2025)
An Audio-Visual Speech Separation Model Inspired by Cortico-Thalamo-Cortical Circuits
by: Li, Kai, et al.
Published: (2022)
Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation
by: Park, Se Jin, et al.
Published: (2023)
PAVAS: Physics-Aware Video-to-Audio Synthesis
by: Hyun-Bin, Oh, et al.
Published: (2025)
Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis
by: Liu, Qingyu, et al.
Published: (2025)
XM-ALIGN: Unified Cross-Modal Embedding Alignment for Face-Voice Association
by: Fang, Zhihua, et al.
Published: (2025)
Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction
by: Mu, Zhaoxi, et al.
Published: (2024)
Hierarchical Codec Diffusion for Video-to-Speech Generation
by: Ye, Jiaxin, et al.
Published: (2026)
Mechanisms of Multimodal Synchronization: Insights from Decoder-Based Video-Text-to-Speech Synthesis
by: Gupta, Akshita, et al.
Published: (2024)
GOMPSNR: Reflourish the Signal-to-Noise Ratio Metric for Audio Generation Tasks
by: Dai, Lingling, et al.
Published: (2026)
IMSE: Efficient U-Net-based Speech Enhancement using Inception Depthwise Convolution and Amplitude-Aware Linear Attention
by: Tang, Xinxin, et al.
Published: (2025)
EchoMask: Speech-Queried Attention-based Mask Modeling for Holistic Co-Speech Motion Generation
by: Zhang, Xiangyue, et al.
Published: (2025)
Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition
by: Haliassos, Alexandros, et al.
Published: (2026)
UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation
by: Wang, Jinting, et al.
Published: (2025)
Robust Audiovisual Speech Recognition Models with Mixture-of-Experts
by: Wu, Yihan, et al.
Published: (2024)
UMo: Unified Sparse Motion Modeling for Real-Time Co-Speech Avatars
by: Zhan, Xiaoyu, et al.
Published: (2026)
AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition
by: Liu, Zehua, et al.
Published: (2024)
TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis
by: Ton, Tri, et al.
Published: (2025)
MAG: Multi-Modal Aligned Autoregressive Co-Speech Gesture Generation without Vector Quantization
by: Liu, Binjie, et al.
Published: (2025)
DuoGesture: Neuro-Inspired and Biomechanically Informed Dual-Stream Co-Speech Gesture Generation
by: Paar, Ferdinand, et al.
Published: (2026)
Distillation-based Layer Dropping (DLD): Effective End-to-end Framework for Dynamic Speech Networks
by: Hannan, Abdul, et al.
Published: (2026)
Neural Vocoders as Speech Enhancers
by: Li, Andong, et al.
Published: (2025)
Leveraging Large Language Models in Visual Speech Recognition: Model Scaling, Context-Aware Decoding, and Iterative Polishing
by: Liu, Zehua, et al.
Published: (2025)
Speech Audio Generation from dynamic MRI via a Knowledge Enhanced Conditional Variational Autoencoder
by: Li, Yaxuan, et al.
Published: (2025)
Toward Fine-Grained Speech Inpainting Forensics: A Dataset, Method, and Metric for Multi-Region Tampering Localization
by: Vu, Tung, et al.
Published: (2026)