| Main Authors: | Fang, Zhihua, Tao, Shumei, Wang, Junxu, He, Liang |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2512.06757 |
Similar Items
Shared Multi-modal Embedding Space for Face-Voice Association
by: Simic, Christopher, et al.
Published: (2025)
Face2VoiceSync: Lightweight Face-Voice Consistency for Text-Driven Talking Face Generation
by: Kang, Fang, et al.
Published: (2025)
ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling
by: Yang, Jianxuan, et al.
Published: (2026)
WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM
by: Tang, Changli, et al.
Published: (2025)
Art2Mus: Artwork-to-Music Generation via Visual Conditioning and Large-Scale Cross-Modal Alignment
by: Rinaldi, Ivan, et al.
Published: (2026)
AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition
by: Liu, Zehua, et al.
Published: (2024)
OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup
by: Cheng, Xize, et al.
Published: (2024)
A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning
by: Chen, Tianle, et al.
Published: (2026)
Rethinking Audio-Visual Adversarial Vulnerability from Temporal and Modality Perspectives
by: Zhang, Zeliang, et al.
Published: (2025)
DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization
by: Nguyen, Ngoc-Son, et al.
Published: (2026)
Enhancing Emotion Recognition in Incomplete Data: A Novel Cross-Modal Alignment, Reconstruction, and Refinement Framework
by: Sun, Haoqin, et al.
Published: (2024)
Enhancing Modal Fusion by Alignment and Label Matching for Multimodal Emotion Recognition
by: Li, Qifei, et al.
Published: (2024)
Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment
by: Sung-Bin, Kim, et al.
Published: (2024)
Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation
by: Chen, Yuheng, et al.
Published: (2026)
EgoAdapt: Enhancing Robustness in Egocentric Interactive Speaker Detection Under Missing Modalities
by: Qian, Xinyuan, et al.
Published: (2026)
UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation
by: Li, Hebeizi, et al.
Published: (2026)
From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech
by: Kim, Ji-Hoon, et al.
Published: (2025)
Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction
by: Guan, Kaisi, et al.
Published: (2025)
LTA-L2S: Lexical Tone-Aware Lip-to-Speech Synthesis for Mandarin with Cross-Lingual Transfer Learning
by: Yang, Kang, et al.
Published: (2025)
Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment
by: Liu, Chen, et al.
Published: (2025)
TeMuDance: Contrastive Alignment-Based Textual Control for Music-Driven Dance Generation
by: Liu, Xinran, et al.
Published: (2026)
Face-voice Association in Multilingual Environments (FAME) Challenge 2024 Evaluation Plan
by: Saeed, Muhammad Saad, et al.
Published: (2024)
Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation
by: Low, Chetwin, et al.
Published: (2025)
UMo: Unified Sparse Motion Modeling for Real-Time Co-Speech Avatars
by: Zhan, Xiaoyu, et al.
Published: (2026)
MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers
by: Mahmud, Tanvir, et al.
Published: (2024)
Recursive Joint Cross-Modal Attention for Multimodal Fusion in Dimensional Emotion Recognition
by: Praveen, R. Gnana, et al.
Published: (2024)
UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking
by: Chu, Xuangeng, et al.
Published: (2025)
Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition
by: Haliassos, Alexandros, et al.
Published: (2026)
Seeing Your Speech Style: A Novel Zero-Shot Identity-Disentanglement Face-based Voice Conversion
by: Rong, Yan, et al.
Published: (2024)
Cross-Modal Binary Attention: An Energy-Efficient Fusion Framework for Audio-Visual Learning
by: Saleh, Mohamed, et al.
Published: (2026)
Emotional Face-to-Speech
by: Ye, Jiaxin, et al.
Published: (2025)
Voice Pathology Detection Using Phonation
by: Siva, Sri Raksha, et al.
Published: (2025)
Hear Your Face: Face-based voice conversion with F0 estimation
by: Lee, Jaejun, et al.
Published: (2024)
JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation
by: Liu, Kai, et al.
Published: (2026)
AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers
by: Araujo, Edson, et al.
Published: (2026)
UniSync: A Unified Framework for Audio-Visual Synchronization
by: Feng, Tao, et al.
Published: (2025)
Omni2Sound: Towards Unified Video-Text-to-Audio Generation
by: Dai, Yusheng, et al.
Published: (2026)
MAG: Multi-Modal Aligned Autoregressive Co-Speech Gesture Generation without Vector Quantization
by: Liu, Binjie, et al.
Published: (2025)
Art2Mus: Bridging Visual Arts and Music through Cross-Modal Generation
by: Rinaldi, Ivan, et al.
Published: (2024)
LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV
by: Liu, Tengfei, et al.
Published: (2026)