| Main Authors: | Dai, Yusheng, Wang, Chenxi, Li, Chang, Wang, Chen, Du, Jun, Li, Kewei, Wang, Ruoyu, Ma, Jiefeng, Sun, Lei, Gao, Jianqing |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.05130 |
Similar Items
A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition
by: Dai, Yusheng, et al.
Published: (2024)
Multi-Track MusicLDM: Towards Versatile Music Generation with Latent Diffusion Model
by: Karchkhadze, Tornike, et al.
Published: (2024)
MoMu-Diffusion: On Learning Long-Term Motion-Music Synchronization and Correspondence
by: You, Fuming, et al.
Published: (2024)
Contrastive Conditional Latent Diffusion for Audio-visual Segmentation
by: Mao, Yuxin, et al.
Published: (2023)
FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation
by: Jiang, Yuxuan, et al.
Published: (2025)
YuE: Scaling Open Foundation Models for Long-Form Music Generation
by: Yuan, Ruibin, et al.
Published: (2025)
LoVA: Long-form Video-to-Audio Generation
by: Cheng, Xin, et al.
Published: (2024)
Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners
by: Xing, Yazhou, et al.
Published: (2024)
Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer
by: Lei, Ke, et al.
Published: (2026)
EmoReg: Directional Latent Vector Modeling for Emotional Intensity Regularization in Diffusion-based Voice Conversion
by: Gudmalwar, Ashishkumar, et al.
Published: (2024)
LCB-net: Long-Context Biasing for Audio-Visual Speech Recognition
by: Yu, Fan, et al.
Published: (2024)
MeloTrans: A Text to Symbolic Music Generation Model Following Human Composition Habit
by: Wang, Yutian, et al.
Published: (2024)
Rhythmic Foley: A Framework For Seamless Audio-Visual Alignment In Video-to-Audio Synthesis
by: Huang, Zhiqi, et al.
Published: (2024)
DSCLAP: Domain-Specific Contrastive Language-Audio Pre-Training
by: Liu, Shengqiang, et al.
Published: (2024)
M$^{3}$V: A multi-modal multi-view approach for Device-Directed Speech Detection
by: Wang, Anna, et al.
Published: (2024)
SyncFlow: Toward Temporally Aligned Joint Audio-Video Generation from Text
by: Liu, Haohe, et al.
Published: (2024)
DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis
by: Tian, Wenjie, et al.
Published: (2025)
Multimodal Fish Feeding Intensity Assessment in Aquaculture
by: Cui, Meng, et al.
Published: (2023)
Zero-Shot Fake Video Detection by Audio-Visual Consistency
by: Li, Xiaolou, et al.
Published: (2024)
Human-Inspired Computing for Robust and Efficient Audio-Visual Speech Recognition
by: Liu, Qianhui, et al.
Published: (2024)
LatentSpeech: Latent Diffusion for Text-To-Speech Generation
by: Lou, Haowei, et al.
Published: (2024)
Robust Wake Word Spotting With Frame-Level Cross-Modal Attention Based Audio-Visual Conformer
by: Wang, Haoxu, et al.
Published: (2024)
JEPOO: Highly Accurate Joint Estimation of Pitch, Onset and Offset for Music Information Retrieval
by: Wei, Haojie, et al.
Published: (2023)
GACA-DiT: Diffusion-based Dance-to-Music Generation with Genre-Adaptive Rhythm and Context-Aware Alignment
by: Wang, Jinting, et al.
Published: (2025)
Improving BERT for Symbolic Music Understanding Using Token Denoising and Pianoroll Prediction
by: Wang, Jun-You, et al.
Published: (2025)
StarVC: A Unified Auto-Regressive Framework for Joint Text and Speech Generation in Voice Conversion
by: Li, Fengjin, et al.
Published: (2025)
FastTalker: Jointly Generating Speech and Conversational Gestures from Text
by: Guo, Zixin, et al.
Published: (2024)
Music-Aligned Holistic 3D Dance Generation via Hierarchical Motion Modeling
by: Li, Xiaojie, et al.
Published: (2025)
Quality-Aware End-to-End Audio-Visual Neural Speaker Diarization
by: He, Mao-Kui, et al.
Published: (2024)
A Unified Framework for Modality-Agnostic Deepfakes Detection
by: Yu, Cai, et al.
Published: (2023)
V2A-DPO: Omni-Preference Optimization for Video-to-Audio Generation
by: Chan, Nolan, et al.
Published: (2026)
Efficient Adapter Tuning for Joint Singing Voice Beat and Downbeat Tracking with Self-supervised Learning Features
by: Deng, Jiajun, et al.
Published: (2025)
MusFlow: Multimodal Music Generation via Conditional Flow Matching
by: Song, Jiahao, et al.
Published: (2025)
Disentangling Score Content and Performance Style for Joint Piano Rendering and Transcription
by: Zeng, Wei, et al.
Published: (2025)
Mel-Refine: A Plug-and-Play Approach to Refine Mel-Spectrogram in Audio Generation
by: Guo, Hongming, et al.
Published: (2024)
X-CrossNet: A complex spectral mapping approach to target speaker extraction with cross attention speaker embedding fusion
by: Sun, Chang, et al.
Published: (2024)
Personality-Enhanced Multimodal Depression Detection in the Elderly
by: Wang, Honghong, et al.
Published: (2025)
Combining Genre Classification and Harmonic-Percussive Features with Diffusion Models for Music-Video Generation
by: Pina, Leonardo, et al.
Published: (2024)
Voice Evaluation of Reasoning Ability: Diagnosing the Modality-Induced Performance Gap
by: Lin, Yueqian, et al.
Published: (2025)
Sound-VECaps: Improving Audio Generation with Visual Enhanced Captions
by: Yuan, Yi, et al.
Published: (2024)