| Main Authors: | Nguyen-Phuoc, Long, Gaboriau, Renald, Delacroix, Dimitri, Navarro, Laurent |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2403.09451 |
Similar Items
PTSD-MDNN : Fusion tardive de réseaux de neurones profonds multimodaux pour la détection du trouble de stress post-traumatique
by: Nguyen-Phuoc, Long, et al.
Published: (2024)
Efficient Audiovisual Speech Processing via MUTUD: Multimodal Training and Unimodal Deployment
by: Hong, Joanna, et al.
Published: (2025)
PerformSinger: Multimodal Singing Voice Synthesis Leveraging Synchronized Lip Cues from Singing Performance Videos
by: Gu, Ke, et al.
Published: (2025)
Face-Voice Association for Audiovisual Active Speaker Detection in Egocentric Recordings
by: Clarke, Jason, et al.
Published: (2025)
Cinematic Audio Source Separation Using Visual Cues
by: Zhang, Kang, et al.
Published: (2026)
Cross-Modal Watermarking for Authentic Audio Recovery and Tamper Localization in Synthesized Audiovisual Forgeries
by: Kim, Minyoung, et al.
Published: (2025)
Multimodal Fish Feeding Intensity Assessment in Aquaculture
by: Cui, Meng, et al.
Published: (2023)
Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning
by: Xu, Le, et al.
Published: (2025)
pTSE-T: Presentation Target Speaker Extraction using Unaligned Text Cues
by: Jiang, Ziyang, et al.
Published: (2024)
RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues
by: Pan, Tianrui, et al.
Published: (2024)
AGAV-Rater: Adapting Large Multimodal Model for AI-Generated Audio-Visual Quality Assessment
by: Cao, Yuqin, et al.
Published: (2025)
M$^{2}$UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models
by: Liu, Shansong, et al.
Published: (2023)
Towards Source Attribution of Singing Voice Deepfake with Multimodal Foundation Models
by: Phukan, Orchid Chetia, et al.
Published: (2025)
Semantically consistent Video-to-Audio Generation using Multimodal Language Large Model
by: Chen, Gehui, et al.
Published: (2024)
MoMuSE: Momentum Multi-modal Target Speaker Extraction for Real-time Scenarios with Impaired Visual Cues
by: Li, Junjie, et al.
Published: (2024)
M$^{3}$V: A multi-modal multi-view approach for Device-Directed Speech Detection
by: Wang, Anna, et al.
Published: (2024)
A Survey on Multimodal Music Emotion Recognition
by: Liyanarachchi, Rashini, et al.
Published: (2025)
Personality-Enhanced Multimodal Depression Detection in the Elderly
by: Wang, Honghong, et al.
Published: (2025)
Multimodal Emotion Recognition from Raw Audio with Sinc-convolution
by: Zhang, Xiaohui, et al.
Published: (2024)
ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting
by: Zhang, Yu, et al.
Published: (2025)
A Survey on Cross-Modal Interaction Between Music and Multimodal Data
by: Li, Sifei, et al.
Published: (2025)
MusFlow: Multimodal Music Generation via Conditional Flow Matching
by: Song, Jiahao, et al.
Published: (2025)
Learning What To Hear: Boosting Sound-Source Association For Robust Audiovisual Instance Segmentation
by: Seo, Jinbae, et al.
Published: (2025)
Efficient Feature Extraction and Late Fusion Strategy for Audiovisual Emotional Mimicry Intensity Estimation
by: Yu, Jun, et al.
Published: (2024)
M6: Multi-generator, Multi-domain, Multi-lingual and cultural, Multi-genres, Multi-instrument Machine-Generated Music Detection Databases
by: Li, Yupei, et al.
Published: (2024)
MuseAgent-1: Interactive Grounded Multimodal Understanding of Music Scores and Performance Audio
by: Zhao, Qihao, et al.
Published: (2026)
Enhancing Expressiveness in Dance Generation via Integrating Frequency and Music Style Information
by: Huang, Qiaochu, et al.
Published: (2024)
Audio-Visual Speech Separation via Bottleneck Iterative Network
by: Zhang, Sidong, et al.
Published: (2025)
Improving Speech Enhancement by Integrating Inter-Channel and Band Features with Dual-branch Conformer
by: Li, Jizhen, et al.
Published: (2024)
CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing
by: Yue, Xianghu, et al.
Published: (2024)
A Versatile Diffusion Transformer with Mixture of Noise Levels for Audiovisual Generation
by: Kim, Gwanghyun, et al.
Published: (2024)
Synchformer: Efficient Synchronization from Sparse Cues
by: Iashin, Vladimir, et al.
Published: (2024)
Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation
by: Wang, Baisen, et al.
Published: (2024)
MCDubber: Multimodal Context-Aware Expressive Video Dubbing
by: Zhao, Yuan, et al.
Published: (2024)
Video-Guided Foley Sound Generation with Multimodal Controls
by: Chen, Ziyang, et al.
Published: (2024)
Digit Recognition using Multimodal Spiking Neural Networks
by: Bjorndahl, William, et al.
Published: (2024)
Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning
by: Singh, Nikhil, et al.
Published: (2023)
Pilot-guided Multimodal Semantic Communication for Audio-Visual Event Localization
by: Yu, Fei, et al.
Published: (2024)
MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions
by: Chi, Xiaowei, et al.
Published: (2024)
Music Audio-Visual Question Answering Requires Specialized Multimodal Designs
by: You, Wenhao, et al.
Published: (2025)