Saved in:
| Main Authors: | Ge, Mengying, Li, Mingyang, Tang, Dongkai, Li, Pengbo, Liu, Kuo, Deng, Shuhao, Pu, Songbai, Liu, Long, Song, Yang, Zhang, Tao |
|---|---|
| Format: | Preprint |
| Published: |
2024
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2409.18971 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Similar Items
Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Immersive Audiobook Generation
by: Rong, Yan, et al.
Published: (2025)
by: Rong, Yan, et al.
Published: (2025)
Efficient Adapter Tuning for Joint Singing Voice Beat and Downbeat Tracking with Self-supervised Learning Features
by: Deng, Jiajun, et al.
Published: (2025)
by: Deng, Jiajun, et al.
Published: (2025)
Robust LLM-based Audio-Visual Speech Recognition with Sparse Modality Alignment and Visual Unit-Guided Refinement
by: Su, Fei, et al.
Published: (2026)
by: Su, Fei, et al.
Published: (2026)
Voice Evaluation of Reasoning Ability: Diagnosing the Modality-Induced Performance Gap
by: Lin, Yueqian, et al.
Published: (2025)
by: Lin, Yueqian, et al.
Published: (2025)
Sync-TVA: A Graph-Attention Framework for Multimodal Emotion Recognition with Cross-Modal Fusion
by: Deng, Zeyu, et al.
Published: (2025)
by: Deng, Zeyu, et al.
Published: (2025)
A Unified Framework for Modality-Agnostic Deepfakes Detection
by: Yu, Cai, et al.
Published: (2023)
by: Yu, Cai, et al.
Published: (2023)
EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing
by: Cong, Gaoxiang, et al.
Published: (2024)
by: Cong, Gaoxiang, et al.
Published: (2024)
A Survey on Cross-Modal Interaction Between Music and Multimodal Data
by: Li, Sifei, et al.
Published: (2025)
by: Li, Sifei, et al.
Published: (2025)
Robust Wake Word Spotting With Frame-Level Cross-Modal Attention Based Audio-Visual Conformer
by: Wang, Haoxu, et al.
Published: (2024)
by: Wang, Haoxu, et al.
Published: (2024)
DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis
by: Tian, Wenjie, et al.
Published: (2025)
by: Tian, Wenjie, et al.
Published: (2025)
SyncFlow: Toward Temporally Aligned Joint Audio-Video Generation from Text
by: Liu, Haohe, et al.
Published: (2024)
by: Liu, Haohe, et al.
Published: (2024)
A Survey on Multimodal Music Emotion Recognition
by: Liyanarachchi, Rashini, et al.
Published: (2025)
by: Liyanarachchi, Rashini, et al.
Published: (2025)
Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer
by: Lei, Ke, et al.
Published: (2026)
by: Lei, Ke, et al.
Published: (2026)
M$^{2}$UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models
by: Liu, Shansong, et al.
Published: (2023)
by: Liu, Shansong, et al.
Published: (2023)
MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models
by: Liu, Shansong, et al.
Published: (2024)
by: Liu, Shansong, et al.
Published: (2024)
Emotion-Aligned Contrastive Learning Between Images and Music
by: Stewart, Shanti, et al.
Published: (2023)
by: Stewart, Shanti, et al.
Published: (2023)
A Study on Synthesizing Expressive Violin Performances: Approaches and Comparisons
by: Hung, Tzu-Yun, et al.
Published: (2024)
by: Hung, Tzu-Yun, et al.
Published: (2024)
Stimulus Modality Matters: Impact of Perceptual Evaluations from Different Modalities on Speech Emotion Recognition System Performance
by: Chou, Huang-Cheng, et al.
Published: (2024)
by: Chou, Huang-Cheng, et al.
Published: (2024)
Zero-Shot Fake Video Detection by Audio-Visual Consistency
by: Li, Xiaolou, et al.
Published: (2024)
by: Li, Xiaolou, et al.
Published: (2024)
Separate Anything You Describe
by: Liu, Xubo, et al.
Published: (2023)
by: Liu, Xubo, et al.
Published: (2023)
Multimodal Emotion Recognition from Raw Audio with Sinc-convolution
by: Zhang, Xiaohui, et al.
Published: (2024)
by: Zhang, Xiaohui, et al.
Published: (2024)
DSCLAP: Domain-Specific Contrastive Language-Audio Pre-Training
by: Liu, Shengqiang, et al.
Published: (2024)
by: Liu, Shengqiang, et al.
Published: (2024)
M$^{3}$V: A multi-modal multi-view approach for Device-Directed Speech Detection
by: Wang, Anna, et al.
Published: (2024)
by: Wang, Anna, et al.
Published: (2024)
Zero-Shot End-to-End Spoken Language Understanding via Cross-Modal Selective Self-Training
by: He, Jianfeng, et al.
Published: (2023)
by: He, Jianfeng, et al.
Published: (2023)
Representation Learning for Semantic Alignment of Language, Audio, and Visual Modalities
by: Sudarsanam, Parthasaarathy, et al.
Published: (2025)
by: Sudarsanam, Parthasaarathy, et al.
Published: (2025)
Mel-Refine: A Plug-and-Play Approach to Refine Mel-Spectrogram in Audio Generation
by: Guo, Hongming, et al.
Published: (2024)
by: Guo, Hongming, et al.
Published: (2024)
FastTalker: Jointly Generating Speech and Conversational Gestures from Text
by: Guo, Zixin, et al.
Published: (2024)
by: Guo, Zixin, et al.
Published: (2024)
Addressing Emotion Bias in Music Emotion Recognition and Generation with Frechet Audio Distance
by: Li, Yuanchao, et al.
Published: (2024)
by: Li, Yuanchao, et al.
Published: (2024)
Cross-Modal Watermarking for Authentic Audio Recovery and Tamper Localization in Synthesized Audiovisual Forgeries
by: Kim, Minyoung, et al.
Published: (2025)
by: Kim, Minyoung, et al.
Published: (2025)
Multimodal Fish Feeding Intensity Assessment in Aquaculture
by: Cui, Meng, et al.
Published: (2023)
by: Cui, Meng, et al.
Published: (2023)
Human-Inspired Computing for Robust and Efficient Audio-Visual Speech Recognition
by: Liu, Qianhui, et al.
Published: (2024)
by: Liu, Qianhui, et al.
Published: (2024)
A Survey of Foundation Models for Music Understanding
by: Li, Wenjun, et al.
Published: (2024)
by: Li, Wenjun, et al.
Published: (2024)
Personality-Enhanced Multimodal Depression Detection in the Elderly
by: Wang, Honghong, et al.
Published: (2025)
by: Wang, Honghong, et al.
Published: (2025)
MuseAgent-1: Interactive Grounded Multimodal Understanding of Music Scores and Performance Audio
by: Zhao, Qihao, et al.
Published: (2026)
by: Zhao, Qihao, et al.
Published: (2026)
Leveraging LLM Embeddings for Cross Dataset Label Alignment and Zero Shot Music Emotion Prediction
by: Liu, Renhang, et al.
Published: (2024)
by: Liu, Renhang, et al.
Published: (2024)
M6: Multi-generator, Multi-domain, Multi-lingual and cultural, Multi-genres, Multi-instrument Machine-Generated Music Detection Databases
by: Li, Yupei, et al.
Published: (2024)
by: Li, Yupei, et al.
Published: (2024)
OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model
by: Li, Maomao, et al.
Published: (2026)
by: Li, Maomao, et al.
Published: (2026)
Efficient Video to Audio Mapper with Visual Scene Detection
by: Yi, Mingjing, et al.
Published: (2024)
by: Yi, Mingjing, et al.
Published: (2024)
An automatic mixing speech enhancement system for multi-track audio
by: Liu, Xiaojing, et al.
Published: (2024)
by: Liu, Xiaojing, et al.
Published: (2024)
Pay More Attention To Audio: Mitigating Imbalance of Cross-Modal Attention in Large Audio Language Models
by: Wang, Junyu, et al.
Published: (2025)
by: Wang, Junyu, et al.
Published: (2025)
Similar Items
-
Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Immersive Audiobook Generation
by: Rong, Yan, et al.
Published: (2025) -
Efficient Adapter Tuning for Joint Singing Voice Beat and Downbeat Tracking with Self-supervised Learning Features
by: Deng, Jiajun, et al.
Published: (2025) -
Robust LLM-based Audio-Visual Speech Recognition with Sparse Modality Alignment and Visual Unit-Guided Refinement
by: Su, Fei, et al.
Published: (2026) -
Voice Evaluation of Reasoning Ability: Diagnosing the Modality-Induced Performance Gap
by: Lin, Yueqian, et al.
Published: (2025) -
Sync-TVA: A Graph-Attention Framework for Multimodal Emotion Recognition with Cross-Modal Fusion
by: Deng, Zeyu, et al.
Published: (2025)