| Main Authors: | Chen, Hao, Hou, Yuqi, Qu, Chenyuan, Testini, Irene, Hong, Xiaohan, Jiao, Jianbo |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2404.00989 |
Similar Items
M$^{2}$UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models
by: Liu, Shansong, et al.
Published: (2023)
MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models
by: Liu, Shansong, et al.
Published: (2024)
pyAMPACT: A Score-Audio Alignment Toolkit for Performance Data Estimation and Multi-modal Processing
by: Devaney, Johanna, et al.
Published: (2024)
Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech
by: Niu, Xinlei, et al.
Published: (2025)
Efficient Video to Audio Mapper with Visual Scene Detection
by: Yi, Mingjing, et al.
Published: (2024)
Gotta Hear Them All: Towards Sound Source Aware Audio Generation
by: Guo, Wei, et al.
Published: (2024)
Tri-Ergon: Fine-grained Video-to-Audio Generation with Multi-modal Conditions and LUFS Control
by: Li, Bingliang, et al.
Published: (2024)
YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls
by: Chen, Zihao, et al.
Published: (2024)
Representation Learning for Semantic Alignment of Language, Audio, and Visual Modalities
by: Sudarsanam, Parthasaarathy, et al.
Published: (2025)
M$^{3}$V: A multi-modal multi-view approach for Device-Directed Speech Detection
by: Wang, Anna, et al.
Published: (2024)
Sonic4D: Spatial Audio Generation for Immersive 4D Scene Exploration
by: Xie, Siyi, et al.
Published: (2025)
MoMuSE: Momentum Multi-modal Target Speaker Extraction for Real-time Scenarios with Impaired Visual Cues
by: Li, Junjie, et al.
Published: (2024)
SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition
by: Wang, Hao, et al.
Published: (2024)
MMSD-Net: Towards Multi-modal Stuttering Detection
by: Nie, Liangyu, et al.
Published: (2024)
POLIPHONE: A Dataset for Smartphone Model Identification from Audio Recordings
by: Salvi, Davide, et al.
Published: (2024)
Jamendo-QA: A Large-Scale Music Question Answering Dataset
by: Koh, Junyoung, et al.
Published: (2025)
ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation
by: Shi, Jiatong, et al.
Published: (2025)
MINT: a Multi-modal Image and Narrative Text Dubbing Dataset for Foley Audio Content Planning and Generation
by: Fu, Ruibo, et al.
Published: (2024)
MusicSem: A Semantically Rich Language–Audio Dataset of Natural Music Descriptions
by: Salganik, Rebecca, et al.
Published: (2026)
M6: Multi-generator, Multi-domain, Multi-lingual and cultural, Multi-genres, Multi-instrument Machine-Generated Music Detection Databases
by: Li, Yupei, et al.
Published: (2024)
MuseAgent-1: Interactive Grounded Multimodal Understanding of Music Scores and Performance Audio
by: Zhao, Qihao, et al.
Published: (2026)
CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection
by: Zang, Yongyi, et al.
Published: (2024)
Estimating Indoor Scene Depth Maps from Ultrasonic Echoes
by: Honma, Junpei, et al.
Published: (2024)
AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models
by: Li, Wenyu, et al.
Published: (2025)
A Unified Framework for Modality-Agnostic Deepfakes Detection
by: Yu, Cai, et al.
Published: (2023)
Trusted Fake Audio Detection Based on Dirichlet Distribution
by: Ding, Chi, et al.
Published: (2025)
Song Aesthetics Evaluation with Multi-Stem Attention and Hierarchical Uncertainty Modeling
by: Lv, Yishan, et al.
Published: (2026)
Attentive-based Multi-level Feature Fusion for Voice Disorder Diagnosis
by: Shen, Lipeng, et al.
Published: (2024)
Multi-Track MusicLDM: Towards Versatile Music Generation with Latent Diffusion Model
by: Karchkhadze, Tornike, et al.
Published: (2024)
RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues
by: Pan, Tianrui, et al.
Published: (2024)
Video-Guided Text-to-Music Generation Using Public Domain Movie Collections
by: Kim, Haven, et al.
Published: (2025)
Building Audio-Visual Digital Twins with Smartphones
by: Lan, Zitong, et al.
Published: (2025)
DiffSSD: A Diffusion-Based Dataset For Speech Forensics
by: Bhagtani, Kratika, et al.
Published: (2024)
MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions
by: Chi, Xiaowei, et al.
Published: (2024)
Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark
by: Chen, Ziyang, et al.
Published: (2024)
Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models
by: Yang, Hao, et al.
Published: (2024)
Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models
by: Yang, Hao, et al.
Published: (2025)
MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, ASR Error Detection, and ASR Error Correction
by: He, Jiajun, et al.
Published: (2024)
Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning
by: Sun, Luoyi, et al.
Published: (2023)
Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model
by: Ren, Yong, et al.
Published: (2025)