:: Library Catalog

Bibliographic Details
Main Authors:	Chen, Hao; Hou, Yuqi; Qu, Chenyuan; Testini, Irene; Hong, Xiaohan; Jiao, Jianbo
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Multimedia; Sound; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2404.00989

Similar Items

M$^{2}$UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models
by: Liu, Shansong, et al.
Published: (2023)

MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models
by: Liu, Shansong, et al.
Published: (2024)

pyAMPACT: A Score-Audio Alignment Toolkit for Performance Data Estimation and Multi-modal Processing
by: Devaney, Johanna, et al.
Published: (2024)

Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech
by: Niu, Xinlei, et al.
Published: (2025)

Efficient Video to Audio Mapper with Visual Scene Detection
by: Yi, Mingjing, et al.
Published: (2024)

Gotta Hear Them All: Towards Sound Source Aware Audio Generation
by: Guo, Wei, et al.
Published: (2024)

Tri-Ergon: Fine-grained Video-to-Audio Generation with Multi-modal Conditions and LUFS Control
by: Li, Bingliang, et al.
Published: (2024)

YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls
by: Chen, Zihao, et al.
Published: (2024)

Representation Learning for Semantic Alignment of Language, Audio, and Visual Modalities
by: Sudarsanam, Parthasaarathy, et al.
Published: (2025)

M$^{3}$V: A multi-modal multi-view approach for Device-Directed Speech Detection
by: Wang, Anna, et al.
Published: (2024)

Sonic4D: Spatial Audio Generation for Immersive 4D Scene Exploration
by: Xie, Siyi, et al.
Published: (2025)

MoMuSE: Momentum Multi-modal Target Speaker Extraction for Real-time Scenarios with Impaired Visual Cues
by: Li, Junjie, et al.
Published: (2024)

SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition
by: Wang, Hao, et al.
Published: (2024)

MMSD-Net: Towards Multi-modal Stuttering Detection
by: Nie, Liangyu, et al.
Published: (2024)

POLIPHONE: A Dataset for Smartphone Model Identification from Audio Recordings
by: Salvi, Davide, et al.
Published: (2024)

Jamendo-QA: A Large-Scale Music Question Answering Dataset
by: Koh, Junyoung, et al.
Published: (2025)

ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation
by: Shi, Jiatong, et al.
Published: (2025)

MINT: a Multi-modal Image and Narrative Text Dubbing Dataset for Foley Audio Content Planning and Generation
by: Fu, Ruibo, et al.
Published: (2024)

MusicSem: A Semantically Rich Language-Audio Dataset of Natural Music Descriptions
by: Salganik, Rebecca, et al.
Published: (2026)

M6: Multi-generator, Multi-domain, Multi-lingual and cultural, Multi-genres, Multi-instrument Machine-Generated Music Detection Databases
by: Li, Yupei, et al.
Published: (2024)

MuseAgent-1: Interactive Grounded Multimodal Understanding of Music Scores and Performance Audio
by: Zhao, Qihao, et al.
Published: (2026)

CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection
by: Zang, Yongyi, et al.
Published: (2024)

Estimating Indoor Scene Depth Maps from Ultrasonic Echoes
by: Honma, Junpei, et al.
Published: (2024)

AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models
by: Li, Wenyu, et al.
Published: (2025)

A Unified Framework for Modality-Agnostic Deepfakes Detection
by: Yu, Cai, et al.
Published: (2023)

Trusted Fake Audio Detection Based on Dirichlet Distribution
by: Ding, Chi, et al.
Published: (2025)

Song Aesthetics Evaluation with Multi-Stem Attention and Hierarchical Uncertainty Modeling
by: Lv, Yishan, et al.
Published: (2026)

Attentive-based Multi-level Feature Fusion for Voice Disorder Diagnosis
by: Shen, Lipeng, et al.
Published: (2024)

Multi-Track MusicLDM: Towards Versatile Music Generation with Latent Diffusion Model
by: Karchkhadze, Tornike, et al.
Published: (2024)

RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues
by: Pan, Tianrui, et al.
Published: (2024)

Video-Guided Text-to-Music Generation Using Public Domain Movie Collections
by: Kim, Haven, et al.
Published: (2025)

Building Audio-Visual Digital Twins with Smartphones
by: Lan, Zitong, et al.
Published: (2025)

DiffSSD: A Diffusion-Based Dataset For Speech Forensics
by: Bhagtani, Kratika, et al.
Published: (2024)

MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions
by: Chi, Xiaowei, et al.
Published: (2024)

Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark
by: Chen, Ziyang, et al.
Published: (2024)

Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models
by: Yang, Hao, et al.
Published: (2024)

Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models
by: Yang, Hao, et al.
Published: (2025)

MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, ASR Error Detection, and ASR Error Correction
by: He, Jiajun, et al.
Published: (2024)

Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning
by: Sun, Luoyi, et al.
Published: (2023)

Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model
by: Ren, Yong, et al.
Published: (2025)