| Main Authors: | Martinez, Helard Becerra; Ragano, Alessandro; Debnath, Diptasree; Ullah, Asad; Lucas, Crisron Rudolf; Walsh, Martin; Hines, Andrew |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2403.15336 |
Similar Items
Beyond Correlation: Evaluating Multimedia Quality Models with the Constrained Concordance Index
by: Ragano, Alessandro, et al.
Published: (2024)
Reduce, Reuse, Recycle: Is Perturbed Data better than Other Language augmentation for Low Resource Self-Supervised Speech Models
by: Ullah, Asad, et al.
Published: (2023)
NOMAD: Unsupervised Learning of Perceptual Embeddings for Speech Enhancement and Non-matching Reference Audio Quality Assessment
by: Ragano, Alessandro, et al.
Published: (2023)
SCOREQ: Speech Quality Assessment with Contrastive Regression
by: Ragano, Alessandro, et al.
Published: (2024)
MuseAgent-1: Interactive Grounded Multimodal Understanding of Music Scores and Performance Audio
by: Zhao, Qihao, et al.
Published: (2026)
M$^{2}$UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models
by: Liu, Shansong, et al.
Published: (2023)
MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models
by: Liu, Shansong, et al.
Published: (2024)
BINAQUAL: A Full-Reference Objective Localization Similarity Metric for Binaural Audio
by: Panah, Davoud Shariat, et al.
Published: (2025)
Binamix -- A Python Library for Generating Binaural Audio Datasets
by: Barry, Dan, et al.
Published: (2025)
StereoFoley: Object-Aware Stereo Audio Generation from Video
by: Karchkhadze, Tornike, et al.
Published: (2025)
Can Large Language Models Predict Audio Effects Parameters from Natural Language?
by: Doh, Seungheon, et al.
Published: (2025)
SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement
by: Yang, Chenyu, et al.
Published: (2025)
Target Speech Diarization with Multimodal Prompts
by: Jiang, Yidi, et al.
Published: (2024)
Iola Walker: A Mobile Footfall Detection System for Music Composition
by: James, William B.
Published: (2025)
LLaQo: Towards a Query-Based Coach in Expressive Music Performance Assessment
by: Zhang, Huan, et al.
Published: (2024)
M3SD: Multi-modal, Multi-scenario and Multi-language Speaker Diarization Dataset
by: Wu, Shilong
Published: (2025)
MOS-FAD: Improving Fake Audio Detection Via Automatic Mean Opinion Score Prediction
by: Zhou, Wangjin, et al.
Published: (2024)
RenderBox: Expressive Performance Rendering with Text Control
by: Zhang, Huan, et al.
Published: (2025)
A Toolkit for Joint Speaker Diarization and Identification with Application to Speaker-Attributed ASR
by: Morrone, Giovanni, et al.
Published: (2024)
A Survey of Foundation Models for Music Understanding
by: Li, Wenjun, et al.
Published: (2024)
WavChat: A Survey of Spoken Dialogue Models
by: Ji, Shengpeng, et al.
Published: (2024)
Can LLMs "Reason" in Music? An Evaluation of LLMs' Capability of Music Understanding and Generation
by: Zhou, Ziya, et al.
Published: (2024)
Improving BERT for Symbolic Music Understanding Using Token Denoising and Pianoroll Prediction
by: Wang, Jun-You, et al.
Published: (2025)
WavReward: Spoken Dialogue Models With Generalist Reward Evaluators
by: Ji, Shengpeng, et al.
Published: (2025)
Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions
by: Zhao, Jinzheng, et al.
Published: (2023)
Video-Guided Text-to-Music Generation Using Public Domain Movie Collections
by: Kim, Haven, et al.
Published: (2025)
PerformSinger: Multimodal Singing Voice Synthesis Leveraging Synchronized Lip Cues from Singing Performance Videos
by: Gu, Ke, et al.
Published: (2025)
Efficient Speech Watermarking for Speech Synthesis via Progressive Knowledge Distillation
by: Cui, Yang, et al.
Published: (2025)
Listen, Look, Drive: Coupling Audio Instructions for User-aware VLA-based Autonomous Driving
by: Guo, Ziang, et al.
Published: (2026)
ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation
by: Shi, Jiatong, et al.
Published: (2025)
SteerMusic: Enhanced Musical Consistency for Zero-shot Text-guided and Personalized Music Editing
by: Niu, Xinlei, et al.
Published: (2025)
Building Audio-Visual Digital Twins with Smartphones
by: Lan, Zitong, et al.
Published: (2025)
MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model
by: Gong, Jingyao
Published: (2026)
Dance2MIDI: Dance-driven multi-instruments music generation
by: Han, Bo, et al.
Published: (2023)
LCB-net: Long-Context Biasing for Audio-Visual Speech Recognition
by: Yu, Fan, et al.
Published: (2024)
Listening Between the Lines: Synthetic Speech Detection Disregarding Verbal Content
by: Salvi, Davide, et al.
Published: (2024)
LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition
by: Kwak, Doyeop, et al.
Published: (2026)
RVCBench: Benchmarking the Robustness of Voice Cloning Across Modern Audio Generation Models
by: Jin, Ruinan, et al.
Published: (2026)
M6: Multi-generator, Multi-domain, Multi-lingual and cultural, Multi-genres, Multi-instrument Machine-Generated Music Detection Databases
by: Li, Yupei, et al.
Published: (2024)
Multimodal Emotion Recognition from Raw Audio with Sinc-convolution
by: Zhang, Xiaohui, et al.
Published: (2024)