:: Library Catalog

Bibliographic Details
Main Authors:	Martinez, Helard Becerra; Ragano, Alessandro; Debnath, Diptasree; Ullah, Asad; Lucas, Crisron Rudolf; Walsh, Martin; Hines, Andrew
Format:	Preprint
Published:	2024
Subjects:	Audio and Speech Processing; Multimedia
Online Access:	https://arxiv.org/abs/2403.15336

Similar Items

Beyond Correlation: Evaluating Multimedia Quality Models with the Constrained Concordance Index
by: Ragano, Alessandro, et al.
Published: (2024)

Reduce, Reuse, Recycle: Is Perturbed Data better than Other Language augmentation for Low Resource Self-Supervised Speech Models
by: Ullah, Asad, et al.
Published: (2023)

NOMAD: Unsupervised Learning of Perceptual Embeddings for Speech Enhancement and Non-matching Reference Audio Quality Assessment
by: Ragano, Alessandro, et al.
Published: (2023)

SCOREQ: Speech Quality Assessment with Contrastive Regression
by: Ragano, Alessandro, et al.
Published: (2024)

MuseAgent-1: Interactive Grounded Multimodal Understanding of Music Scores and Performance Audio
by: Zhao, Qihao, et al.
Published: (2026)

M$^{2}$UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models
by: Liu, Shansong, et al.
Published: (2023)

MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models
by: Liu, Shansong, et al.
Published: (2024)

BINAQUAL: A Full-Reference Objective Localization Similarity Metric for Binaural Audio
by: Panah, Davoud Shariat, et al.
Published: (2025)

Binamix -- A Python Library for Generating Binaural Audio Datasets
by: Barry, Dan, et al.
Published: (2025)

StereoFoley: Object-Aware Stereo Audio Generation from Video
by: Karchkhadze, Tornike, et al.
Published: (2025)

Can Large Language Models Predict Audio Effects Parameters from Natural Language?
by: Doh, Seungheon, et al.
Published: (2025)

SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement
by: Yang, Chenyu, et al.
Published: (2025)

Target Speech Diarization with Multimodal Prompts
by: Jiang, Yidi, et al.
Published: (2024)

Iola Walker: A Mobile Footfall Detection System for Music Composition
by: James, William B.
Published: (2025)

LLaQo: Towards a Query-Based Coach in Expressive Music Performance Assessment
by: Zhang, Huan, et al.
Published: (2024)

M3SD: Multi-modal, Multi-scenario and Multi-language Speaker Diarization Dataset
by: Wu, Shilong
Published: (2025)

MOS-FAD: Improving Fake Audio Detection Via Automatic Mean Opinion Score Prediction
by: Zhou, Wangjin, et al.
Published: (2024)

RenderBox: Expressive Performance Rendering with Text Control
by: Zhang, Huan, et al.
Published: (2025)

A Toolkit for Joint Speaker Diarization and Identification with Application to Speaker-Attributed ASR
by: Morrone, Giovanni, et al.
Published: (2024)

A Survey of Foundation Models for Music Understanding
by: Li, Wenjun, et al.
Published: (2024)

WavChat: A Survey of Spoken Dialogue Models
by: Ji, Shengpeng, et al.
Published: (2024)

Can LLMs "Reason" in Music? An Evaluation of LLMs' Capability of Music Understanding and Generation
by: Zhou, Ziya, et al.
Published: (2024)

Improving BERT for Symbolic Music Understanding Using Token Denoising and Pianoroll Prediction
by: Wang, Jun-You, et al.
Published: (2025)

WavReward: Spoken Dialogue Models With Generalist Reward Evaluators
by: Ji, Shengpeng, et al.
Published: (2025)

Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions
by: Zhao, Jinzheng, et al.
Published: (2023)

Video-Guided Text-to-Music Generation Using Public Domain Movie Collections
by: Kim, Haven, et al.
Published: (2025)

PerformSinger: Multimodal Singing Voice Synthesis Leveraging Synchronized Lip Cues from Singing Performance Videos
by: Gu, Ke, et al.
Published: (2025)

Efficient Speech Watermarking for Speech Synthesis via Progressive Knowledge Distillation
by: Cui, Yang, et al.
Published: (2025)

Listen, Look, Drive: Coupling Audio Instructions for User-aware VLA-based Autonomous Driving
by: Guo, Ziang, et al.
Published: (2026)

ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation
by: Shi, Jiatong, et al.
Published: (2025)

SteerMusic: Enhanced Musical Consistency for Zero-shot Text-guided and Personalized Music Editing
by: Niu, Xinlei, et al.
Published: (2025)

Building Audio-Visual Digital Twins with Smartphones
by: Lan, Zitong, et al.
Published: (2025)

MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model
by: Gong, Jingyao
Published: (2026)

Dance2MIDI: Dance-driven multi-instruments music generation
by: Han, Bo, et al.
Published: (2023)

LCB-net: Long-Context Biasing for Audio-Visual Speech Recognition
by: Yu, Fan, et al.
Published: (2024)

Listening Between the Lines: Synthetic Speech Detection Disregarding Verbal Content
by: Salvi, Davide, et al.
Published: (2024)

LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition
by: Kwak, Doyeop, et al.
Published: (2026)

RVCBench: Benchmarking the Robustness of Voice Cloning Across Modern Audio Generation Models
by: Jin, Ruinan, et al.
Published: (2026)

M6: Multi-generator, Multi-domain, Multi-lingual and cultural, Multi-genres, Multi-instrument Machine-Generated Music Detection Databases
by: Li, Yupei, et al.
Published: (2024)

Multimodal Emotion Recognition from Raw Audio with Sinc-convolution
by: Zhang, Xiaohui, et al.
Published: (2024)