Library Catalog

Bibliographic Details
Main Authors:	Zhang, Huan; Maezawa, Akira; Dixon, Simon
Format:	Preprint
Published:	2025
Subjects:	Audio and Speech Processing; Multimedia
Online Access:	https://arxiv.org/abs/2502.07711

Similar Items

LLaQo: Towards a Query-Based Coach in Expressive Music Performance Assessment
by: Zhang, Huan, et al.
Published: (2024)

Disentangling Score Content and Performance Style for Joint Piano Rendering and Transcription
by: Zeng, Wei, et al.
Published: (2025)

A Study on Synthesizing Expressive Violin Performances: Approaches and Comparisons
by: Hung, Tzu-Yun, et al.
Published: (2024)

Enhancing Expressiveness in Dance Generation via Integrating Frequency and Music Style Information
by: Huang, Qiaochu, et al.
Published: (2024)

MusicMagus: Zero-Shot Text-to-Music Editing via Diffusion Models
by: Zhang, Yixiao, et al.
Published: (2024)

Exploring Classical Piano Performance Generation with Expressive Music Variational AutoEncoder
by: Luo, Jing, et al.
Published: (2025)

FastTalker: Jointly Generating Speech and Conversational Gestures from Text
by: Guo, Zixin, et al.
Published: (2024)

Gesture2Speech: How Far Can Hand Movements Shape Expressive Speech?
by: Kumar, Lokesh, et al.
Published: (2026)

HybridVC: Efficient Voice Style Conversion with Text and Audio Prompts
by: Niu, Xinlei, et al.
Published: (2024)

pTSE-T: Presentation Target Speaker Extraction using Unaligned Text Cues
by: Jiang, Ziyang, et al.
Published: (2024)

Intelligent Text-Conditioned Music Generation
by: Xie, Zhouyao, et al.
Published: (2024)

MeloTrans: A Text to Symbolic Music Generation Model Following Human Composition Habit
by: Wang, Yutian, et al.
Published: (2024)

SteerMusic: Enhanced Musical Consistency for Zero-shot Text-guided and Personalized Music Editing
by: Niu, Xinlei, et al.
Published: (2025)

Towards Expressive Video Dubbing with Multiscale Multimodal Context Interaction
by: Zhao, Yuan, et al.
Published: (2024)

Rhythmic Foley: A Framework For Seamless Audio-Visual Alignment In Video-to-Audio Synthesis
by: Huang, Zhiqi, et al.
Published: (2024)

Voice Evaluation of Reasoning Ability: Diagnosing the Modality-Induced Performance Gap
by: Lin, Yueqian, et al.
Published: (2025)

Video-Guided Text-to-Music Generation Using Public Domain Movie Collections
by: Kim, Haven, et al.
Published: (2025)

MuseAgent-1: Interactive Grounded Multimodal Understanding of Music Scores and Performance Audio
by: Zhao, Qihao, et al.
Published: (2026)

PerformSinger: Multimodal Singing Voice Synthesis Leveraging Synchronized Lip Cues from Singing Performance Videos
by: Gu, Ke, et al.
Published: (2025)

SyncFlow: Toward Temporally Aligned Joint Audio-Video Generation from Text
by: Liu, Haohe, et al.
Published: (2024)

ecVoice: Audio Text Extraction and Optimization of Video Based on Idioms Similarity Replacement
by: Lin, Jinwei
Published: (2024)

CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection
by: Zang, Yongyi, et al.
Published: (2024)

Transformer-Based Rhythm Quantization of Performance MIDI Using Beat Annotations
by: Wachter, Maximilian, et al.
Published: (2026)

Efficient Adapter Tuning for Joint Singing Voice Beat and Downbeat Tracking with Self-supervised Learning Features
by: Deng, Jiajun, et al.
Published: (2025)

SAVGBench: Benchmarking Spatially Aligned Audio-Video Generation
by: Shimada, Kazuki, et al.
Published: (2024)

Flexible Control in Symbolic Music Generation via Musical Metadata
by: Han, Sangjun, et al.
Published: (2024)

Semi-Supervised Contrastive Learning for Controllable Video-to-Music Retrieval
by: Stewart, Shanti, et al.
Published: (2024)

EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing
by: Cong, Gaoxiang, et al.
Published: (2024)

pyAMPACT: A Score-Audio Alignment Toolkit for Performance Data Estimation and Multi-modal Processing
by: Devaney, Johanna, et al.
Published: (2024)

Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning
by: Zhang, Yixiao, et al.
Published: (2024)

Emotion-Aware Speech Generation with Character-Specific Voices for Comics
by: Qian, Zhiwen, et al.
Published: (2025)

SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement
by: Yang, Chenyu, et al.
Published: (2025)

Iola Walker: A Mobile Footfall Detection System for Music Composition
by: James, William B.
Published: (2025)

M3SD: Multi-modal, Multi-scenario and Multi-language Speaker Diarization Dataset
by: Wu, Shilong
Published: (2025)

Target Speech Diarization with Multimodal Prompts
by: Jiang, Yidi, et al.
Published: (2024)

MOS-FAD: Improving Fake Audio Detection Via Automatic Mean Opinion Score Prediction
by: Zhou, Wangjin, et al.
Published: (2024)

Dialogue Understandability: Why are we streaming movies with subtitles?
by: Martinez, Helard Becerra, et al.
Published: (2024)

A Toolkit for Joint Speaker Diarization and Identification with Application to Speaker-Attributed ASR
by: Morrone, Giovanni, et al.
Published: (2024)

StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech
by: Lou, Haowei, et al.
Published: (2024)

Seeing What You Say: Expressive Image Generation from Speech
by: Lee, Jiyoung, et al.
Published: (2025)