| Main Authors: | Zhang, Huan, Maezawa, Akira, Dixon, Simon |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2502.07711 |
Similar Items
LLaQo: Towards a Query-Based Coach in Expressive Music Performance Assessment
by: Zhang, Huan, et al.
Published: (2024)
Disentangling Score Content and Performance Style for Joint Piano Rendering and Transcription
by: Zeng, Wei, et al.
Published: (2025)
A Study on Synthesizing Expressive Violin Performances: Approaches and Comparisons
by: Hung, Tzu-Yun, et al.
Published: (2024)
Enhancing Expressiveness in Dance Generation via Integrating Frequency and Music Style Information
by: Huang, Qiaochu, et al.
Published: (2024)
MusicMagus: Zero-Shot Text-to-Music Editing via Diffusion Models
by: Zhang, Yixiao, et al.
Published: (2024)
Exploring Classical Piano Performance Generation with Expressive Music Variational AutoEncoder
by: Luo, Jing, et al.
Published: (2025)
FastTalker: Jointly Generating Speech and Conversational Gestures from Text
by: Guo, Zixin, et al.
Published: (2024)
Gesture2Speech: How Far Can Hand Movements Shape Expressive Speech?
by: Kumar, Lokesh, et al.
Published: (2026)
HybridVC: Efficient Voice Style Conversion with Text and Audio Prompts
by: Niu, Xinlei, et al.
Published: (2024)
pTSE-T: Presentation Target Speaker Extraction using Unaligned Text Cues
by: Jiang, Ziyang, et al.
Published: (2024)
Intelligent Text-Conditioned Music Generation
by: Xie, Zhouyao, et al.
Published: (2024)
MeloTrans: A Text to Symbolic Music Generation Model Following Human Composition Habit
by: Wang, Yutian, et al.
Published: (2024)
SteerMusic: Enhanced Musical Consistency for Zero-shot Text-guided and Personalized Music Editing
by: Niu, Xinlei, et al.
Published: (2025)
Towards Expressive Video Dubbing with Multiscale Multimodal Context Interaction
by: Zhao, Yuan, et al.
Published: (2024)
Rhythmic Foley: A Framework For Seamless Audio-Visual Alignment In Video-to-Audio Synthesis
by: Huang, Zhiqi, et al.
Published: (2024)
Voice Evaluation of Reasoning Ability: Diagnosing the Modality-Induced Performance Gap
by: Lin, Yueqian, et al.
Published: (2025)
Video-Guided Text-to-Music Generation Using Public Domain Movie Collections
by: Kim, Haven, et al.
Published: (2025)
MuseAgent-1: Interactive Grounded Multimodal Understanding of Music Scores and Performance Audio
by: Zhao, Qihao, et al.
Published: (2026)
PerformSinger: Multimodal Singing Voice Synthesis Leveraging Synchronized Lip Cues from Singing Performance Videos
by: Gu, Ke, et al.
Published: (2025)
SyncFlow: Toward Temporally Aligned Joint Audio-Video Generation from Text
by: Liu, Haohe, et al.
Published: (2024)
ecVoice: Audio Text Extraction and Optimization of Video Based on Idioms Similarity Replacement
by: Lin, Jinwei
Published: (2024)
CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection
by: Zang, Yongyi, et al.
Published: (2024)
Transformer-Based Rhythm Quantization of Performance MIDI Using Beat Annotations
by: Wachter, Maximilian, et al.
Published: (2026)
Efficient Adapter Tuning for Joint Singing Voice Beat and Downbeat Tracking with Self-supervised Learning Features
by: Deng, Jiajun, et al.
Published: (2025)
SAVGBench: Benchmarking Spatially Aligned Audio-Video Generation
by: Shimada, Kazuki, et al.
Published: (2024)
Flexible Control in Symbolic Music Generation via Musical Metadata
by: Han, Sangjun, et al.
Published: (2024)
Semi-Supervised Contrastive Learning for Controllable Video-to-Music Retrieval
by: Stewart, Shanti, et al.
Published: (2024)
EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing
by: Cong, Gaoxiang, et al.
Published: (2024)
pyAMPACT: A Score-Audio Alignment Toolkit for Performance Data Estimation and Multi-modal Processing
by: Devaney, Johanna, et al.
Published: (2024)
Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning
by: Zhang, Yixiao, et al.
Published: (2024)
Emotion-Aware Speech Generation with Character-Specific Voices for Comics
by: Qian, Zhiwen, et al.
Published: (2025)
SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement
by: Yang, Chenyu, et al.
Published: (2025)
Iola Walker: A Mobile Footfall Detection System for Music Composition
by: James, William B.
Published: (2025)
M3SD: Multi-modal, Multi-scenario and Multi-language Speaker Diarization Dataset
by: Wu, Shilong
Published: (2025)
Target Speech Diarization with Multimodal Prompts
by: Jiang, Yidi, et al.
Published: (2024)
MOS-FAD: Improving Fake Audio Detection Via Automatic Mean Opinion Score Prediction
by: Zhou, Wangjin, et al.
Published: (2024)
Dialogue Understandability: Why are we streaming movies with subtitles?
by: Martinez, Helard Becerra, et al.
Published: (2024)
A Toolkit for Joint Speaker Diarization and Identification with Application to Speaker-Attributed ASR
by: Morrone, Giovanni, et al.
Published: (2024)
StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech
by: Lou, Haowei, et al.
Published: (2024)
Seeing What You Say: Expressive Image Generation from Speech
by: Lee, Jiyoung, et al.
Published: (2025)