Bibliographic Details
Main Authors:	Li, Jiajun; Xu, Tianze; Chen, Xuesong; Yao, Xinrui; Liu, Shuchang
Format:	Preprint
Published:	2024
Subjects:	Sound; Artificial Intelligence; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2405.02801
Table of Contents:

In recent years, AI-Generated Content (AIGC) has witnessed rapid advancements, facilitating the creation of music, images, and other artistic forms across a wide range of industries. However, current models for image- and video-to-music synthesis struggle to capture the nuanced emotions and atmosphere conveyed by visual content. To fill this gap, we propose Mozart's Touch, a multi-modal music generation framework capable of generating music aligned with cross-modal inputs such as images, videos, and text. The framework consists of three key components: a Multi-modal Captioning Module, a Large Language Model (LLM) Understanding & Bridging Module, and a Music Generation Module. Unlike traditional end-to-end methods, Mozart's Touch uses LLMs to accurately interpret visual elements without requiring the training or fine-tuning of music generation models, providing efficiency and transparency through clear, interpretable prompts. We also introduce the "LLM-Bridge" method to resolve the heterogeneous representation challenges between descriptive texts from different modalities. Through a series of objective and subjective evaluations, we demonstrate that Mozart's Touch outperforms current state-of-the-art models. Our code and examples are available at https://github.com/TiffanyBlews/MozartsTouch.
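
The three-component structure described in the abstract can be pictured as a simple chain of stages: captioning, then LLM bridging, then music generation. The sketch below is only an illustration under stated assumptions and is not the authors' implementation. The class name `CaptionToMusicPipeline` and the three stub functions are hypothetical placeholders standing in for a captioning model, an LLM call, and a pretrained text-to-music model. It returns the intermediate prompt alongside the audio, since the abstract highlights interpretable prompts.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class CaptionToMusicPipeline:
    """Hypothetical three-stage pipeline; stage roles mirror the abstract."""
    caption: Callable[[str], List[str]]   # media path -> descriptive captions
    bridge: Callable[[List[str]], str]    # captions -> music-oriented text prompt
    generate: Callable[[str], bytes]      # text prompt -> audio bytes

    def run(self, media_path: str) -> Tuple[str, bytes]:
        captions = self.caption(media_path)
        prompt = self.bridge(captions)
        audio = self.generate(prompt)
        # Return the prompt as well, so the intermediate text can be inspected.
        return prompt, audio


# Placeholder stages (assumptions): real stages would call actual models.
def stub_caption(media_path: str) -> List[str]:
    return [f"a caption describing {media_path}"]


def stub_bridge(captions: List[str]) -> str:
    return "Music prompt: " + "; ".join(captions)


def stub_generate(prompt: str) -> bytes:
    return prompt.encode("utf-8")


pipeline = CaptionToMusicPipeline(stub_caption, stub_bridge, stub_generate)
prompt, audio = pipeline.run("beach.jpg")
print(prompt)
```

Because each stage is an ordinary callable, a pretrained generator can be swapped in without retraining anything, which is the property the abstract attributes to the framework.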

Similar Items