Staff View: From Sound to Sight: Towards AI-authored Music Videos :: Library Catalog

Bibliographic Details
Main Authors:	Vitasovic, Leo; Graßhof, Stella; Kloft, Agnes Mercedes; Lehtola, Ville V.; Cunneen, Martin; Starostka, Justyna; McGarry, Glenn; Li, Kun; Brandt, Sami S.
Format:	Preprint
Published:	2025
Subjects:	Sound; Artificial Intelligence; Multimedia; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2509.00029

_version_	1866909760729645056
author	Vitasovic, Leo; Graßhof, Stella; Kloft, Agnes Mercedes; Lehtola, Ville V.; Cunneen, Martin; Starostka, Justyna; McGarry, Glenn; Li, Kun; Brandt, Sami S.
author_facet	Vitasovic, Leo; Graßhof, Stella; Kloft, Agnes Mercedes; Lehtola, Ville V.; Cunneen, Martin; Starostka, Justyna; McGarry, Glenn; Li, Kun; Brandt, Sami S.
contents	Conventional music visualisation systems rely on handcrafted ad hoc transformations of shapes and colours that offer only limited expressiveness. We propose two novel pipelines for automatically generating music videos from any user-specified, vocal or instrumental song using off-the-shelf deep learning models. Inspired by the manual workflows of music video producers, we experiment on how well latent feature-based techniques can analyse audio to detect musical qualities, such as emotional cues and instrumental patterns, and distil them into textual scene descriptions using a language model. Next, we employ a generative model to produce the corresponding video clips. To assess the generated videos, we identify several critical aspects and design and conduct a preliminary user evaluation that demonstrates storytelling potential, visual coherency and emotional alignment with the music. Our findings underscore the potential of latent feature techniques and deep generative models to expand music visualisation beyond traditional approaches.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_00029
institution	arXiv
publishDate	2025
record_format	arxiv
title	From Sound to Sight: Towards AI-authored Music Videos
topic	Sound; Artificial Intelligence; Multimedia; Audio and Speech Processing
url	https://arxiv.org/abs/2509.00029
