Staff View: MIDI-LLaMA: An Instruction-Following Multimodal LLM for Symbolic Music Understanding :: Library Catalog

Bibliographic Details
Main Authors:	Yang, Meng; McCormack, Jon; Llano, Maria Teresa; Su, Wanchao; Lei, Chao
Format:	Preprint
Published:	2026
Subjects:	Multimedia; Sound
Online Access:	https://arxiv.org/abs/2601.21740

_version_	1866911406863941632
author	Yang, Meng; McCormack, Jon; Llano, Maria Teresa; Su, Wanchao; Lei, Chao
author_facet	Yang, Meng; McCormack, Jon; Llano, Maria Teresa; Su, Wanchao; Lei, Chao
contents	Recent advances in multimodal large language models (MLLM) for audio music have demonstrated strong capabilities in music understanding, yet symbolic music, a fundamental representation of musical structure, remains unexplored. In this work, we introduce MIDI-LLaMA, the first instruction-following MLLM for symbolic music understanding. Our approach aligns the MIDI encoder MusicBERT and Llama-3-8B via a two-stage pipeline comprising feature alignment and instruction tuning. To support training, we design a scalable annotation pipeline that annotates GiantMIDI-Piano with fine-grained metadata, resulting in a MIDI-text dataset. Compared with the baseline trained on converting MIDI into ABC notation under the same instruction-tuning procedure, MIDI-LLaMA substantially outperforms in captioning and semantic alignment in question answering. Human evaluation further confirms the advantages of MIDI-LLaMA in music understanding, emotion recognition, creativity, and overall preference. These findings demonstrate that incorporating symbolic music into large language models enhances their capacity for musical understanding.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_21740
institution	arXiv
publishDate	2026
record_format	arxiv
title	MIDI-LLaMA: An Instruction-Following Multimodal LLM for Symbolic Music Understanding
topic	Multimedia; Sound
url	https://arxiv.org/abs/2601.21740

Similar Items