Bibliographic Details
Main Authors: Dai, Congren; Yang, Yue; Li, Krinos; Zhou, Huichi; Liang, Shijie; Zhang, Bo; Liu, Enyang; Jin, Ge; An, Hongran; Zhang, Haosen; Jing, Peiyuan; Lee, Kinhei; Zhang, Zhenxuan; Li, Xiaobing; Sun, Maosong
Format: Preprint
Published: 2025
Subjects: Sound; Artificial Intelligence
Online Access: https://arxiv.org/abs/2511.20697
author Dai, Congren
Yang, Yue
Li, Krinos
Zhou, Huichi
Liang, Shijie
Zhang, Bo
Liu, Enyang
Jin, Ge
An, Hongran
Zhang, Haosen
Jing, Peiyuan
Lee, Kinhei
Zhang, Zhenxuan
Li, Xiaobing
Sun, Maosong
contents Understanding complete musical scores entails integrated reasoning over pitch, rhythm, harmony, and large-scale structure, yet the ability of Large Language Models and Vision-Language Models to interpret full musical notation remains insufficiently examined. We introduce the Musical Score Understanding Benchmark (MSU-Bench), a human-curated benchmark for score-level musical understanding across textual (ABC notation) and visual (PDF) modalities. MSU-Bench contains 1,800 generative question-answer pairs from works by Bach, Beethoven, Chopin, Debussy, and others, organised into four levels of increasing difficulty, ranging from onset information to texture and form. Evaluations of more than fifteen state-of-the-art models, in both zero-shot and fine-tuned settings, reveal pronounced modality gaps, unstable level-wise performance, and challenges in maintaining multilevel correctness. Fine-tuning substantially improves results across modalities while preserving general knowledge, positioning MSU-Bench as a robust foundation for future research in multimodal reasoning. The benchmark and code are available at https://github.com/Congren-Dai/MSU-Bench.
format Preprint
id arxiv_https___arxiv_org_abs_2511_20697
institution arXiv
publishDate 2025
record_format arxiv
title Musical Score Understanding Benchmark: Evaluating Large Language Models' Comprehension of Complete Musical Scores
topic Sound
Artificial Intelligence
url https://arxiv.org/abs/2511.20697