Library Catalog

Bibliographic Details
Main Authors:	You, Zuyao; Yu, Zhesong; Liu, Mingyu; Zhu, Bilei; Wan, Yuan; Wu, Zuxuan
Format:	Preprint
Published:	2026
Subjects:	Sound; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.00371

Similar Items

MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and Instruction Tuning
by: Zhao, Hang, et al.
Published: (2024)

VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding
by: Shi, Jiapeng, et al.
Published: (2026)

ByteComposer: a Human-like Melody Composition Method based on Language Model Agent
by: Liang, Xia, et al.
Published: (2024)

Seg-R1: Segmentation Can Be Surprisingly Simple with Reinforcement Learning
by: You, Zuyao, et al.
Published: (2025)

Voices of Civilizations: A Multilingual QA Benchmark for Global Music Understanding
by: Wu, Shangda, et al.
Published: (2026)

Video Echoed in Music: Semantic, Temporal, and Rhythmic Alignment for Video-to-Music Generation
by: Tong, Xinyi, et al.
Published: (2025)

Musical Score Understanding Benchmark: Evaluating Large Language Models' Comprehension of Complete Musical Scores
by: Dai, Congren, et al.
Published: (2025)

PianoBind: A Multimodal Joint Embedding Model for Pop-piano Music
by: Bang, Hayeon, et al.
Published: (2025)

Towards Effective Negation Modeling in Joint Audio-Text Models for Music
by: Vasilakis, Yannis, et al.
Published: (2026)

MusicLIME: Explainable Multimodal Music Understanding
by: Sotirou, Theodoros, et al.
Published: (2024)

Futga: Towards Fine-grained Music Understanding through Temporally-enhanced Generative Augmentation
by: Wu, Junda, et al.
Published: (2024)

Joint Audio and Symbolic Conditioning for Temporally Controlled Text-to-Music Generation
by: Tal, Or, et al.
Published: (2024)

M$^{2}$UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models
by: Liu, Shansong, et al.
Published: (2023)

The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models
by: You, Yuhuan, et al.
Published: (2026)

MIDI-LLaMA: An Instruction-Following Multimodal LLM for Symbolic Music Understanding
by: Yang, Meng, et al.
Published: (2026)

MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models
by: Liu, Shansong, et al.
Published: (2024)

MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models
by: Weck, Benno, et al.
Published: (2024)

Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding
by: Peng, Wujian, et al.
Published: (2023)

NOTA: Multimodal Music Notation Understanding for Visual Large Language Model
by: Tang, Mingni, et al.
Published: (2025)

CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models
by: Wu, Shangda, et al.
Published: (2024)

TALKPLAY: Multimodal Music Recommendation with Large Language Models
by: Doh, Seungheon, et al.
Published: (2025)

Towards Practical Real-Time Low-Latency Music Source Separation
by: Wu, Junyu, et al.
Published: (2025)

WeaveMuse: An Open Agentic System for Multimodal Music Understanding and Generation
by: Karystinaios, Emmanouil
Published: (2025)

ABC-Eval: Benchmarking Large Language Models on Symbolic Music Understanding and Instruction Following
by: Zhao, Jiahao, et al.
Published: (2025)

FOCUS: Towards Universal Foreground Segmentation
by: You, Zuyao, et al.
Published: (2025)

Can LLMs "Reason" in Music? An Evaluation of LLMs' Capability of Music Understanding and Generation
by: Zhou, Ziya, et al.
Published: (2024)

TinyMU: A Compact Audio-Language Model for Music Understanding
by: Li, Xiquan, et al.
Published: (2026)

SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing
by: Ma, Ziyang, et al.
Published: (2026)

Learning Accurate Segmentation Purely from Self-Supervision
by: You, Zuyao, et al.
Published: (2026)

DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning
by: Mao, Zhuoyuan, et al.
Published: (2025)

Versatile Symbolic Music-for-Music Modeling via Function Alignment
by: Jiang, Junyan, et al.
Published: (2025)

Evaluating Multimodal Large Language Models on Core Music Perception Tasks
by: Carone, Brandon James, et al.
Published: (2025)

Improving BERT for Symbolic Music Understanding Using Token Denoising and Pianoroll Prediction
by: Wang, Jun-You, et al.
Published: (2025)

A Survey of Foundation Models for Music Understanding
by: Li, Wenjun, et al.
Published: (2024)

Khala: Scaling Acoustic Token Language Models Toward High-Fidelity Music Generation
by: Liu, Jiafeng, et al.
Published: (2026)

XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework
by: Tian, Sida, et al.
Published: (2025)

Advancing the Foundation Model for Music Understanding
by: Jiang, Yi, et al.
Published: (2025)

NotaGen: Advancing Musicality in Symbolic Music Generation with Large Language Model Training Paradigms
by: Wang, Yashan, et al.
Published: (2025)

Speech-XL: Towards Long-Form Speech Understanding in Large Speech Language Models
by: Sun, Haoqin, et al.
Published: (2026)

MAJL: A Model-Agnostic Joint Learning Framework for Music Source Separation and Pitch Estimation
by: Wei, Haojie, et al.
Published: (2025)