| Main Authors: | You, Zuyao, Yu, Zhesong, Liu, Mingyu, Zhu, Bilei, Wan, Yuan, Wu, Zuxuan |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.00371 |
Similar Items
MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and Instruction Tuning
by: Zhao, Hang, et al.
Published: (2024)
VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding
by: Shi, Jiapeng, et al.
Published: (2026)
ByteComposer: a Human-like Melody Composition Method based on Language Model Agent
by: Liang, Xia, et al.
Published: (2024)
Seg-R1: Segmentation Can Be Surprisingly Simple with Reinforcement Learning
by: You, Zuyao, et al.
Published: (2025)
Voices of Civilizations: A Multilingual QA Benchmark for Global Music Understanding
by: Wu, Shangda, et al.
Published: (2026)
Video Echoed in Music: Semantic, Temporal, and Rhythmic Alignment for Video-to-Music Generation
by: Tong, Xinyi, et al.
Published: (2025)
Musical Score Understanding Benchmark: Evaluating Large Language Models' Comprehension of Complete Musical Scores
by: Dai, Congren, et al.
Published: (2025)
PianoBind: A Multimodal Joint Embedding Model for Pop-piano Music
by: Bang, Hayeon, et al.
Published: (2025)
Towards Effective Negation Modeling in Joint Audio-Text Models for Music
by: Vasilakis, Yannis, et al.
Published: (2026)
MusicLIME: Explainable Multimodal Music Understanding
by: Sotirou, Theodoros, et al.
Published: (2024)
Futga: Towards Fine-grained Music Understanding through Temporally-enhanced Generative Augmentation
by: Wu, Junda, et al.
Published: (2024)
Joint Audio and Symbolic Conditioning for Temporally Controlled Text-to-Music Generation
by: Tal, Or, et al.
Published: (2024)
M$^{2}$UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models
by: Liu, Shansong, et al.
Published: (2023)
The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models
by: You, Yuhuan, et al.
Published: (2026)
MIDI-LLaMA: An Instruction-Following Multimodal LLM for Symbolic Music Understanding
by: Yang, Meng, et al.
Published: (2026)
MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models
by: Liu, Shansong, et al.
Published: (2024)
MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models
by: Weck, Benno, et al.
Published: (2024)
Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding
by: Peng, Wujian, et al.
Published: (2023)
NOTA: Multimodal Music Notation Understanding for Visual Large Language Model
by: Tang, Mingni, et al.
Published: (2025)
CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models
by: Wu, Shangda, et al.
Published: (2024)
TALKPLAY: Multimodal Music Recommendation with Large Language Models
by: Doh, Seungheon, et al.
Published: (2025)
Towards Practical Real-Time Low-Latency Music Source Separation
by: Wu, Junyu, et al.
Published: (2025)
WeaveMuse: An Open Agentic System for Multimodal Music Understanding and Generation
by: Karystinaios, Emmanouil
Published: (2025)
ABC-Eval: Benchmarking Large Language Models on Symbolic Music Understanding and Instruction Following
by: Zhao, Jiahao, et al.
Published: (2025)
FOCUS: Towards Universal Foreground Segmentation
by: You, Zuyao, et al.
Published: (2025)
Can LLMs "Reason" in Music? An Evaluation of LLMs' Capability of Music Understanding and Generation
by: Zhou, Ziya, et al.
Published: (2024)
TinyMU: A Compact Audio-Language Model for Music Understanding
by: Li, Xiquan, et al.
Published: (2026)
SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing
by: Ma, Ziyang, et al.
Published: (2026)
Learning Accurate Segmentation Purely from Self-Supervision
by: You, Zuyao, et al.
Published: (2026)
DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning
by: Mao, Zhuoyuan, et al.
Published: (2025)
Versatile Symbolic Music-for-Music Modeling via Function Alignment
by: Jiang, Junyan, et al.
Published: (2025)
Evaluating Multimodal Large Language Models on Core Music Perception Tasks
by: Carone, Brandon James, et al.
Published: (2025)
Improving BERT for Symbolic Music Understanding Using Token Denoising and Pianoroll Prediction
by: Wang, Jun-You, et al.
Published: (2025)
A Survey of Foundation Models for Music Understanding
by: Li, Wenjun, et al.
Published: (2024)
Khala: Scaling Acoustic Token Language Models Toward High-Fidelity Music Generation
by: Liu, Jiafeng, et al.
Published: (2026)
XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework
by: Tian, Sida, et al.
Published: (2025)
Advancing the Foundation Model for Music Understanding
by: Jiang, Yi, et al.
Published: (2025)
NotaGen: Advancing Musicality in Symbolic Music Generation with Large Language Model Training Paradigms
by: Wang, Yashan, et al.
Published: (2025)
Speech-XL: Towards Long-Form Speech Understanding in Large Speech Language Models
by: Sun, Haoqin, et al.
Published: (2026)
MAJL: A Model-Agnostic Joint Learning Framework for Music Source Separation and Pitch Estimation
by: Wei, Haojie, et al.
Published: (2025)