Bibliographic Details
Main Authors: Liu, Kai; Li, Jungang; Sun, Yuchong; Wu, Shengqiong; Gao, Jianzhang; Zhang, Daoan; Zhang, Wei; Jin, Sheng; Yu, Sicheng; Zhan, Geng; Ji, Jiayi; Zhou, Fan; Zheng, Liang; Yan, Shuicheng; Fei, Hao; Chua, Tat-Seng
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2512.22905
Table of Contents:
  • This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for joint audio-video (JAV) comprehension and generation. JavisGPT adopts a concise encoder-LLM-decoder architecture, featuring a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries that bridge to a pretrained JAV-DiT generator. This design enables temporally coherent audio-video understanding and generation from multimodal instructions. We design a three-stage training pipeline, consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction tuning, that progressively builds multimodal comprehension and generation on top of existing vision-language models. For instruction tuning, we construct JavisInst-Omni, a high-quality instruction dataset with over 200K GPT-4o-curated audio-video-text dialogues that cover diverse, multi-level comprehension and generation scenarios. Experiments on JAV comprehension and generation benchmarks show that JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.