Bibliographic Details
Main Authors: Luo, Run, Lin, Ting-En, Zhang, Haonan, Wu, Yuchuan, Liu, Xiong, Yang, Min, Li, Yongbin, Chen, Longze, Li, Jiaming, Zhang, Lei, Xia, Xiaobo, Alinejad-Rokny, Hamid, Huang, Fei
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2501.04561
_version_ 1866912600448565248
author Luo, Run
Lin, Ting-En
Zhang, Haonan
Wu, Yuchuan
Liu, Xiong
Yang, Min
Li, Yongbin
Chen, Longze
Li, Jiaming
Zhang, Lei
Xia, Xiaobo
Alinejad-Rokny, Hamid
Huang, Fei
contents Recent advancements in omnimodal learning have significantly improved understanding and generation across images, text, and speech, yet these developments remain predominantly confined to proprietary models. The lack of high-quality omnimodal datasets and the challenges of real-time emotional speech synthesis have notably hindered progress in open-source research. To address these limitations, we introduce OpenOmni, a two-stage training framework that integrates omnimodal alignment and speech generation to develop a state-of-the-art omnimodal large language model. In the alignment phase, a pre-trained speech model undergoes further training on text-image tasks, enabling (near) zero-shot generalization from vision to speech and outperforming models trained on tri-modal datasets. In the speech generation phase, a lightweight decoder is trained on speech tasks with direct preference optimization, enabling real-time emotional speech synthesis with high fidelity. Experiments show that OpenOmni surpasses state-of-the-art models across omnimodal, vision-language, and speech-language benchmarks. It achieves a 4-point absolute improvement on OmniBench over the leading open-source model VITA, despite using 5x fewer training samples and a smaller model size (7B vs. 7x8B). Additionally, OpenOmni achieves real-time speech generation with <1s latency in non-autoregressive mode, reducing inference time by 5x compared to autoregressive methods, and improves emotion classification accuracy by 7.7%.
format Preprint
id arxiv_https___arxiv_org_abs_2501_04561
institution arXiv
publishDate 2025
record_format arxiv
title OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis
topic Computation and Language
Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2501.04561