Bibliographic Details
Main Authors: Tong, Shengbang; Fan, David; Nguyen, John; Brown, Ellis; Zhou, Gaoyue; Qian, Shengyi; Zheng, Boyang; Vallaeys, Théophane; Han, Junlin; Fergus, Rob; Murray, Naila; Ghazvininejad, Marjan; Lewis, Mike; Ballas, Nicolas; Bar, Amir; Rabbat, Michael; Verbeek, Jakob; Zettlemoyer, Luke; Sinha, Koustuv; LeCun, Yann; Xie, Saining
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2603.03276
_version_ 1866912940517490688
author Tong, Shengbang
Fan, David
Nguyen, John
Brown, Ellis
Zhou, Gaoyue
Qian, Shengyi
Zheng, Boyang
Vallaeys, Théophane
Han, Junlin
Fergus, Rob
Murray, Naila
Ghazvininejad, Marjan
Lewis, Mike
Ballas, Nicolas
Bar, Amir
Rabbat, Michael
Verbeek, Jakob
Zettlemoyer, Luke
Sinha, Koustuv
LeCun, Yann
Xie, Saining
contents The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models.
format Preprint
id arxiv_https___arxiv_org_abs_2603_03276
institution arXiv
publishDate 2026
record_format arxiv
title Beyond Language Modeling: An Exploration of Multimodal Pretraining
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2603.03276