Bibliographic Details
Main Authors: Bethune, Louis, Turrisi, Victor, Mlodozeniec, Bruno Kacper, Lopez, Pau Rodriguez, Boominathan, Lokesh, Bhendawade, Nikhil, Shidani, Amitis, Pelemans, Joris, Olausson, Theo X., Hjelm, Devon, Dixon, Paul, Monteiro, Joao, Ablin, Pierre, Banna, Vishnu, Blaas, Arno, Henderson, Nick, Noriy, Kari, Busbridge, Dan, Susskind, Josh, Cuturi, Marco, Belousova, Irina, Zappella, Luca, Webb, Russ, Ramapuram, Jason
Format: Preprint
Published: 2026
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2602.21472
contents Discrete diffusion models have emerged as strong alternatives to autoregressive language models, with recent work initializing and fine-tuning a base unimodal model for bimodal generation. Diverging from previous approaches, we introduce the first tri-modal masked diffusion model pretrained from scratch on text, image-text, and audio-text data. We systematically analyze multimodal scaling laws, modality mixing ratios, noise schedules, and batch-size effects, and we provide optimized inference sampling defaults. Our batch-size analysis yields a novel stochastic differential equation (SDE)-based reparameterization that eliminates the need for tuning the optimal batch size as reported in recent work. This reparameterization decouples the physical batch size, often chosen based on compute constraints (GPU saturation, FLOP efficiency, wall-clock time), from the logical batch size, chosen to balance gradient variance during stochastic optimization. Finally, we pretrain a preliminary 3B-parameter tri-modal model on 6.4T tokens, demonstrating the capabilities of a unified design and achieving strong results in text generation, text-to-image tasks, and text-to-speech tasks. Our work represents the largest-scale systematic open study of multimodal discrete diffusion models conducted to date, providing insights into scaling behaviors across multiple modalities.
format Preprint
id arxiv_https___arxiv_org_abs_2602_21472
institution arXiv
publishDate 2026
record_format arxiv
title The Design Space of Tri-Modal Masked Diffusion Models
topic Machine Learning
url https://arxiv.org/abs/2602.21472