Bibliographic Details
Main Authors: Shaker, Abdelrahman; Heakl, Ahmed; Muhammad, Jaseel; Thawkar, Ritesh; Thawakar, Omkar; Li, Senmao; Cholakkal, Hisham; Reid, Ian; Xing, Eric P.; Khan, Salman; Khan, Fahad Shahbaz
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2602.20161
author Shaker, Abdelrahman
Heakl, Ahmed
Muhammad, Jaseel
Thawkar, Ritesh
Thawakar, Omkar
Li, Senmao
Cholakkal, Hisham
Reid, Ian
Xing, Eric P.
Khan, Salman
Khan, Fahad Shahbaz
contents Unified multimodal models can both understand and generate visual content within a single architecture. Existing models, however, remain data-hungry and too heavy for deployment on edge devices. We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to a mobile device. Its core module, the Mobile Conditioning Projector (MCP), fuses vision-language features with a diffusion generator using depthwise-separable convolutions and layerwise alignment. This design enables efficient cross-modal conditioning with minimal computational cost. Trained on only a few million samples and post-trained in a novel quadruplet format (generation prompt, image, question, answer), Mobile-O jointly enhances both visual understanding and generation capabilities. Despite its efficiency, Mobile-O attains competitive or superior performance compared to other unified models, achieving 74% on GenEval and outperforming Show-O and JanusFlow by 5% and 11%, while running 6x and 11x faster, respectively. For visual understanding, Mobile-O surpasses them by 15.3% and 5.1% averaged across seven benchmarks. Running in only ~3s per 512x512 image on an iPhone, Mobile-O establishes the first practical framework for real-time unified multimodal understanding and generation on edge devices. We hope Mobile-O will ease future research in real-time unified multimodal intelligence running entirely on-device with no cloud dependency. Our code, models, datasets, and mobile application are publicly available at https://amshaker.github.io/Mobile-O/
format Preprint
id arxiv_https___arxiv_org_abs_2602_20161
institution arXiv
publishDate 2026
record_format arxiv
title Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2602.20161