Library Catalog

Bibliographic Details
Main Author:	Furfaro, Fabien
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2409.15512

Similar Items

PixelBytes: Catching Unified Representation for Multimodal Generation
by: Furfaro, Fabien
Published: (2024)

Unified Multimodal Understanding via Byte-Pair Visual Encoding
by: Zhang, Wanpeng, et al.
Published: (2025)

From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities
by: Zhang, Wanpeng, et al.
Published: (2024)

UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning
by: Liu, Ye, et al.
Published: (2025)

Catch-Up Mix: Catch-Up Class for Struggling Filters in CNN
by: Kang, Minsoo, et al.
Published: (2024)

UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation
by: Li, Yi, et al.
Published: (2025)

GLaMM: Pixel Grounding Large Multimodal Model
by: Rasheed, Hanoona, et al.
Published: (2023)

Pixel-Grounded Retrieval for Knowledgeable Large Multimodal Models
by: Kim, Jeonghwan, et al.
Published: (2026)

Semantic Generative Tuning for Unified Multimodal Models
by: Yu, Songsong, et al.
Published: (2026)

The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents
by: Sun, Yuwei, et al.
Published: (2026)

Enhancing Multimodal Unified Representations for Cross Modal Generalization
by: Huang, Hai, et al.
Published: (2024)

TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
by: Qu, Liao, et al.
Published: (2024)

Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation
by: Mao, Jiawei, et al.
Published: (2025)

Steering Visual Generation in Unified Multimodal Models with Understanding Supervision
by: Liu, Zeyu, et al.
Published: (2026)

REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing
by: Xu, Weihan, et al.
Published: (2025)

Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine
by: Huang, Xiaoshuang, et al.
Published: (2024)

OmniCam: Unified Multimodal Video Generation via Camera Control
by: Yang, Xiaoda, et al.
Published: (2025)

NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation
by: Zhang, Huichao, et al.
Published: (2026)

HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation
by: Xiao, Yicheng, et al.
Published: (2025)

Archon: A Unified Multimodal Model for Holistic Digital Human Generation
by: Bao, Chong, et al.
Published: (2026)

Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification
by: Chen, Zizhao, et al.
Published: (2026)

PixelGen: Improving Pixel Diffusion with Perceptual Supervision
by: Ma, Zehong, et al.
Published: (2026)

UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding
by: Jiao, Yang, et al.
Published: (2025)

UniCMs: A Unified Consistency Model For Efficient Multimodal Generation and Understanding
by: Xu, Chenkai, et al.
Published: (2025)

Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation
by: Inclusion AI, et al.
Published: (2025)

PixelMan: Consistent Object Editing with Diffusion Models via Pixel Manipulation and Generation
by: Jiang, Liyao, et al.
Published: (2024)

Humor in Pixels: Benchmarking Large Multimodal Models Understanding of Online Comics
by: Ryan, Yuriel, et al.
Published: (2025)

PixelArena: A benchmark for Pixel-Precision Visual Intelligence
by: Liang, Feng, et al.
Published: (2025)

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
by: Wu, Chengyue, et al.
Published: (2024)

Pixel-Aligned Multi-View Generation with Depth Guided Decoder
by: Tang, Zhenggang, et al.
Published: (2024)

L2P: Unlocking Latent Potential for Pixel Generation
by: Chen, Zhennan, et al.
Published: (2026)

UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing
by: Li, Yiheng, et al.
Published: (2024)

UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation
by: Zhao, Xiangyu, et al.
Published: (2024)

Lumina-OmniLV: A Unified Multimodal Framework for General Low-Level Vision
by: Pu, Yuandong, et al.
Published: (2025)

Nexus-Gen: Unified Image Understanding, Generation, and Editing via Prefilled Autoregression in Shared Embedding Space
by: Zhang, Hong, et al.
Published: (2025)

SafePLUG: Empowering Multimodal LLMs with Pixel-Level Insight and Temporal Grounding for Traffic Accident Understanding
by: Sheng, Zihao, et al.
Published: (2025)

UrbanGraphEmbeddings: Learning and Evaluating Spatially Grounded Multimodal Embeddings for Urban Science
by: Zhang, Jie, et al.
Published: (2026)

Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
by: Wang, Haochen, et al.
Published: (2025)

Understanding and Harnessing Sparsity in Unified Multimodal Models
by: He, Shwai, et al.
Published: (2025)

ImAgent: A Unified Multimodal Agent Framework for Test-Time Scalable Image Generation
by: Wang, Kaishen, et al.
Published: (2025)