| Main Authors: | Ma, Lichen, Fu, Xiaolong, Zhou, Gaojing, Guo, Zipeng, Zhu, Ting, Liu, Yichun, Shi, Yu, Li, Jason, Huang, Junshi |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2601.08321 |
Similar Items
Dynamic-TreeRPO: Breaking the Independent Trajectory Bottleneck with Structured Sampling
by: Fu, Xiaolong, et al.
Published: (2025)
FrequencyBooster: Full-Frequency Modeling for High-Fidelity Pixel Diffusion
by: Ma, Lichen, et al.
Published: (2026)
Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition
by: He, Yu, et al.
Published: (2026)
RePainter: Empowering E-commerce Object Removal via Spatial-matting Reinforcement Learning
by: Guo, Zipeng, et al.
Published: (2025)
Meta-TTRL: A Metacognitive Framework for Self-Improving Test-Time Reinforcement Learning in Unified Multimodal Models
by: Tan, Lit Sin, et al.
Published: (2026)
HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion
by: He, Yu, et al.
Published: (2026)
QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
by: Zhao, Yue, et al.
Published: (2025)
A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models
by: Shuai, Xincheng, et al.
Published: (2024)
CharGen: High Accurate Character-Level Visual Text Generation Model with MultiModal Encoder
by: Ma, Lichen, et al.
Published: (2024)
SeedEdit: Align Image Re-Generation to Image Editing
by: Shi, Yichun, et al.
Published: (2024)
MULTI: Multimodal Understanding Leaderboard with Text and Images
by: Zhu, Zichen, et al.
Published: (2024)
Training-Free Text-Guided Image Editing with Visual Autoregressive Model
by: Wang, Yufei, et al.
Published: (2025)
Multimodal Large Language Models for Text-rich Image Understanding: A Comprehensive Review
by: Fu, Pei, et al.
Published: (2025)
Rethinking Structure Preservation in Text-Guided Image Editing with Visual Autoregressive Models
by: Xia, Tao, et al.
Published: (2026)
TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding
by: Luan, Bozhi, et al.
Published: (2024)
UniFinEval: Towards Unified Evaluation of Financial Multimodal Models across Text, Images and Videos
by: Yang, Zhi, et al.
Published: (2026)
Enhancing Multimodal Understanding with CLIP-Based Image-to-Text Transformation
by: Che, Chang, et al.
Published: (2024)
Unifying Contrastive and Generative Objectives for Visual Understanding and Text-to-Image Generation
by: Li, Chao, et al.
Published: (2026)
Visual Harmony: Text-Visual Interplay in Circular Infographics
by: He, Shuqi, et al.
Published: (2024)
UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs
by: Jiang, Houcheng, et al.
Published: (2026)
Dual Diffusion for Unified Image Generation and Understanding
by: Li, Zijie, et al.
Published: (2024)
PrompTHis: Visualizing the Process and Influence of Prompt Editing during Text-to-Image Creation
by: Guo, Yuhan, et al.
Published: (2024)
AnyText: Multilingual Visual Text Generation And Editing
by: Tuo, Yuxiang, et al.
Published: (2023)
InstructUDrag: Joint Text Instructions and Object Dragging for Interactive Image Editing
by: Yu, Haoran, et al.
Published: (2025)
Text to Image Generation and Editing: A Survey
by: Yang, Pengfei, et al.
Published: (2025)
Safety of Multimodal Large Language Models on Images and Texts
by: Liu, Xin, et al.
Published: (2024)
Towards Visual Text Grounding of Multimodal Large Language Model
by: Li, Ming, et al.
Published: (2025)
Plot'n Polish: Zero-shot Story Visualization and Disentangled Editing with Text-to-Image Diffusion Models
by: Akdemir, Kiymet, et al.
Published: (2025)
InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing
by: Tian, Changyao, et al.
Published: (2026)
EruDiff: Refactoring Knowledge in Diffusion Models for Advanced Text-to-Image Synthesis
by: Guo, Xiefan, et al.
Published: (2026)
LTCF-Net: A Transformer-Enhanced Dual-Channel Fourier Framework for Low-Light Image Restoration
by: Zhang, Gaojing, et al.
Published: (2024)
ReFACT: Updating Text-to-Image Models by Editing the Text Encoder
by: Arad, Dana, et al.
Published: (2023)
Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing
by: Liu, Bingyan, et al.
Published: (2024)
AdapEdit: Spatio-Temporal Guided Adaptive Editing Algorithm for Text-Based Continuity-Sensitive Image Editing
by: Ma, Zhiyuan, et al.
Published: (2023)
Vistoria: A Multimodal System to Support Fictional Story Writing through Instrumental Text-Image Co-Editing
by: Fu, Kexue, et al.
Published: (2025)
Multimodal LLMs as Customized Reward Models for Text-to-Image Generation
by: Zhou, Shijie, et al.
Published: (2025)
ASCIIBench: Evaluating Language-Model-Based Understanding of Visually-Oriented Text
by: Luo, Kerry, et al.
Published: (2025)
A Token-level Text Image Foundation Model for Document Understanding
by: Guan, Tongkun, et al.
Published: (2025)
Text-driven Multiplanar Visual Interaction for Semi-supervised Medical Image Segmentation
by: Huang, Kaiwen, et al.
Published: (2025)
Query-Kontext: An Unified Multimodal Model for Image Generation and Editing
by: Song, Yuxin, et al.
Published: (2025)