| Main Authors: | Yang, Ling, Zhang, Xinchen, Tian, Ye, Shang, Chenming, Xu, Minghao, Zhang, Wentao, Cui, Bin |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2502.12148 |
Similar Items
Diffusion-Sharpening: Fine-tuning Diffusion Models with Denoising Trajectory Sharpening
by: Tian, Ye, et al.
Published: (2025)
Understanding Multimodal Deep Neural Networks: A Concept Selection View
by: Shang, Chenming, et al.
Published: (2024)
MMaDA: Multimodal Large Diffusion Language Models
by: Yang, Ling, et al.
Published: (2025)
MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods
by: Lin, Honglin, et al.
Published: (2026)
Consistency Flow Matching: Defining Straight Flows with Velocity Consistency
by: Yang, Ling, et al.
Published: (2024)
IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation
by: Zhang, Xinchen, et al.
Published: (2024)
Generative Universal Verifier as Multimodal Meta-Reasoner
by: Zhang, Xinchen, et al.
Published: (2025)
X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation
by: Ma, Jian, et al.
Published: (2025)
RecipeGen: A Step-Aligned Multimodal Benchmark for Real-World Recipe Generation
by: Zhang, Ruoxuan, et al.
Published: (2025)
Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
by: Wu, Size, et al.
Published: (2025)
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
by: Qu, Liao, et al.
Published: (2024)
FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
by: Liu, Zheng, et al.
Published: (2025)
Breaking the Encoder Barrier for Seamless Video-Language Understanding
by: Li, Handong, et al.
Published: (2025)
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
by: Zhao, Shanshan, et al.
Published: (2025)
Retrieval-Augmented Generation for AI-Generated Content: A Survey
by: Zhao, Penghao, et al.
Published: (2024)
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
by: Chen, Zhe, et al.
Published: (2024)
RealCompo: Balancing Realism and Compositionality Improves Text-to-Image Diffusion Models
by: Zhang, Xinchen, et al.
Published: (2024)
VQGraph: Rethinking Graph Representation Space for Bridging GNNs and MLPs
by: Yang, Ling, et al.
Published: (2023)
Mobius: Text to Seamless Looping Video Generation via Latent Shift
by: Bi, Xiuli, et al.
Published: (2025)
Universal Medical Image Representation Learning with Compositional Decoders
by: Wang, Kaini, et al.
Published: (2024)
DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming
by: Zhang, Jiaxin, et al.
Published: (2024)
AnyCharV: Bootstrap Controllable Character Video Generation with Fine-to-Coarse Guidance
by: Wang, Zhao, et al.
Published: (2025)
Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval?
by: Feng, Hengyi, et al.
Published: (2025)
Do Joint Audio-Video Generation Models Understand Physics?
by: Cui, Zijun, et al.
Published: (2026)
Closing the Safety Gap: Surgical Concept Erasure in Visual Autoregressive Models
by: Zhong, Xinhao, et al.
Published: (2025)
Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs
by: Yang, Ling, et al.
Published: (2024)
NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation
by: Zhang, Huichao, et al.
Published: (2026)
FlowTok: Flowing Seamlessly Across Text and Image Tokens
by: He, Ju, et al.
Published: (2025)
EditWorld: Simulating World Dynamics for Instruction-Following Image Editing
by: Yang, Ling, et al.
Published: (2024)
VideoTetris: Towards Compositional Text-to-Video Generation
by: Tian, Ye, et al.
Published: (2024)
Reversing the Flow: Generation-to-Understanding Synergy in Large Multimodal Models
by: Tong, Yujun, et al.
Published: (2026)
LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts
by: Cai, Qifeng, et al.
Published: (2025)
ESCA: Enabling Seamless Codec Avatar Execution through Algorithm and Hardware Co-Optimization for Virtual Reality
by: Zhu, Mingzhi, et al.
Published: (2025)
InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing
by: Tian, Changyao, et al.
Published: (2026)
GenAD: Generative End-to-End Autonomous Driving
by: Zheng, Wenzhao, et al.
Published: (2024)
UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation
by: Tian, Rui, et al.
Published: (2025)
FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization
by: Tan, Shuai, et al.
Published: (2024)
From Prompt to Progression: Taming Video Diffusion Models for Seamless Attribute Transition
by: Lo, Ling, et al.
Published: (2025)
SciFlow-Bench: Evaluating Structure-Aware Scientific Diagram Generation via Inverse Parsing
by: Zhang, Tong, et al.
Published: (2026)
DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes
by: Liu, Jinxiu, et al.
Published: (2024)