Library Catalog

Bibliographic Details
Main Authors:	Yang, Ling; Zhang, Xinchen; Tian, Ye; Shang, Chenming; Xu, Minghao; Zhang, Wentao; Cui, Bin
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2502.12148

Similar Items

Diffusion-Sharpening: Fine-tuning Diffusion Models with Denoising Trajectory Sharpening
by: Tian, Ye, et al.
Published: (2025)

Understanding Multimodal Deep Neural Networks: A Concept Selection View
by: Shang, Chenming, et al.
Published: (2024)

MMaDA: Multimodal Large Diffusion Language Models
by: Yang, Ling, et al.
Published: (2025)

MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods
by: Lin, Honglin, et al.
Published: (2026)

Consistency Flow Matching: Defining Straight Flows with Velocity Consistency
by: Yang, Ling, et al.
Published: (2024)

IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation
by: Zhang, Xinchen, et al.
Published: (2024)

Generative Universal Verifier as Multimodal Meta-Reasoner
by: Zhang, Xinchen, et al.
Published: (2025)

X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation
by: Ma, Jian, et al.
Published: (2025)

RecipeGen: A Step-Aligned Multimodal Benchmark for Real-World Recipe Generation
by: Zhang, Ruoxuan, et al.
Published: (2025)

Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
by: Wu, Size, et al.
Published: (2025)

TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
by: Qu, Liao, et al.
Published: (2024)

FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
by: Liu, Zheng, et al.
Published: (2025)

Breaking the Encoder Barrier for Seamless Video-Language Understanding
by: Li, Handong, et al.
Published: (2025)

Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
by: Zhao, Shanshan, et al.
Published: (2025)

Retrieval-Augmented Generation for AI-Generated Content: A Survey
by: Zhao, Penghao, et al.
Published: (2024)

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
by: Chen, Zhe, et al.
Published: (2024)

RealCompo: Balancing Realism and Compositionality Improves Text-to-Image Diffusion Models
by: Zhang, Xinchen, et al.
Published: (2024)

VQGraph: Rethinking Graph Representation Space for Bridging GNNs and MLPs
by: Yang, Ling, et al.
Published: (2023)

Mobius: Text to Seamless Looping Video Generation via Latent Shift
by: Bi, Xiuli, et al.
Published: (2025)

Universal Medical Image Representation Learning with Compositional Decoders
by: Wang, Kaini, et al.
Published: (2024)

DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming
by: Zhang, Jiaxin, et al.
Published: (2024)

AnyCharV: Bootstrap Controllable Character Video Generation with Fine-to-Coarse Guidance
by: Wang, Zhao, et al.
Published: (2025)

Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval?
by: Feng, Hengyi, et al.
Published: (2025)

Do Joint Audio-Video Generation Models Understand Physics?
by: Cui, Zijun, et al.
Published: (2026)

Closing the Safety Gap: Surgical Concept Erasure in Visual Autoregressive Models
by: Zhong, Xinhao, et al.
Published: (2025)

Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs
by: Yang, Ling, et al.
Published: (2024)

NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation
by: Zhang, Huichao, et al.
Published: (2026)

FlowTok: Flowing Seamlessly Across Text and Image Tokens
by: He, Ju, et al.
Published: (2025)

EditWorld: Simulating World Dynamics for Instruction-Following Image Editing
by: Yang, Ling, et al.
Published: (2024)

VideoTetris: Towards Compositional Text-to-Video Generation
by: Tian, Ye, et al.
Published: (2024)

Reversing the Flow: Generation-to-Understanding Synergy in Large Multimodal Models
by: Tong, Yujun, et al.
Published: (2026)

LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts
by: Cai, Qifeng, et al.
Published: (2025)

ESCA: Enabling Seamless Codec Avatar Execution through Algorithm and Hardware Co-Optimization for Virtual Reality
by: Zhu, Mingzhi, et al.
Published: (2025)

InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing
by: Tian, Changyao, et al.
Published: (2026)

GenAD: Generative End-to-End Autonomous Driving
by: Zheng, Wenzhao, et al.
Published: (2024)

UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation
by: Tian, Rui, et al.
Published: (2025)

FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization
by: Tan, Shuai, et al.
Published: (2024)

From Prompt to Progression: Taming Video Diffusion Models for Seamless Attribute Transition
by: Lo, Ling, et al.
Published: (2025)

SciFlow-Bench: Evaluating Structure-Aware Scientific Diagram Generation via Inverse Parsing
by: Zhang, Tong, et al.
Published: (2026)

DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes
by: Liu, Jinxiu, et al.
Published: (2024)