
Bibliographic Details
Main Authors:	Wang, Wenqi; Tan, Reuben; Zhu, Pengyue; Yang, Jianwei; Yang, Zhengyuan; Wang, Lijuan; Kolobov, Andrey; Gao, Jianfeng; Gong, Boqing
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2505.05456

Similar Items

Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation
by: Jain, Jitesh, et al.
Published: (2024)

List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
by: Yan, An, et al.
Published: (2024)

Bring Metric Functions into Diffusion Models
by: An, Jie, et al.
Published: (2024)

Lifting Data-Tracing Machine Unlearning to Knowledge-Tracing for Foundation Models
by: Tan, Yuwen, et al.
Published: (2025)

Entity6K: A Large Open-Domain Evaluation Dataset for Real-World Entity Recognition
by: Qiu, Jielin, et al.
Published: (2024)

LiVOS: Light Video Object Segmentation with Gated Linear Matching
by: Liu, Qin, et al.
Published: (2024)

Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation
by: Yang, Zhengyuan, et al.
Published: (2023)

Culture in Action: Evaluating Text-to-Image Models through Social Activities
by: Malakouti, Sina, et al.
Published: (2025)

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
by: Yu, Weihao, et al.
Published: (2023)

Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models
by: Wang, Alex Jinpeng, et al.
Published: (2025)

COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
by: Wang, Alex Jinpeng, et al.
Published: (2024)

Pix2Gif: Motion-Guided Diffusion for GIF Generation
by: Kandala, Hitesh, et al.
Published: (2024)

MindJourney: Test-Time Scaling with World Models for Spatial Reasoning
by: Yang, Yuncong, et al.
Published: (2025)

Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation
by: Zhai, Yuanhao, et al.
Published: (2024)

AsgardBench -- Evaluating Visually Grounded Interactive Planning Under Minimal Feedback
by: Tupini, Andrea, et al.
Published: (2026)

Attention to Neural Plagiarism: Diffusion Models Can Plagiarize Your Copyrighted Images!
by: Zou, Zihang, et al.
Published: (2026)

V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models
by: Zheng, Xiangxi, et al.
Published: (2025)

MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities
by: Yu, Weihao, et al.
Published: (2024)

Conditional Text-to-Image Generation with Reference Guidance
by: Kim, Taewook, et al.
Published: (2024)

IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation
by: Zhai, Yuanhao, et al.
Published: (2024)

The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition
by: Tan, Yuwen, et al.
Published: (2025)

Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark
by: Hao, Yunzhuo, et al.
Published: (2025)

Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization
by: Miao, Zichen, et al.
Published: (2024)

Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding
by: Xiong, Yuanhao, et al.
Published: (2023)

AutoDirector: Online Auto-scheduling Agents for Multi-sensory Composition
by: Ni, Minheng, et al.
Published: (2024)

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
by: He, Xuehai, et al.
Published: (2024)

GenXD: Generating Any 3D and 4D Scenes
by: Zhao, Yuyang, et al.
Published: (2024)

Interfacing Foundation Models' Embeddings
by: Zou, Xueyan, et al.
Published: (2023)

ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning
by: Liao, Jiaqi, et al.
Published: (2025)

Continual Adapter Tuning with Semantic Shift Compensation for Class-Incremental Learning
by: Zhou, Qinhao, et al.
Published: (2024)

Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning
by: Ni, Minheng, et al.
Published: (2025)

CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness
by: Liu, Zhihang, et al.
Published: (2025)

HypDAE: Hyperbolic Diffusion Autoencoders for Hierarchical Few-shot Image Generation
by: Li, Lingxiao, et al.
Published: (2024)

Glance: Accelerating Diffusion Models with 1 Sample
by: Dong, Zhuobai, et al.
Published: (2025)

TechImage-Bench: Rubric-Based Evaluation for Technical Image Generation
by: Ni, Minheng, et al.
Published: (2025)

Learning Sparse Visual Representations via Spatial-Semantic Factorization
by: Zhao, Theodore Zhengde, et al.
Published: (2026)

DisCo: Disentangled Control for Realistic Human Dance Generation
by: Wang, Tan, et al.
Published: (2023)

FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching
by: Yi, Junchao, et al.
Published: (2026)

Thinking in Structures: Evaluating Spatial Intelligence in Constraint-Governed Spaces
by: Yang, Chen, et al.
Published: (2026)

Struct2D: A Perception-Guided Framework for Spatial Reasoning in MLLMs
by: Zhu, Fangrui, et al.
Published: (2025)