| Main Authors: | Wang, Wenqi, Tan, Reuben, Zhu, Pengyue, Yang, Jianwei, Yang, Zhengyuan, Wang, Lijuan, Kolobov, Andrey, Gao, Jianfeng, Gong, Boqing |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2505.05456 |
Similar Items
Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation
by: Jain, Jitesh, et al.
Published: (2024)
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
by: Yan, An, et al.
Published: (2024)
Bring Metric Functions into Diffusion Models
by: An, Jie, et al.
Published: (2024)
Lifting Data-Tracing Machine Unlearning to Knowledge-Tracing for Foundation Models
by: Tan, Yuwen, et al.
Published: (2025)
Entity6K: A Large Open-Domain Evaluation Dataset for Real-World Entity Recognition
by: Qiu, Jielin, et al.
Published: (2024)
LiVOS: Light Video Object Segmentation with Gated Linear Matching
by: Liu, Qin, et al.
Published: (2024)
Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation
by: Yang, Zhengyuan, et al.
Published: (2023)
Culture in Action: Evaluating Text-to-Image Models through Social Activities
by: Malakouti, Sina, et al.
Published: (2025)
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
by: Yu, Weihao, et al.
Published: (2023)
Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models
by: Wang, Alex Jinpeng, et al.
Published: (2025)
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
by: Wang, Alex Jinpeng, et al.
Published: (2024)
Pix2Gif: Motion-Guided Diffusion for GIF Generation
by: Kandala, Hitesh, et al.
Published: (2024)
MindJourney: Test-Time Scaling with World Models for Spatial Reasoning
by: Yang, Yuncong, et al.
Published: (2025)
Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation
by: Zhai, Yuanhao, et al.
Published: (2024)
AsgardBench -- Evaluating Visually Grounded Interactive Planning Under Minimal Feedback
by: Tupini, Andrea, et al.
Published: (2026)
Attention to Neural Plagiarism: Diffusion Models Can Plagiarize Your Copyrighted Images!
by: Zou, Zihang, et al.
Published: (2026)
V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models
by: Zheng, Xiangxi, et al.
Published: (2025)
MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities
by: Yu, Weihao, et al.
Published: (2024)
Conditional Text-to-Image Generation with Reference Guidance
by: Kim, Taewook, et al.
Published: (2024)
IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation
by: Zhai, Yuanhao, et al.
Published: (2024)
The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition
by: Tan, Yuwen, et al.
Published: (2025)
Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark
by: Hao, Yunzhuo, et al.
Published: (2025)
Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization
by: Miao, Zichen, et al.
Published: (2024)
Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding
by: Xiong, Yuanhao, et al.
Published: (2023)
AutoDirector: Online Auto-scheduling Agents for Multi-sensory Composition
by: Ni, Minheng, et al.
Published: (2024)
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
by: He, Xuehai, et al.
Published: (2024)
GenXD: Generating Any 3D and 4D Scenes
by: Zhao, Yuyang, et al.
Published: (2024)
Interfacing Foundation Models' Embeddings
by: Zou, Xueyan, et al.
Published: (2023)
ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning
by: Liao, Jiaqi, et al.
Published: (2025)
Continual Adapter Tuning with Semantic Shift Compensation for Class-Incremental Learning
by: Zhou, Qinhao, et al.
Published: (2024)
Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning
by: Ni, Minheng, et al.
Published: (2025)
CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness
by: Liu, Zhihang, et al.
Published: (2025)
HypDAE: Hyperbolic Diffusion Autoencoders for Hierarchical Few-shot Image Generation
by: Li, Lingxiao, et al.
Published: (2024)
Glance: Accelerating Diffusion Models with 1 Sample
by: Dong, Zhuobai, et al.
Published: (2025)
TechImage-Bench: Rubric-Based Evaluation for Technical Image Generation
by: Ni, Minheng, et al.
Published: (2025)
Learning Sparse Visual Representations via Spatial-Semantic Factorization
by: Zhao, Theodore Zhengde, et al.
Published: (2026)
DisCo: Disentangled Control for Realistic Human Dance Generation
by: Wang, Tan, et al.
Published: (2023)
FlowInOne:Unifying Multimodal Generation as Image-in, Image-out Flow Matching
by: Yi, Junchao, et al.
Published: (2026)
Thinking in Structures: Evaluating Spatial Intelligence in Constraint-Governed Spaces
by: Yang, Chen, et al.
Published: (2026)
Struct2D: A Perception-Guided Framework for Spatial Reasoning in MLLMs
by: Zhu, Fangrui, et al.
Published: (2025)