:: Library Catalog

Bibliographic Details
Main Authors:	Wu, Weijia, Zhu, Zeyu, Shou, Mike Zheng
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2503.07314

Similar Items

Multi-human Interactive Talking Dataset
by: Zhu, Zeyu, et al.
Published: (2025)

MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation
by: Wu, Weijia, et al.
Published: (2024)

DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles
by: Zhao, Rui, et al.
Published: (2025)

UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning
by: Mao, Weijia, et al.
Published: (2025)

Mitty: Diffusion-based Human-to-Robot Video Generation
by: Song, Yiren, et al.
Published: (2025)

Long-Context Autoregressive Video Modeling with Next-Frame Prediction
by: Gu, Yuchao, et al.
Published: (2025)

UniMoD: Efficient Unified Multimodal Transformers with Mixture-of-Depths
by: Mao, Weijia, et al.
Published: (2025)

The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation
by: Mao, Weijia, et al.
Published: (2025)

FakeVLM-R1: Internalizing Physical Laws via CoT for Synthetic Image Detection
by: Zhu, Leqi, et al.
Published: (2026)

DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models
by: Wu, Weijia, et al.
Published: (2023)

Paper2Video: Automatic Video Generation from Scientific Papers
by: Zhu, Zeyu, et al.
Published: (2025)

MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation
by: Song, Yiren, et al.
Published: (2025)

P-Flow: Prompting Visual Effects Generation
by: Zhao, Rui, et al.
Published: (2026)

D-AR: Diffusion via Autoregressive Models
by: Gao, Ziteng, et al.
Published: (2025)

Soap2Soap: Long Cinematic Video Remaking via Multi-Agent Collaboration
by: Song, Yiren, et al.
Published: (2026)

Towards Automated Movie Trailer Generation
by: Argaw, Dawit Mureja, et al.
Published: (2024)

VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary
by: Lin, Kevin Qinghong, et al.
Published: (2025)

Paragraph-to-Image Generation with Information-Enriched Diffusion Model
by: Wu, Weijia, et al.
Published: (2023)

PANDA: Towards Generalist Video Anomaly Detection via Agentic AI Engineer
by: Yang, Zhiwei, et al.
Published: (2025)

LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer
by: Song, Yiren, et al.
Published: (2025)

TPDiff: Temporal Pyramid Video Diffusion Model
by: Ran, Lingmin, et al.
Published: (2025)

Ego-centric Predictive Model Conditioned on Hand Trajectories
by: Zhang, Binjie, et al.
Published: (2025)

LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA
by: Huang, Jing, et al.
Published: (2025)

DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation
by: Jiang, Dongzhi, et al.
Published: (2025)

MM-MovieDubber: Towards Multi-Modal Learning for Multi-Modal Movie Dubbing
by: Zheng, Junjie, et al.
Published: (2025)

RingID: Rethinking Tree-Ring Watermarking for Enhanced Multi-Key Identification
by: Ci, Hai, et al.
Published: (2024)

OmniPSD: Layered PSD Generation with Diffusion Transformer
by: Liu, Cheng, et al.
Published: (2025)

X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale
by: Yang, Pei, et al.
Published: (2025)

StreamingEffect: Real-Time Human-Centric Video Effect Generation
by: Song, Yiren, et al.
Published: (2026)

IDProtector: An Adversarial Noise Encoder to Protect Against ID-Preserving Image Generation
by: Song, Yiren, et al.
Published: (2024)

Edit Transfer: Learning Image Editing via Vision In-Context Relations
by: Chen, Lan, et al.
Published: (2025)

The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment
by: Ouyang, Ziheng, et al.
Published: (2025)

Empowering Lightweight MLLMs with Reasoning via Long CoT SFT
by: Ou, Linyu, et al.
Published: (2025)

VideoAgent2: Enhancing the LLM-Based Agent System for Long-Form Video Understanding by Uncertainty-Aware CoT
by: Zhi, Zhuo, et al.
Published: (2025)

SurvAgent: Hierarchical CoT-Enhanced Case Banking and Dichotomy-Based Multi-Agent System for Multimodal Survival Prediction
by: Huang, Guolin, et al.
Published: (2025)

VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting
by: Ilaslan, Muhammet Furkan, et al.
Published: (2024)

MC-CoT: A Modular Collaborative CoT Framework for Zero-shot Medical-VQA with LLM and MLLM Integration
by: Wei, Lai, et al.
Published: (2024)

Parrot Captions Teach CLIP to Spot Text
by: Lin, Yiqi, et al.
Published: (2023)

Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program
by: Gao, Minghe, et al.
Published: (2025)

Multi-Modal Generative Embedding Model
by: Ma, Feipeng, et al.
Published: (2024)