:: Library Catalog

Bibliographic Details
Main Authors:	Zhang, David Junhao; Li, Dongxu; Le, Hung; Shou, Mike Zheng; Xiong, Caiming; Sahoo, Doyen
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2401.01827

Similar Items

Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing
by: Zeng, Ziyun, et al.
Published: (2025)

Ego-centric Predictive Model Conditioned on Hand Trajectories
by: Zhang, Binjie, et al.
Published: (2025)

VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary
by: Lin, Kevin Qinghong, et al.
Published: (2025)

PANDA: Towards Generalist Video Anomaly Detection via Agentic AI Engineer
by: Yang, Zhiwei, et al.
Published: (2025)

TPDiff: Temporal Pyramid Video Diffusion Model
by: Ran, Lingmin, et al.
Published: (2025)

DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles
by: Zhao, Rui, et al.
Published: (2025)

Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers
by: Shi, Yiqing, et al.
Published: (2025)

X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale
by: Yang, Pei, et al.
Published: (2025)

Towards A Better Metric for Text-to-Video Generation
by: Wu, Jay Zhangjie, et al.
Published: (2024)

Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
by: Zhang, David Junhao, et al.
Published: (2023)

Impossible Videos
by: Bai, Zechen, et al.
Published: (2025)

Mitty: Diffusion-based Human-to-Robot Video Generation
by: Song, Yiren, et al.
Published: (2025)

ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning
by: Zhang, David Junhao, et al.
Published: (2024)

P-Flow: Prompting Visual Effects Generation
by: Zhao, Rui, et al.
Published: (2026)

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
by: Xie, Jinheng, et al.
Published: (2024)

Edit Transfer: Learning Image Editing via Vision In-Context Relations
by: Chen, Lan, et al.
Published: (2025)

StreamingEffect: Real-Time Human-Centric Video Effect Generation
by: Song, Yiren, et al.
Published: (2026)

Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance
by: Lin, Yiqi, et al.
Published: (2026)

Show-o2: Improved Native Unified Multimodal Models
by: Xie, Jinheng, et al.
Published: (2025)

SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less than 0.2% Training Cost
by: Mei, Haiyang, et al.
Published: (2025)

UniMoD: Efficient Unified Multimodal Transformers with Mixture-of-Depths
by: Mao, Weijia, et al.
Published: (2025)

Long-Context Autoregressive Video Modeling with Next-Frame Prediction
by: Gu, Yuchao, et al.
Published: (2025)

Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator
by: Zhao, Henry Hengyuan, et al.
Published: (2023)

UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning
by: Mao, Weijia, et al.
Published: (2025)

WorldWander: Bridging Egocentric and Exocentric Worlds in Video Generation
by: Song, Quanjian, et al.
Published: (2025)

SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution
by: Song, Yiren, et al.
Published: (2026)

DragAnything: Motion Control for Anything using Entity Representation
by: Wu, Weijia, et al.
Published: (2024)

OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation
by: Song, Yiren, et al.
Published: (2026)

Future Optical Flow Prediction Improves Robot Control & Video Generation
by: Ranasinghe, Kanchana, et al.
Published: (2026)

Automated Movie Generation via Multi-Agent CoT Planning
by: Wu, Weijia, et al.
Published: (2025)

DiffSeg30k: A Multi-Turn Diffusion Editing Benchmark for Localized AIGC Detection
by: Ci, Hai, et al.
Published: (2025)

VISTA: Triplet-Supervised Video Style Transfer with Diffusion Transformers
by: Song, Yiren, et al.
Published: (2026)

MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation
by: Song, Yiren, et al.
Published: (2025)

Tuning-Free Image Editing with Fidelity and Editability via Unified Latent Diffusion Model
by: Mao, Qi, et al.
Published: (2025)

MP-Mat: A 3D-and-Instance-Aware Human Matting and Editing Framework with Multiplane Representation
by: Jiao, Siyi, et al.
Published: (2025)

Learning Video Context as Interleaved Multimodal Sequences
by: Lin, Kevin Qinghong, et al.
Published: (2024)

D-AR: Diffusion via Autoregressive Models
by: Gao, Ziteng, et al.
Published: (2025)

Paper2Video: Automatic Video Generation from Scientific Papers
by: Zhu, Zeyu, et al.
Published: (2025)

PhotoDoodle: Learning Artistic Image Editing from Few-Shot Pairwise Data
by: Huang, Shijie, et al.
Published: (2025)

ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding
by: Xue, Le, et al.
Published: (2023)