| Main Authors: | Zhang, David Junhao, Li, Dongxu, Le, Hung, Shou, Mike Zheng, Xiong, Caiming, Sahoo, Doyen |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2401.01827 |
Similar Items
Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing
by: Zeng, Ziyun, et al.
Published: (2025)
Ego-centric Predictive Model Conditioned on Hand Trajectories
by: Zhang, Binjie, et al.
Published: (2025)
VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary
by: Lin, Kevin Qinghong, et al.
Published: (2025)
PANDA: Towards Generalist Video Anomaly Detection via Agentic AI Engineer
by: Yang, Zhiwei, et al.
Published: (2025)
TPDiff: Temporal Pyramid Video Diffusion Model
by: Ran, Lingmin, et al.
Published: (2025)
DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles
by: Zhao, Rui, et al.
Published: (2025)
Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers
by: Shi, Yiqing, et al.
Published: (2025)
X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale
by: Yang, Pei, et al.
Published: (2025)
Towards A Better Metric for Text-to-Video Generation
by: Wu, Jay Zhangjie, et al.
Published: (2024)
Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
by: Zhang, David Junhao, et al.
Published: (2023)
Impossible Videos
by: Bai, Zechen, et al.
Published: (2025)
Mitty: Diffusion-based Human-to-Robot Video Generation
by: Song, Yiren, et al.
Published: (2025)
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning
by: Zhang, David Junhao, et al.
Published: (2024)
P-Flow: Prompting Visual Effects Generation
by: Zhao, Rui, et al.
Published: (2026)
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
by: Xie, Jinheng, et al.
Published: (2024)
Edit Transfer: Learning Image Editing via Vision In-Context Relations
by: Chen, Lan, et al.
Published: (2025)
StreamingEffect: Real-Time Human-Centric Video Effect Generation
by: Song, Yiren, et al.
Published: (2026)
Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance
by: Lin, Yiqi, et al.
Published: (2026)
Show-o2: Improved Native Unified Multimodal Models
by: Xie, Jinheng, et al.
Published: (2025)
SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less than 0.2% Training Cost
by: Mei, Haiyang, et al.
Published: (2025)
UniMoD: Efficient Unified Multimodal Transformers with Mixture-of-Depths
by: Mao, Weijia, et al.
Published: (2025)
Long-Context Autoregressive Video Modeling with Next-Frame Prediction
by: Gu, Yuchao, et al.
Published: (2025)
Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator
by: Zhao, Henry Hengyuan, et al.
Published: (2023)
UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning
by: Mao, Weijia, et al.
Published: (2025)
WorldWander: Bridging Egocentric and Exocentric Worlds in Video Generation
by: Song, Quanjian, et al.
Published: (2025)
SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution
by: Song, Yiren, et al.
Published: (2026)
DragAnything: Motion Control for Anything using Entity Representation
by: Wu, Weijia, et al.
Published: (2024)
OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation
by: Song, Yiren, et al.
Published: (2026)
Future Optical Flow Prediction Improves Robot Control & Video Generation
by: Ranasinghe, Kanchana, et al.
Published: (2026)
Automated Movie Generation via Multi-Agent CoT Planning
by: Wu, Weijia, et al.
Published: (2025)
DiffSeg30k: A Multi-Turn Diffusion Editing Benchmark for Localized AIGC Detection
by: Ci, Hai, et al.
Published: (2025)
VISTA: Triplet-Supervised Video Style Transfer with Diffusion Transformers
by: Song, Yiren, et al.
Published: (2026)
MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation
by: Song, Yiren, et al.
Published: (2025)
Tuning-Free Image Editing with Fidelity and Editability via Unified Latent Diffusion Model
by: Mao, Qi, et al.
Published: (2025)
MP-Mat: A 3D-and-Instance-Aware Human Matting and Editing Framework with Multiplane Representation
by: Jiao, Siyi, et al.
Published: (2025)
Learning Video Context as Interleaved Multimodal Sequences
by: Lin, Kevin Qinghong, et al.
Published: (2024)
D-AR: Diffusion via Autoregressive Models
by: Gao, Ziteng, et al.
Published: (2025)
Paper2Video: Automatic Video Generation from Scientific Papers
by: Zhu, Zeyu, et al.
Published: (2025)
PhotoDoodle: Learning Artistic Image Editing from Few-Shot Pairwise Data
by: Huang, Shijie, et al.
Published: (2025)
ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding
by: Xue, Le, et al.
Published: (2023)