Saved in:
| Main Authors: | Wu, Weijia, Zhu, Zeyu, Shou, Mike Zheng |
|---|---|
| Format: | Preprint |
| Published: |
2025
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2503.07314 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Similar Items
Multi-human Interactive Talking Dataset
by: Zhu, Zeyu, et al.
Published: (2025)
by: Zhu, Zeyu, et al.
Published: (2025)
MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation
by: Wu, Weijia, et al.
Published: (2024)
by: Wu, Weijia, et al.
Published: (2024)
DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles
by: Zhao, Rui, et al.
Published: (2025)
by: Zhao, Rui, et al.
Published: (2025)
UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning
by: Mao, Weijia, et al.
Published: (2025)
by: Mao, Weijia, et al.
Published: (2025)
Mitty: Diffusion-based Human-to-Robot Video Generation
by: Song, Yiren, et al.
Published: (2025)
by: Song, Yiren, et al.
Published: (2025)
Long-Context Autoregressive Video Modeling with Next-Frame Prediction
by: Gu, Yuchao, et al.
Published: (2025)
by: Gu, Yuchao, et al.
Published: (2025)
UniMoD: Efficient Unified Multimodal Transformers with Mixture-of-Depths
by: Mao, Weijia, et al.
Published: (2025)
by: Mao, Weijia, et al.
Published: (2025)
The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation
by: Mao, Weijia, et al.
Published: (2025)
by: Mao, Weijia, et al.
Published: (2025)
FakeVLM-R1: Internalizing Physical Laws via CoT for Synthetic Image Detection
by: Zhu, Leqi, et al.
Published: (2026)
by: Zhu, Leqi, et al.
Published: (2026)
DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models
by: Wu, Weijia, et al.
Published: (2023)
by: Wu, Weijia, et al.
Published: (2023)
Paper2Video: Automatic Video Generation from Scientific Papers
by: Zhu, Zeyu, et al.
Published: (2025)
by: Zhu, Zeyu, et al.
Published: (2025)
MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation
by: Song, Yiren, et al.
Published: (2025)
by: Song, Yiren, et al.
Published: (2025)
P-Flow: Prompting Visual Effects Generation
by: Zhao, Rui, et al.
Published: (2026)
by: Zhao, Rui, et al.
Published: (2026)
D-AR: Diffusion via Autoregressive Models
by: Gao, Ziteng, et al.
Published: (2025)
by: Gao, Ziteng, et al.
Published: (2025)
Soap2Soap: Long Cinematic Video Remaking via Multi-Agent Collaboration
by: Song, Yiren, et al.
Published: (2026)
by: Song, Yiren, et al.
Published: (2026)
Towards Automated Movie Trailer Generation
by: Argaw, Dawit Mureja, et al.
Published: (2024)
by: Argaw, Dawit Mureja, et al.
Published: (2024)
VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary
by: Lin, Kevin Qinghong, et al.
Published: (2025)
by: Lin, Kevin Qinghong, et al.
Published: (2025)
Paragraph-to-Image Generation with Information-Enriched Diffusion Model
by: Wu, Weijia, et al.
Published: (2023)
by: Wu, Weijia, et al.
Published: (2023)
PANDA: Towards Generalist Video Anomaly Detection via Agentic AI Engineer
by: Yang, Zhiwei, et al.
Published: (2025)
by: Yang, Zhiwei, et al.
Published: (2025)
LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer
by: Song, Yiren, et al.
Published: (2025)
by: Song, Yiren, et al.
Published: (2025)
TPDiff: Temporal Pyramid Video Diffusion Model
by: Ran, Lingmin, et al.
Published: (2025)
by: Ran, Lingmin, et al.
Published: (2025)
Ego-centric Predictive Model Conditioned on Hand Trajectories
by: Zhang, Binjie, et al.
Published: (2025)
by: Zhang, Binjie, et al.
Published: (2025)
LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA
by: Huang, Jing, et al.
Published: (2025)
by: Huang, Jing, et al.
Published: (2025)
DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation
by: Jiang, Dongzhi, et al.
Published: (2025)
by: Jiang, Dongzhi, et al.
Published: (2025)
MM-MovieDubber: Towards Multi-Modal Learning for Multi-Modal Movie Dubbing
by: Zheng, Junjie, et al.
Published: (2025)
by: Zheng, Junjie, et al.
Published: (2025)
RingID: Rethinking Tree-Ring Watermarking for Enhanced Multi-Key Identification
by: Ci, Hai, et al.
Published: (2024)
by: Ci, Hai, et al.
Published: (2024)
OmniPSD: Layered PSD Generation with Diffusion Transformer
by: Liu, Cheng, et al.
Published: (2025)
by: Liu, Cheng, et al.
Published: (2025)
X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale
by: Yang, Pei, et al.
Published: (2025)
by: Yang, Pei, et al.
Published: (2025)
StreamingEffect: Real-Time Human-Centric Video Effect Generation
by: Song, Yiren, et al.
Published: (2026)
by: Song, Yiren, et al.
Published: (2026)
IDProtector: An Adversarial Noise Encoder to Protect Against ID-Preserving Image Generation
by: Song, Yiren, et al.
Published: (2024)
by: Song, Yiren, et al.
Published: (2024)
Edit Transfer: Learning Image Editing via Vision In-Context Relations
by: Chen, Lan, et al.
Published: (2025)
by: Chen, Lan, et al.
Published: (2025)
The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment
by: Ouyang, Ziheng, et al.
Published: (2025)
by: Ouyang, Ziheng, et al.
Published: (2025)
Empowering Lightweight MLLMs with Reasoning via Long CoT SFT
by: Ou, Linyu, et al.
Published: (2025)
by: Ou, Linyu, et al.
Published: (2025)
VideoAgent2: Enhancing the LLM-Based Agent System for Long-Form Video Understanding by Uncertainty-Aware CoT
by: Zhi, Zhuo, et al.
Published: (2025)
by: Zhi, Zhuo, et al.
Published: (2025)
SurvAgent: Hierarchical CoT-Enhanced Case Banking and Dichotomy-Based Multi-Agent System for Multimodal Survival Prediction
by: Huang, Guolin, et al.
Published: (2025)
by: Huang, Guolin, et al.
Published: (2025)
VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting
by: Ilaslan, Muhammet Furkan, et al.
Published: (2024)
by: Ilaslan, Muhammet Furkan, et al.
Published: (2024)
MC-CoT: A Modular Collaborative CoT Framework for Zero-shot Medical-VQA with LLM and MLLM Integration
by: Wei, Lai, et al.
Published: (2024)
by: Wei, Lai, et al.
Published: (2024)
Parrot Captions Teach CLIP to Spot Text
by: Lin, Yiqi, et al.
Published: (2023)
by: Lin, Yiqi, et al.
Published: (2023)
Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program
by: Gao, Minghe, et al.
Published: (2025)
by: Gao, Minghe, et al.
Published: (2025)
Multi-Modal Generative Embedding Model
by: Ma, Feipeng, et al.
Published: (2024)
by: Ma, Feipeng, et al.
Published: (2024)
Similar Items
-
Multi-human Interactive Talking Dataset
by: Zhu, Zeyu, et al.
Published: (2025) -
MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation
by: Wu, Weijia, et al.
Published: (2024) -
DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles
by: Zhao, Rui, et al.
Published: (2025) -
UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning
by: Mao, Weijia, et al.
Published: (2025) -
Mitty: Diffusion-based Human-to-Robot Video Generation
by: Song, Yiren, et al.
Published: (2025)