| Main Authors: | Xie, Jinheng, Feng, Jiajun, Tian, Zhaoxu, Lin, Kevin Qinghong, Huang, Yawen, Xia, Xi, Gong, Nanxu, Zuo, Xu, Yang, Jiaqi, Zheng, Yefeng, Shou, Mike Zheng |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2404.15909 |
Similar Items
VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary
by: Lin, Kevin Qinghong, et al.
Published: (2025)
Paper2Video: Automatic Video Generation from Scientific Papers
by: Zhu, Zeyu, et al.
Published: (2025)
Code2Video: A Code-centric Paradigm for Educational Video Generation
by: Chen, Yanzhe, et al.
Published: (2025)
Learning Video Context as Interleaved Multimodal Sequences
by: Lin, Kevin Qinghong, et al.
Published: (2024)
Show-o2: Improved Native Unified Multimodal Models
by: Xie, Jinheng, et al.
Published: (2025)
Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models
by: Wang, Jiaqi, et al.
Published: (2025)
MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation
by: Wu, Weijia, et al.
Published: (2024)
Soap2Soap: Long Cinematic Video Remaking via Multi-Agent Collaboration
by: Song, Yiren, et al.
Published: (2026)
VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning
by: Liu, Ye, et al.
Published: (2025)
ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands
by: Hu, Siyuan, et al.
Published: (2025)
X-ray Insights Unleashed: Pioneering the Enhancement of Multi-Label Long-Tail Data
by: Yang, Xinquan, et al.
Published: (2025)
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
by: Wang, Alex Jinpeng, et al.
Published: (2024)
WMAdapter: Adding WaterMark Control to Latent Diffusion Models
by: Ci, Hai, et al.
Published: (2024)
Dynamically Masked Discriminator for Generative Adversarial Networks
by: Zhang, Wentian, et al.
Published: (2023)
FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection
by: Ouyang, Mingyu, et al.
Published: (2026)
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
by: Xie, Jinheng, et al.
Published: (2024)
Bootstrapping SparseFormers from Vision Foundation Models
by: Gao, Ziteng, et al.
Published: (2023)
Long-Context Autoregressive Video Modeling with Next-Frame Prediction
by: Gu, Yuchao, et al.
Published: (2025)
VideoGUI: A Benchmark for GUI Automation from Instructional Videos
by: Lin, Kevin Qinghong, et al.
Published: (2024)
GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents
by: Ouyang, Mingyu, et al.
Published: (2026)
TPDiff: Temporal Pyramid Video Diffusion Model
by: Ran, Lingmin, et al.
Published: (2025)
SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less than 0.2% Training Cost
by: Mei, Haiyang, et al.
Published: (2025)
K-Space-Aware Cross-Modality Score for Synthesized Neuroimage Quality Assessment
by: Xie, Guoyang, et al.
Published: (2023)
Impossible Videos
by: Bai, Zechen, et al.
Published: (2025)
VideoLLM-online: Online Video Large Language Model for Streaming Video
by: Chen, Joya, et al.
Published: (2024)
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation
by: Wu, Shiwei, et al.
Published: (2024)
Computer-Use Agents as Judges for Generative User Interface
by: Lin, Kevin Qinghong, et al.
Published: (2025)
PANDA: Towards Generalist Video Anomaly Detection via Agentic AI Engineer
by: Yang, Zhiwei, et al.
Published: (2025)
EVOLVE-VLA: Test-Time Training from Environment Feedback for Vision-Language-Action Models
by: Bai, Zechen, et al.
Published: (2025)
X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale
by: Yang, Pei, et al.
Published: (2025)
World-VLA-Loop: Closed-Loop Learning of Video World Model and VLA Policy
by: Liu, Xiaokang, et al.
Published: (2026)
ROICtrl: Boosting Instance Control for Visual Generation
by: Gu, Yuchao, et al.
Published: (2024)
SP-SLAM: Neural Real-Time Dense SLAM With Scene Priors
by: Hong, Zhen, et al.
Published: (2025)
P-Flow: Prompting Visual Effects Generation
by: Zhao, Rui, et al.
Published: (2026)
D-AR: Diffusion via Autoregressive Models
by: Gao, Ziteng, et al.
Published: (2025)
Ego-centric Predictive Model Conditioned on Hand Trajectories
by: Zhang, Binjie, et al.
Published: (2025)
CTNeRF: Cross-Time Transformer for Dynamic Neural Radiance Field from Monocular Video
by: Miao, Xingyu, et al.
Published: (2024)
Mitty: Diffusion-based Human-to-Robot Video Generation
by: Song, Yiren, et al.
Published: (2025)
Faster Diffusion via Temporal Attention Decomposition
by: Liu, Haozhe, et al.
Published: (2024)
VISTA: Triplet-Supervised Video Style Transfer with Diffusion Transformers
by: Song, Yiren, et al.
Published: (2026)