:: Library Catalog

Bibliographic Details
Main Authors:	Xie, Jinheng, Feng, Jiajun, Tian, Zhaoxu, Lin, Kevin Qinghong, Huang, Yawen, Xia, Xi, Gong, Nanxu, Zuo, Xu, Yang, Jiaqi, Zheng, Yefeng, Shou, Mike Zheng
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2404.15909

Similar Items

VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary
by: Lin, Kevin Qinghong, et al.
Published: (2025)

Paper2Video: Automatic Video Generation from Scientific Papers
by: Zhu, Zeyu, et al.
Published: (2025)

Code2Video: A Code-centric Paradigm for Educational Video Generation
by: Chen, Yanzhe, et al.
Published: (2025)

Learning Video Context as Interleaved Multimodal Sequences
by: Lin, Kevin Qinghong, et al.
Published: (2024)

Show-o2: Improved Native Unified Multimodal Models
by: Xie, Jinheng, et al.
Published: (2025)

Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models
by: Wang, Jiaqi, et al.
Published: (2025)

MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation
by: Wu, Weijia, et al.
Published: (2024)

Soap2Soap: Long Cinematic Video Remaking via Multi-Agent Collaboration
by: Song, Yiren, et al.
Published: (2026)

VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning
by: Liu, Ye, et al.
Published: (2025)

ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands
by: Hu, Siyuan, et al.
Published: (2025)

X-ray Insights Unleashed: Pioneering the Enhancement of Multi-Label Long-Tail Data
by: Yang, Xinquan, et al.
Published: (2025)

COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
by: Wang, Alex Jinpeng, et al.
Published: (2024)

WMAdapter: Adding WaterMark Control to Latent Diffusion Models
by: Ci, Hai, et al.
Published: (2024)

Dynamically Masked Discriminator for Generative Adversarial Networks
by: Zhang, Wentian, et al.
Published: (2023)

FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection
by: Ouyang, Mingyu, et al.
Published: (2026)

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
by: Xie, Jinheng, et al.
Published: (2024)

Bootstrapping SparseFormers from Vision Foundation Models
by: Gao, Ziteng, et al.
Published: (2023)

Long-Context Autoregressive Video Modeling with Next-Frame Prediction
by: Gu, Yuchao, et al.
Published: (2025)

VideoGUI: A Benchmark for GUI Automation from Instructional Videos
by: Lin, Kevin Qinghong, et al.
Published: (2024)

GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents
by: Ouyang, Mingyu, et al.
Published: (2026)

TPDiff: Temporal Pyramid Video Diffusion Model
by: Ran, Lingmin, et al.
Published: (2025)

SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less than 0.2% Training Cost
by: Mei, Haiyang, et al.
Published: (2025)

K-Space-Aware Cross-Modality Score for Synthesized Neuroimage Quality Assessment
by: Xie, Guoyang, et al.
Published: (2023)

Impossible Videos
by: Bai, Zechen, et al.
Published: (2025)

VideoLLM-online: Online Video Large Language Model for Streaming Video
by: Chen, Joya, et al.
Published: (2024)

VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation
by: Wu, Shiwei, et al.
Published: (2024)

Computer-Use Agents as Judges for Generative User Interface
by: Lin, Kevin Qinghong, et al.
Published: (2025)

PANDA: Towards Generalist Video Anomaly Detection via Agentic AI Engineer
by: Yang, Zhiwei, et al.
Published: (2025)

EVOLVE-VLA: Test-Time Training from Environment Feedback for Vision-Language-Action Models
by: Bai, Zechen, et al.
Published: (2025)

X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale
by: Yang, Pei, et al.
Published: (2025)

World-VLA-Loop: Closed-Loop Learning of Video World Model and VLA Policy
by: Liu, Xiaokang, et al.
Published: (2026)

ROICtrl: Boosting Instance Control for Visual Generation
by: Gu, Yuchao, et al.
Published: (2024)

SP-SLAM: Neural Real-Time Dense SLAM With Scene Priors
by: Hong, Zhen, et al.
Published: (2025)

P-Flow: Prompting Visual Effects Generation
by: Zhao, Rui, et al.
Published: (2026)

D-AR: Diffusion via Autoregressive Models
by: Gao, Ziteng, et al.
Published: (2025)

Ego-centric Predictive Model Conditioned on Hand Trajectories
by: Zhang, Binjie, et al.
Published: (2025)

CTNeRF: Cross-Time Transformer for Dynamic Neural Radiance Field from Monocular Video
by: Miao, Xingyu, et al.
Published: (2024)

Mitty: Diffusion-based Human-to-Robot Video Generation
by: Song, Yiren, et al.
Published: (2025)

Faster Diffusion via Temporal Attention Decomposition
by: Liu, Haozhe, et al.
Published: (2024)

VISTA: Triplet-Supervised Video Style Transfer with Diffusion Transformers
by: Song, Yiren, et al.
Published: (2026)