:: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Chongzhi; Zhang, Mingyuan; Teng, Zhiyang; Li, Jiayi; Zhu, Xizhou; Lu, Lewei; Liu, Ziwei; Sun, Aixin
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2401.08232

Similar Items

A Flexible and Scalable Framework for Video Moment Search
by: Zhang, Chongzhi, et al.
Published: (2025)

TVR-Ranking: A Dataset for Ranked Video Moment Retrieval with Imprecise Queries
by: Liang, Renjie, et al.
Published: (2024)

PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
by: Yang, Chenyu, et al.
Published: (2024)

Large Motion Model for Unified Multi-Modal Motion Generation
by: Zhang, Mingyuan, et al.
Published: (2024)

DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving
by: Cui, Erfei, et al.
Published: (2023)

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding
by: Tao, Chenxin, et al.
Published: (2024)

CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving
by: Zhang, Tianrui, et al.
Published: (2025)

Streamline Without Sacrifice -- Squeeze out Computation Redundancy in LMM
by: Wu, Penghao, et al.
Published: (2025)

VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
by: Xu, Weiye, et al.
Published: (2025)

UltrAvatar: A Realistic Animatable 3D Avatar Diffusion Model with Authenticity Guided Textures
by: Zhou, Mingyuan, et al.
Published: (2024)

ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process
by: Tian, Changyao, et al.
Published: (2023)

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
by: Wang, Weiyun, et al.
Published: (2024)

Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
by: Gu, Zekai, et al.
Published: (2025)

Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft
by: Li, Hao, et al.
Published: (2023)

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
by: Wu, Jiannan, et al.
Published: (2024)

NL2Contact: Natural Language Guided 3D Hand-Object Contact Modeling with Diffusion Model
by: Zhang, Zhongqun, et al.
Published: (2024)

Error Analyses of Auto-Regressive Video Diffusion Models: A Unified Framework
by: Wang, Jing, et al.
Published: (2025)

Masked Diffusion Vision-Language Models for Temporal Action Localization
by: Wang, Fengshun, et al.
Published: (2026)

MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
by: Tian, Changyao, et al.
Published: (2024)

Learning 1D Causal Visual Representation with De-focus Attention Networks
by: Tao, Chenxin, et al.
Published: (2024)

MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity
by: Liu, Yangzhou, et al.
Published: (2024)

Incentivizing Temporal-Awareness in Egocentric Video Understanding Models
by: Xu, Zhiyang, et al.
Published: (2026)

Parameter-Inverted Image Pyramid Networks
by: Zhu, Xizhou, et al.
Published: (2024)

NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints
by: Tian, Changyao, et al.
Published: (2025)

Visual Jigsaw Post-Training Improves MLLMs
by: Wu, Penghao, et al.
Published: (2025)

Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
by: Duan, Yuchen, et al.
Published: (2024)

WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language
by: Lin, Zhenxiang, et al.
Published: (2023)

DiffCrossGait: Trajectory-Level Alignment for 2D-3D Cross-Modal Gait Recognition via Latent Diffusion
by: Lu, Zhiyang, et al.
Published: (2026)

Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
by: Yang, Chenyu, et al.
Published: (2024)

Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer
by: Gu, Chenyang, et al.
Published: (2026)

UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation
by: Li, Teng, et al.
Published: (2025)

InfLVG: Reinforce Inference-Time Consistent Long Video Generation with GRPO
by: Fang, Xueji, et al.
Published: (2025)

Collaborative Temporal Consistency Learning for Point-supervised Natural Language Video Localization
by: Tao, Zhuo, et al.
Published: (2025)

GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior
by: Wu, Penghao, et al.
Published: (2025)

Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling
by: Ye, Zilyu, et al.
Published: (2024)

Weakly Supervised Monocular 3D Detection with a Single-View Image
by: Jiang, Xueying, et al.
Published: (2024)

Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation
by: Chen, Gordon, et al.
Published: (2026)

Open-o3-Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
by: Meng, Jiahao, et al.
Published: (2025)

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
by: Meng, Fanqing, et al.
Published: (2024)

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
by: Li, Hao, et al.
Published: (2024)