| Main Authors: | Zhang, Chongzhi, Zhang, Mingyuan, Teng, Zhiyang, Li, Jiayi, Zhu, Xizhou, Lu, Lewei, Liu, Ziwei, Sun, Aixin |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2401.08232 |
Similar Items
A Flexible and Scalable Framework for Video Moment Search
by: Zhang, Chongzhi, et al.
Published: (2025)
TVR-Ranking: A Dataset for Ranked Video Moment Retrieval with Imprecise Queries
by: Liang, Renjie, et al.
Published: (2024)
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
by: Yang, Chenyu, et al.
Published: (2024)
Large Motion Model for Unified Multi-Modal Motion Generation
by: Zhang, Mingyuan, et al.
Published: (2024)
DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving
by: Cui, Erfei, et al.
Published: (2023)
HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding
by: Tao, Chenxin, et al.
Published: (2024)
CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving
by: Zhang, Tianrui, et al.
Published: (2025)
Streamline Without Sacrifice -- Squeeze out Computation Redundancy in LMM
by: Wu, Penghao, et al.
Published: (2025)
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
by: Xu, Weiye, et al.
Published: (2025)
UltrAvatar: A Realistic Animatable 3D Avatar Diffusion Model with Authenticity Guided Textures
by: Zhou, Mingyuan, et al.
Published: (2024)
ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process
by: Tian, Changyao, et al.
Published: (2023)
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
by: Wang, Weiyun, et al.
Published: (2024)
Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
by: Gu, Zekai, et al.
Published: (2025)
Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft
by: Li, Hao, et al.
Published: (2023)
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
by: Wu, Jiannan, et al.
Published: (2024)
NL2Contact: Natural Language Guided 3D Hand-Object Contact Modeling with Diffusion Model
by: Zhang, Zhongqun, et al.
Published: (2024)
Error Analyses of Auto-Regressive Video Diffusion Models: A Unified Framework
by: Wang, Jing, et al.
Published: (2025)
Masked Diffusion Vision-Language Models for Temporal Action Localization
by: Wang, Fengshun, et al.
Published: (2026)
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
by: Tian, Changyao, et al.
Published: (2024)
Learning 1D Causal Visual Representation with De-focus Attention Networks
by: Tao, Chenxin, et al.
Published: (2024)
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity
by: Liu, Yangzhou, et al.
Published: (2024)
Incentivizing Temporal-Awareness in Egocentric Video Understanding Models
by: Xu, Zhiyang, et al.
Published: (2026)
Parameter-Inverted Image Pyramid Networks
by: Zhu, Xizhou, et al.
Published: (2024)
NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints
by: Tian, Changyao, et al.
Published: (2025)
Visual Jigsaw Post-Training Improves MLLMs
by: Wu, Penghao, et al.
Published: (2025)
Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
by: Duan, Yuchen, et al.
Published: (2024)
WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language
by: Lin, Zhenxiang, et al.
Published: (2023)
DiffCrossGait: Trajectory-Level Alignment for 2D-3D Cross-Modal Gait Recognition via Latent Diffusion
by: Lu, Zhiyang, et al.
Published: (2026)
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
by: Yang, Chenyu, et al.
Published: (2024)
Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer
by: Gu, Chenyang, et al.
Published: (2026)
UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation
by: Li, Teng, et al.
Published: (2025)
InfLVG: Reinforce Inference-Time Consistent Long Video Generation with GRPO
by: Fang, Xueji, et al.
Published: (2025)
Collaborative Temporal Consistency Learning for Point-supervised Natural Language Video Localization
by: Tao, Zhuo, et al.
Published: (2025)
GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior
by: Wu, Penghao, et al.
Published: (2025)
Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling
by: Ye, Zilyu, et al.
Published: (2024)
Weakly Supervised Monocular 3D Detection with a Single-View Image
by: Jiang, Xueying, et al.
Published: (2024)
Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation
by: Chen, Gordon, et al.
Published: (2026)
Open-o3-Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
by: Meng, Jiahao, et al.
Published: (2025)
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
by: Meng, Fanqing, et al.
Published: (2024)
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
by: Li, Hao, et al.
Published: (2024)