| Main Authors: | Tao, Chenxin, Zhu, Xizhou, Su, Shiqian, Lu, Lewei, Tian, Changyao, Luo, Xuan, Huang, Gao, Li, Hongsheng, Qiao, Yu, Zhou, Jie, Dai, Jifeng |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2406.04342 |
Similar Items
ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process
by: Tian, Changyao, et al.
Published: (2023)
Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
by: Wang, Zhaokai, et al.
Published: (2025)
HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding
by: Tao, Chenxin, et al.
Published: (2024)
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
by: Li, Hao, et al.
Published: (2024)
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
by: Tian, Changyao, et al.
Published: (2024)
NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints
by: Tian, Changyao, et al.
Published: (2025)
Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
by: Duan, Yuchen, et al.
Published: (2024)
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
by: Yang, Chenyu, et al.
Published: (2024)
Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft
by: Li, Hao, et al.
Published: (2023)
Parameter-Inverted Image Pyramid Networks
by: Zhu, Xizhou, et al.
Published: (2024)
Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications
by: Xiong, Yuwen, et al.
Published: (2024)
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
by: Yang, Chenyu, et al.
Published: (2024)
Demystify Transformers & Convolutions in Modern Image Deep Networks
by: Hu, Xiaowei, et al.
Published: (2022)
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity
by: Liu, Yangzhou, et al.
Published: (2024)
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
by: Wang, Weiyun, et al.
Published: (2024)
Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models
by: Luo, Gen, et al.
Published: (2025)
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
by: Chen, Zhe, et al.
Published: (2023)
Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space
by: Li, Yan, et al.
Published: (2025)
CoMemo: LVLMs Need Image Context with Image Memory
by: Liu, Shi, et al.
Published: (2025)
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
by: Luo, Gen, et al.
Published: (2024)
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
by: Wang, Weiyun, et al.
Published: (2025)
ZeroGUI: Automating Online GUI Learning at Zero Human Cost
by: Yang, Chenyu, et al.
Published: (2025)
DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving
by: Cui, Erfei, et al.
Published: (2023)
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
by: Wang, Weiyun, et al.
Published: (2024)
GenExam: A Multidisciplinary Text-to-Image Exam
by: Wang, Zhaokai, et al.
Published: (2025)
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
by: Wu, Jiannan, et al.
Published: (2024)
Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance
by: Gao, Zhangwei, et al.
Published: (2024)
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
by: Xu, Weiye, et al.
Published: (2025)
LangBridge: Interpreting Image as a Combination of Language Embeddings
by: Liao, Jiaqi, et al.
Published: (2025)
V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding
by: Ge, Junqi, et al.
Published: (2024)
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces
by: Luo, Gen, et al.
Published: (2025)
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
by: Meng, Fanqing, et al.
Published: (2024)
ScaleEdit-12M: Scaling Open-Source Image Editing Data Generation via Multi-Agent Framework
by: Chen, Guanzhou, et al.
Published: (2026)
Needle In A Multimodal Haystack
by: Wang, Weiyun, et al.
Published: (2024)
MiroFlow: Towards High-Performance and Robust Open-Source Agent Framework for General Deep Research Tasks
by: Su, Shiqian, et al.
Published: (2026)
Multi-scale 2D Temporal Map Diffusion Models for Natural Language Video Localization
by: Zhang, Chongzhi, et al.
Published: (2024)
Docopilot: Improving Multimodal Models for Document-Level Understanding
by: Duan, Yuchen, et al.
Published: (2025)
Symmetries, Bifurcations and Control for Traveling Waves of the Drinfeld‐Sokolov‐Wilson System
by: Chenxin Luo, et al.
Published: (2026)
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
by: Hou, Zhi, et al.
Published: (2025)
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
by: Fang, Rongyao, et al.
Published: (2024)