| Main Authors: | Zhu, Jinguo, Wang, Weiyun, Chen, Zhe, Liu, Zhaoyang, Ye, Shenglong, Gu, Lixin, Tian, Hao, Duan, Yuchen, Su, Weijie, Shao, Jie, Gao, Zhangwei, Cui, Erfei, Wang, Xuehui, Cao, Yue, Liu, Yangzhou, Wei, Xingguang, Zhang, Hongjie, Wang, Haomin, Xu, Weiye, Li, Hao, Wang, Jiahao, Deng, Nianchen, Li, Songze, He, Yinan, Jiang, Tan, Luo, Jiapeng, Wang, Yi, He, Conghui, Shi, Botian, Zhang, Xingcheng, Shao, Wenqi, He, Junjun, Xiong, Yingtong, Qu, Wenwen, Sun, Peng, Jiao, Penglong, Lv, Han, Wu, Lijun, Zhang, Kaipeng, Deng, Huipeng, Ge, Jiaye, Chen, Kai, Wang, Limin, Dou, Min, Lu, Lewei, Zhu, Xizhou, Lu, Tong, Lin, Dahua, Qiao, Yu, Dai, Jifeng, Wang, Wenhai |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2504.10479 |
Similar Items
Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance
by: Gao, Zhangwei, et al.
Published: (2024)
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
by: Wang, Weiyun, et al.
Published: (2025)
InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models
by: Deng, Nianchen, et al.
Published: (2025)
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
by: Wang, Weiyun, et al.
Published: (2024)
Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models
by: Luo, Gen, et al.
Published: (2025)
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity
by: Liu, Yangzhou, et al.
Published: (2024)
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
by: Wang, Weiyun, et al.
Published: (2025)
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
by: Chen, Zhe, et al.
Published: (2023)
InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression
by: Lu, Dongchen, et al.
Published: (2025)
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
by: Chen, Zhe, et al.
Published: (2024)
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
by: Xu, Weiye, et al.
Published: (2025)
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
by: Luo, Gen, et al.
Published: (2024)
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
by: Li, Hao, et al.
Published: (2024)
InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing
by: Tian, Changyao, et al.
Published: (2026)
Docopilot: Improving Multimodal Models for Document-Level Understanding
by: Duan, Yuchen, et al.
Published: (2025)
DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving
by: Cui, Erfei, et al.
Published: (2023)
Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling
by: Wang, Jiahao, et al.
Published: (2025)
Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
by: Duan, Yuchen, et al.
Published: (2024)
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
by: Chen, Zhe, et al.
Published: (2024)
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
by: Li, Qingyun, et al.
Published: (2024)
NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints
by: Tian, Changyao, et al.
Published: (2025)
InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models
by: Wang, Haomin, et al.
Published: (2025)
Driving with InternVL: Oustanding Champion in the Track on Driving with Language of the Autonomous Grand Challenge at CVPR 2024
by: Li, Jiahan, et al.
Published: (2024)
EVA: Efficient Reinforcement Learning for End-to-End Video Agent
by: Zhang, Yaolun, et al.
Published: (2026)
Demystify Transformers & Convolutions in Modern Image Deep Networks
by: Hu, Xiaowei, et al.
Published: (2022)
Needle In A Multimodal Haystack
by: Wang, Weiyun, et al.
Published: (2024)
Causal Inference in Social Platforms Under Approximate Interference Networks
by: Jiang, Yiming, et al.
Published: (2024)
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
by: Wang, Yi, et al.
Published: (2024)
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces
by: Luo, Gen, et al.
Published: (2025)
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
by: Tian, Changyao, et al.
Published: (2024)
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
by: Wang, Weiyun, et al.
Published: (2024)
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
by: Yang, Chenyu, et al.
Published: (2024)
Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft
by: Li, Hao, et al.
Published: (2023)
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
by: Yang, Chenyu, et al.
Published: (2024)
ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding
by: Lu, Hao, et al.
Published: (2025)
Bi-Erasing: A Bidirectional Framework for Concept Removal in Diffusion Models
by: Chen, Hao, et al.
Published: (2025)
Parameter-Inverted Image Pyramid Networks
by: Zhu, Xizhou, et al.
Published: (2024)
Membrane‐Ion Interactions Creating Dual‐Nanoconfined Channels for Superior Mixed Ion Separations
by: Guangcheng Wang, et al.
Published: (2025)
GenesisTex2: Stable, Consistent and High-Quality Text-to-Texture Generation
by: Lu, Jiawei, et al.
Published: (2024)
MetaCaptioner: Towards Generalist Visual Captioning with Open-source Suites
by: Lei, Zhenxin, et al.
Published: (2025)