| Main Authors: | Zhu, Minjie, Zhu, Yichen, Liu, Xin, Liu, Ning, Xu, Zhiyuan, Shen, Chaomin, Peng, Yaxin, Ou, Zhicai, Feng, Feifei, Tang, Jian |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2403.06199 |
Similar Items
LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model
by: Zhu, Yichen, et al.
Published: (2024)
ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model
by: Zhou, Zhongyi, et al.
Published: (2025)
Language-Conditioned Robotic Manipulation with Fast and Slow Thinking
by: Zhu, Minjie, et al.
Published: (2024)
MMRo: Are Multimodal LLMs Eligible as the Brain for In-Home Robotics?
by: Li, Jinming, et al.
Published: (2024)
TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation
by: Wen, Junjie, et al.
Published: (2024)
Object-Centric Instruction Augmentation for Robotic Manipulation
by: Wen, Junjie, et al.
Published: (2024)
ObjectVLA: End-to-End Open-World Object Manipulation Without Demonstration
by: Zhu, Minjie, et al.
Published: (2025)
Diffusion-VLA: Generalizable and Interpretable Robot Foundation Model via Self-Generated Reasoning
by: Wen, Junjie, et al.
Published: (2024)
Fresh-CL: Feature Realignment through Experts on Hypersphere in Continual Learning
by: Zhou, Zhongyi, et al.
Published: (2025)
Scaling Diffusion Policy in Transformer to 1 Billion Parameters for Robotic Manipulation
by: Zhu, Minjie, et al.
Published: (2024)
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
by: Wen, Junjie, et al.
Published: (2025)
EDT: An Efficient Diffusion Transformer Framework Inspired by Human-like Sketching
by: Chen, Xinwang, et al.
Published: (2024)
Efficient Feature Fusion for UAV Object Detection
by: Wang, Xudong, et al.
Published: (2025)
PointVLA: Injecting the 3D World into Vision-Language-Action Models
by: Li, Chengmeng, et al.
Published: (2025)
Visual Robotic Manipulation with Depth-Aware Pretraining
by: Wang, Wanying, et al.
Published: (2024)
dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought
by: Wen, Junjie, et al.
Published: (2025)
ChatVLA-2: Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge
by: Zhou, Zhongyi, et al.
Published: (2025)
Retrieval-Augmented Embodied Agents
by: Zhu, Yichen, et al.
Published: (2024)
WorldEval: World Model as Real-World Robot Policies Evaluator
by: Li, Yaxuan, et al.
Published: (2025)
Safety of Multimodal Large Language Models on Images and Texts
by: Liu, Xin, et al.
Published: (2024)
CoA-VLA: Improving Vision-Language-Action Models via Visual-Textual Chain-of-Affordance
by: Li, Jinming, et al.
Published: (2024)
Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input
by: Li, Chenxu, et al.
Published: (2025)
Look Carefully: Adaptive Visual Reinforcements in Multimodal Large Language Models for Hallucination Mitigation
by: Zhu, Xingyu, et al.
Published: (2026)
MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models
by: Liu, Xin, et al.
Published: (2023)
Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents
by: Luo, Yaxin, et al.
Published: (2025)
Dynamic Multimodal Prototype Learning in Vision-Language Models
by: Zhu, Xingyu, et al.
Published: (2025)
Student-Oriented Teacher Knowledge Refinement for Knowledge Distillation
by: Shen, Chaomin, et al.
Published: (2024)
ERGeoBench: A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models
by: Xue, Kaiwen, et al.
Published: (2026)
CFBenchmark-MM: Chinese Financial Assistant Benchmark for Multimodal Large Language Model
by: Li, Jiangtong, et al.
Published: (2025)
Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants
by: Qin, Lixiong, et al.
Published: (2025)
Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks
by: Luo, Yaxin, et al.
Published: (2026)
A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine
by: Xiao, Hanguang, et al.
Published: (2024)
DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark
by: Hu, Ruofan, et al.
Published: (2026)
Analyzing Fine-Grained Alignment and Enhancing Vision Understanding in Multimodal Language Models
by: Jiang, Jiachen, et al.
Published: (2025)
Distilling Mathematical Reasoning Capabilities into Small Language Models
by: Zhu, Xunyu, et al.
Published: (2024)
Language-Instructed Reasoning for Group Activity Detection via Multimodal Large Language Model
by: Peng, Jihua, et al.
Published: (2025)
Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation
by: Zhu, Xingyu, et al.
Published: (2026)
CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models
by: Luo, Fuwen, et al.
Published: (2024)
Towards Harmless Multimodal Assistants with Blind Preference Optimization
by: Li, Yongqi, et al.
Published: (2025)
LLaSA: Large Language and E-Commerce Shopping Assistant
by: Zhang, Shuo, et al.
Published: (2024)