Bibliographic Details
Main Authors: Zhai, Mingliang; Li, Cheng; Guo, Zengyuan; Yang, Ningrui; Qin, Xiameng; Zhao, Sanyuan; Han, Junyu; Tao, Ji; Wu, Yuwei; Jia, Yunde
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2412.06324
author Zhai, Mingliang
Li, Cheng
Guo, Zengyuan
Yang, Ningrui
Qin, Xiameng
Zhao, Sanyuan
Han, Junyu
Tao, Ji
Wu, Yuwei
Jia, Yunde
contents Multi-modal Large Language Models (MLLMs) with extensive world knowledge have revitalized autonomous driving, particularly in reasoning tasks within perceivable regions. However, when faced with perception-limited areas (dynamic or static occlusion regions), MLLMs struggle to effectively integrate perception ability with world knowledge for reasoning. These perception-limited regions can conceal crucial safety information, especially for vulnerable road users. In this paper, we propose a framework that aims to improve autonomous driving performance under perception-limited conditions by enhancing the integration of perception capabilities and world knowledge. Specifically, we propose a plug-and-play instruction-guided interaction module that bridges modality gaps and significantly reduces the input sequence length, allowing it to adapt effectively to multi-view video inputs. Furthermore, to better integrate world knowledge with driving-related tasks, we have collected and refined a large-scale multi-modal dataset that includes 2 million natural language QA pairs and 1.7 million grounding task samples. To evaluate the model's utilization of world knowledge, we introduce an object-level risk assessment dataset comprising 200K QA pairs, where the questions require multi-step reasoning that leverages world knowledge. Extensive experiments validate the effectiveness of our proposed method.
format Preprint
id arxiv_https___arxiv_org_abs_2412_06324
institution arXiv
publishDate 2024
record_format arxiv
title World knowledge-enhanced Reasoning Using Instruction-guided Interactor in Autonomous Driving
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2412.06324