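Staff View: WaymoQA: A Multi-View Visual Question Answering Dataset for Safety-Critical Reasoning in Autonomous Driving :: Library Catalog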

Bibliographic Details
Main Authors:	Yu, Seungjun; Lee, Seonho; Kim, Namho; Shin, Jaeyo; Park, Junsung; Ryu, Wonjeong; Jung, Raehyuk; Shim, Hyunjung
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2511.20022

_version_	1866918331712274432
author	Yu, Seungjun; Lee, Seonho; Kim, Namho; Shin, Jaeyo; Park, Junsung; Ryu, Wonjeong; Jung, Raehyuk; Shim, Hyunjung
author_facet	Yu, Seungjun; Lee, Seonho; Kim, Namho; Shin, Jaeyo; Park, Junsung; Ryu, Wonjeong; Jung, Raehyuk; Shim, Hyunjung
contents	Recent advancements in multimodal large language models (MLLMs) have shown strong understanding of driving scenes, drawing interest in their application to autonomous driving. However, high-level reasoning in safety-critical scenarios, where avoiding one traffic risk can create another, remains a major challenge. Such reasoning is often infeasible with only a single front view and requires a comprehensive view of the environment, which we achieve through multi-view inputs. We define Safety-Critical Reasoning as a new task that leverages multi-view inputs to address this challenge. We then distill Safety-Critical Reasoning into two stages: first resolve the immediate risk, then mitigate the decision-induced downstream risks. To support this, we introduce WaymoQA, a dataset of 35,000 human-annotated question-answer pairs covering complex, high-risk driving scenarios. The dataset includes multiple-choice and open-ended formats across both image and video modalities. Experiments reveal that existing MLLMs underperform in safety-critical scenarios compared to normal scenes, but fine-tuning with WaymoQA significantly improves their reasoning ability, highlighting the effectiveness of our dataset in developing safer and more reasoning-capable driving agents. Our code and data are available at https://github.com/sjyu001/WaymoQA
format	Preprint
id	arxiv_https___arxiv_org_abs_2511_20022
institution	arXiv
publishDate	2025
record_format	arxiv
title	WaymoQA: A Multi-View Visual Question Answering Dataset for Safety-Critical Reasoning in Autonomous Driving
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2511.20022

Similar Items