Staff View: Road Rage Reasoning with Vision-language Models (VLMs): Task Definition and Evaluation Dataset :: Library Catalog

Bibliographic Details
Main Authors:	Weng, Yibing; Gu, Yu; Ren, Fuji
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2503.11342

_version_	1866912274624544768
author	Weng, Yibing; Gu, Yu; Ren, Fuji
author_facet	Weng, Yibing; Gu, Yu; Ren, Fuji
contents	Road rage, triggered by driving-related stimuli such as traffic congestion and aggressive driving, poses a significant threat to road safety. Previous research on road rage regulation has primarily focused on response suppression, lacking proactive prevention capabilities. With the advent of Vision-Language Models (VLMs), it has become possible to reason about trigger events visually and then engage in dialog-based comforting before drivers' anger escalates. To this end, we propose the road rage reasoning task, along with a finely annotated test dataset and evaluation metrics, to assess the capabilities of current mainstream VLMs in scene understanding, event recognition, and road rage reasoning. The results indicate that current VLMs exhibit significant shortcomings in scene understanding within the visual modality, as well as in comprehending the spatial relationships between objects in the textual modality. Improving VLMs' performance in these areas will greatly benefit downstream tasks like antecedent-focused road rage regulation.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_11342
institution	arXiv
publishDate	2025
record_format	arxiv
title	Road Rage Reasoning with Vision-language Models (VLMs): Task Definition and Evaluation Dataset
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2503.11342

Similar Items