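Staff View: TB-Bench: Training and Testing Multi-Modal AI for Understanding Spatio-Temporal Traffic Behaviors from Dashcam Images/Videos :: Library Catalog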

Bibliographic Details
Main Authors:	Charoenpitaks, Korawat; Nguyen, Van-Quang; Suganuma, Masanori; Arai, Kentaro; Totsuka, Seiji; Ino, Hiroshi; Okatani, Takayuki
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2501.05733

_version_	1866913643556241408
author	Charoenpitaks, Korawat; Nguyen, Van-Quang; Suganuma, Masanori; Arai, Kentaro; Totsuka, Seiji; Ino, Hiroshi; Okatani, Takayuki
author_facet	Charoenpitaks, Korawat; Nguyen, Van-Quang; Suganuma, Masanori; Arai, Kentaro; Totsuka, Seiji; Ino, Hiroshi; Okatani, Takayuki
contents	The application of Multi-modal Large Language Models (MLLMs) in Autonomous Driving (AD) faces significant challenges due to their limited training on traffic-specific data and the absence of dedicated benchmarks for spatiotemporal understanding. This study addresses these issues by proposing TB-Bench, a comprehensive benchmark designed to evaluate MLLMs on understanding traffic behaviors across eight perception tasks from ego-centric views. We also introduce vision-language instruction tuning datasets, TB-100k and TB-250k, along with simple yet effective baselines for the tasks. Through extensive experiments, we show that existing MLLMs underperform in these tasks, with even a powerful model like GPT-4o achieving less than 35% accuracy on average. In contrast, when fine-tuned with TB-100k or TB-250k, our baseline models achieve average accuracy up to 85%, significantly enhancing performance on the tasks. Additionally, we demonstrate performance transfer by co-training TB-100k with another traffic dataset, leading to improved performance on the latter. Overall, this study represents a step forward by introducing a comprehensive benchmark, high-quality datasets, and baselines, thus supporting the gradual integration of MLLMs into the perception, prediction, and planning stages of AD.
format	Preprint
id	arxiv_https___arxiv_org_abs_2501_05733
institution	arXiv
publishDate	2025
record_format	arxiv
title	TB-Bench: Training and Testing Multi-Modal AI for Understanding Spatio-Temporal Traffic Behaviors from Dashcam Images/Videos
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2501.05733