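Staff View: Human Insights Driven Latent Space for Different Driving Perspectives: A Unified Encoder for Efficient Multi-Task Inference :: Library Catalog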

Bibliographic Details
Main Authors:	Nguyen, Huy-Dung; Bairouk, Anass; Maras, Mirjana; Xiao, Wei; Wang, Tsun-Hsuan; Chareyre, Patrick; Hasani, Ramin; Blanchon, Marc; Rus, Daniela
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2409.10095

_version_	1866908932101898240
author	Nguyen, Huy-Dung; Bairouk, Anass; Maras, Mirjana; Xiao, Wei; Wang, Tsun-Hsuan; Chareyre, Patrick; Hasani, Ramin; Blanchon, Marc; Rus, Daniela
author_facet	Nguyen, Huy-Dung; Bairouk, Anass; Maras, Mirjana; Xiao, Wei; Wang, Tsun-Hsuan; Chareyre, Patrick; Hasani, Ramin; Blanchon, Marc; Rus, Daniela
contents	Autonomous driving systems require a comprehensive understanding of the environment, achieved by extracting visual features essential for perception, planning, and control. However, models trained solely on single-task objectives or generic datasets often lack the contextual information needed for robust performance in complex driving scenarios. In this work, we propose a unified encoder trained on multiple computer vision tasks crucial for urban driving, including depth, pose, and 3D scene flow estimation, as well as semantic, instance, panoptic, and motion segmentation. By integrating these diverse visual cues, similar to human perceptual mechanisms, the encoder captures rich features that enhance navigation-related predictions. We evaluate the model on steering estimation as a downstream task, leveraging its dense latent space. To ensure efficient multi-task learning, we introduce a multi-scale feature network for pose estimation and apply knowledge distillation from a multi-backbone teacher model. Our results highlight two key findings: (1) the unified encoder achieves competitive performance across all visual perception tasks, demonstrating strong generalization capabilities; and (2) for steering estimation, the frozen unified encoder, leveraging dense latent representations, outperforms both its fine-tuned counterpart and the same frozen model pretrained on generic datasets like ImageNet. These results underline the significance of task-specific visual features and demonstrate the promise of multi-task learning in advancing autonomous driving systems. More details and the pretrained model are available at https://hi-computervision.github.io/uni-encoder/.
format	Preprint
id	arxiv_https___arxiv_org_abs_2409_10095
institution	arXiv
publishDate	2024
record_format	arxiv
title	Human Insights Driven Latent Space for Different Driving Perspectives: A Unified Encoder for Efficient Multi-Task Inference
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2409.10095
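
Note on the contents field: the abstract reports that a steering head trained on top of a frozen encoder outperforms the fine-tuned variant. The idea of training only a small head on fixed features can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `frozen_encoder`, `train_steering_head`, and `predict`, the two hand-picked features, and the synthetic 1-D "images" are all hypothetical stand-ins chosen for illustration.

```python
import random


def frozen_encoder(image):
    """Hypothetical stand-in for a pretrained encoder. It is fixed and never updated."""
    half = len(image) // 2
    mean = sum(image) / len(image)
    # Difference between right-half and left-half intensity, normalised by length.
    imbalance = (sum(image[half:]) - sum(image[:half])) / len(image)
    return [mean, imbalance]


def train_steering_head(samples, lr=0.1, epochs=500):
    """Fit a linear head (w . x + b) on frozen features with per-sample SGD on squared error."""
    w, b = [0.0, 0.0], 0.0
    # Features are computed once: only the head's parameters are ever updated.
    feats = [(frozen_encoder(img), angle) for img, angle in samples]
    for _ in range(epochs):
        for x, y in feats:
            pred = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = pred - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def predict(w, b, image):
    """Run the frozen encoder, then apply the trained linear head."""
    x = frozen_encoder(image)
    return sum(wi * xi for wi, xi in zip(w, x)) + b


random.seed(0)
# Synthetic data (an assumption for this sketch): the "steering angle" is
# defined as the left/right intensity imbalance of an 8-pixel image.
samples = []
for _ in range(50):
    img = [random.random() for _ in range(8)]
    angle = (sum(img[4:]) - sum(img[:4])) / 8
    samples.append((img, angle))

w, b = train_steering_head(samples)
```

Calling `predict(w, b, image)` on a new 8-value list then returns the head's steering estimate. In a real setting the encoder would be a pretrained network with its weights excluded from the optimizer, and the head could be any small regressor.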

Similar Items