Bibliographic Details
Main Authors: Nguyen, Huy-Dung, Bairouk, Anass, Maras, Mirjana, Xiao, Wei, Wang, Tsun-Hsuan, Chareyre, Patrick, Hasani, Ramin, Blanchon, Marc, Rus, Daniela
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2409.10095
author Nguyen, Huy-Dung
Bairouk, Anass
Maras, Mirjana
Xiao, Wei
Wang, Tsun-Hsuan
Chareyre, Patrick
Hasani, Ramin
Blanchon, Marc
Rus, Daniela
contents Autonomous driving systems require a comprehensive understanding of the environment, achieved by extracting visual features essential for perception, planning, and control. However, models trained solely on single-task objectives or generic datasets often lack the contextual information needed for robust performance in complex driving scenarios. In this work, we propose a unified encoder trained on multiple computer vision tasks crucial for urban driving, including depth, pose, and 3D scene flow estimation, as well as semantic, instance, panoptic, and motion segmentation. By integrating these diverse visual cues, similar to human perceptual mechanisms, the encoder captures rich features that enhance navigation-related predictions. We evaluate the model on steering estimation as a downstream task, leveraging its dense latent space. To ensure efficient multi-task learning, we introduce a multi-scale feature network for pose estimation and apply knowledge distillation from a multi-backbone teacher model. Our results highlight two key findings: (1) the unified encoder achieves competitive performance across all visual perception tasks, demonstrating strong generalization capabilities; and (2) for steering estimation, the frozen unified encoder, leveraging dense latent representations, outperforms both its fine-tuned counterpart and the same frozen model pretrained on generic datasets like ImageNet. These results underline the significance of task-specific visual features and demonstrate the promise of multi-task learning in advancing autonomous driving systems. More details and the pretrained model are available at https://hi-computervision.github.io/uni-encoder/.
format Preprint
id arxiv_https___arxiv_org_abs_2409_10095
institution arXiv
publishDate 2024
record_format arxiv
title Human Insights Driven Latent Space for Different Driving Perspectives: A Unified Encoder for Efficient Multi-Task Inference
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2409.10095