Bibliographic Details
Main Authors: Spencer, Jaime; Russell, Chris; Hadfield, Simon; Bowden, Richard
Format: Preprint
Published: 2024
Online Access: https://arxiv.org/abs/2403.01569
author Spencer, Jaime
Russell, Chris
Hadfield, Simon
Bowden, Richard
contents Self-supervised learning is the key to unlocking generic computer vision systems. By eliminating the reliance on ground-truth annotations, it allows scaling to much larger data quantities. Unfortunately, self-supervised monocular depth estimation (SS-MDE) has been limited by the absence of diverse training data. Existing datasets have focused exclusively on urban driving in densely populated cities, resulting in models that fail to generalize beyond this domain. To address these limitations, this paper proposes two novel datasets: SlowTV and CribsTV. These are large-scale datasets curated from publicly available YouTube videos, containing a total of 2M training frames. They offer an incredibly diverse set of environments, ranging from snowy forests to coastal roads, luxury mansions and even underwater coral reefs. We leverage these datasets to tackle the challenging task of zero-shot generalization, outperforming every existing SS-MDE approach and even some state-of-the-art supervised methods. The generalization capabilities of our models are further enhanced by a range of components and contributions: 1) learning the camera intrinsics, 2) a stronger augmentation regime targeting aspect ratio changes, 3) support frame randomization, 4) flexible motion estimation, 5) a modern transformer-based architecture. We demonstrate the effectiveness of each component in extensive ablation experiments. To facilitate the development of future research, we make the datasets, code and pretrained models available to the public at https://github.com/jspenmar/slowtv_monodepth.
format Preprint
id arxiv_https___arxiv_org_abs_2403_01569
institution arXiv
publishDate 2024
record_format arxiv
title Kick Back & Relax++: Scaling Beyond Ground-Truth Depth with SlowTV & CribsTV
topic Computer Vision and Pattern Recognition
Artificial Intelligence
Robotics
url https://arxiv.org/abs/2403.01569