Bibliographic Details
Main Authors: Qian, Jian; Sun, Miao; Lee, Ashley; Li, Jie; Zhuo, Shenglong; Chiang, Patrick Yin
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2409.08159
author Qian, Jian
Sun, Miao
Lee, Ashley
Li, Jie
Zhuo, Shenglong
Chiang, Patrick Yin
contents Depth completion aims to predict dense depth maps from sparse depth measurements produced by a depth sensor. Currently, Convolutional Neural Network (CNN) based models are the most popular methods for depth completion tasks. However, despite their excellent performance, they suffer from a limited representation area. To overcome this drawback of CNNs, a more effective and powerful method has been presented: the Transformer, a sequence-to-sequence model built on adaptive self-attention. However, the computational cost of the standard Transformer's key-query dot product grows quadratically with input resolution, which makes it poorly suited to depth completion tasks. In this work, we propose a different window-based Transformer architecture for depth completion tasks named Sparse-to-Dense Transformer (SDformer). The network consists of an input module that extracts and concatenates depth map and RGB image features, a U-shaped encoder-decoder Transformer for extracting deep features, and a refinement module. Specifically, we first concatenate the depth map features with the RGB image features through the input module. Then, instead of calculating self-attention over the whole feature maps, we apply different window sizes to extract the long-range depth dependencies. Finally, we refine the predicted features from the input module and the U-shaped encoder-decoder Transformer module to obtain enriched depth features and employ a convolution layer to obtain the dense depth map. In practice, the SDformer obtains state-of-the-art results against the CNN-based depth completion models with lower computing loads and fewer parameters on the NYU Depth V2 and KITTI DC datasets.
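The abstract contrasts self-attention over the whole feature map with self-attention inside windows. The following is a minimal NumPy sketch of that general idea, not the SDformer implementation: the function names, the 16 x 16 x 8 feature map, the window size of 4, and the use of raw features as queries, keys, and values (no learned projections) are all assumptions made for illustration.

```python
import numpy as np


def window_partition(x, ws):
    """Split an (H, W, C) feature map into non-overlapping ws x ws windows.

    Returns an array of shape (num_windows, ws * ws, C).
    """
    h, w, c = x.shape
    assert h % ws == 0 and w % ws == 0, "H and W must be divisible by ws"
    x = x.reshape(h // ws, ws, w // ws, ws, c)
    x = x.transpose(0, 2, 1, 3, 4)  # bring the two window-grid axes together
    return x.reshape(-1, ws * ws, c)


def window_self_attention(x, ws):
    """Scaled dot-product self-attention computed independently per window.

    Assumption: queries, keys and values are the raw features. A trained
    model would use learned projections instead.
    """
    windows = window_partition(x, ws)                        # (N, T, C)
    scale = 1.0 / np.sqrt(windows.shape[-1])
    scores = windows @ windows.transpose(0, 2, 1) * scale    # (N, T, T)
    scores -= scores.max(axis=-1, keepdims=True)             # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)           # softmax over keys
    return weights @ windows                                 # (N, T, C)


def attention_pairs(h, w, ws=None):
    """Count query-key pairs: global if ws is None, otherwise windowed."""
    tokens = h * w
    return tokens * tokens if ws is None else tokens * ws * ws


# Hypothetical feature map, used only to exercise the functions above.
feat = np.random.default_rng(0).normal(size=(16, 16, 8))
out = window_self_attention(feat, ws=4)
print(out.shape)                       # (16, 16, 8): 16 windows of 16 tokens
print(attention_pairs(16, 16))         # 65536
print(attention_pairs(16, 16, ws=4))   # 4096
```

For a fixed window size, the number of query-key pairs grows linearly with the number of pixels rather than quadratically, which is the cost argument the abstract makes against the standard Transformer. How SDformer varies window sizes across stages is not specified in this record, so the sketch uses a single size.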
format Preprint
id arxiv_https___arxiv_org_abs_2409_08159
institution arXiv
publishDate 2024
record_format arxiv
title SDformer: Efficient End-to-End Transformer for Depth Completion
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2409.08159