Staff View: SDformer: Efficient End-to-End Transformer for Depth Completion :: Library Catalog

Bibliographic Details
Main Authors:	Qian, Jian; Sun, Miao; Lee, Ashley; Li, Jie; Zhuo, Shenglong; Chiang, Patrick Yin
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2409.08159

_version_	1866917774786297856
author	Qian, Jian; Sun, Miao; Lee, Ashley; Li, Jie; Zhuo, Shenglong; Chiang, Patrick Yin
author_facet	Qian, Jian; Sun, Miao; Lee, Ashley; Li, Jie; Zhuo, Shenglong; Chiang, Patrick Yin
contents	Depth completion aims to predict dense depth maps from sparse depth measurements taken by a depth sensor. Currently, Convolutional Neural Network (CNN) based models are the most popular methods applied to depth completion tasks. However, despite their excellent high-end performance, they suffer from a limited representation area. To overcome the drawbacks of CNNs, a more effective and powerful method has been presented: the Transformer, a sequence-to-sequence model built on adaptive self-attention. However, the computational cost of the standard Transformer's key-query dot product grows quadratically with input resolution, which makes it ill-suited to depth completion tasks. In this work, we propose a different window-based Transformer architecture for depth completion tasks named Sparse-to-Dense Transformer (SDformer). The network consists of an input module that extracts and concatenates depth map and RGB image features, a U-shaped encoder-decoder Transformer for extracting deep features, and a refinement module. Specifically, we first concatenate the depth map features with the RGB image features through the input module. Then, instead of calculating self-attention over the whole feature maps, we apply different window sizes to extract the long-range depth dependencies. Finally, we refine the predicted features from the input module and the U-shaped encoder-decoder Transformer module to obtain enriched depth features, and employ a convolution layer to obtain the dense depth map. In practice, the SDformer achieves state-of-the-art results against CNN-based depth completion models, with a lower computational load and fewer parameters, on the NYU Depth V2 and KITTI DC datasets.
format	Preprint
id	arxiv_https___arxiv_org_abs_2409_08159
institution	arXiv
publishDate	2024
record_format	arxiv
title	SDformer: Efficient End-to-End Transformer for Depth Completion
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2409.08159

Similar Items