TY - JOUR
T1 - A2GSTran: Depth Map Super-resolution via Asymmetric Attention with Guidance Selection
AU - Zuo, Yifan
AU - Xu, Yaping
AU - Zeng, Yifeng
AU - Fang, Yuming
AU - Huang, Xiaoshui
AU - Yan, Jiebin
N1 - Funding information: This work was supported in part by the National Natural Science Foundation of China under Grant 62271237, in part by the Natural Science Foundation of Jiangxi Province under Grants 20224ACB212005, 20224BAB212012, and 20232BAB202001, and in part by the Double Thousand Plan of Jiangxi Province under Grant jxsq2019101076.
PY - 2024/6/1
Y1 - 2024/6/1
N2 - Currently, Convolutional Neural Networks (CNNs) dominate guided depth map super-resolution (SR). However, inefficient receptive field growth and input-independent convolution limit the generalization of CNNs. Motivated by the vision transformer, this paper proposes an efficient transformer-based backbone, A2GSTran, for guided depth map SR, which resolves the above intrinsic defects of CNNs. In addition, state-of-the-art (SOTA) models only refine depth features with guidance that is implicitly selected without supervision, so there is no explicit guarantee of mitigating the artifacts of texture copying and edge blurring. Accordingly, the proposed A2GSTran simultaneously solves two sub-problems, i.e., guided monocular depth estimation and guided depth SR, in separate branches. Specifically, the explicit supervision on monocular depth estimation improves the efficiency of guidance selection. The feature fusion between branches is designed via bi-directional cross attention. Moreover, since the guidance domain is defined in high resolution (HR), we propose asymmetric cross attention, which maintains the guidance information via pixel unshuffle instead of pooling and therefore has a channel number unequal to that of the depth features. Based on the supervisions of depth reconstruction and guidance selection, the final depth features are refined by fusing the output features of the corresponding branches via channel attention to generate the HR depth map. Extensive experimental results on synthetic and real datasets at multiple scales validate our contributions compared with SOTA models. The code and models are publicly available at https://github.com/alex-cate/Depth_Map_Super-resolution_via_Asymmetric_Attention_with_Guidance_Selection
AB - Currently, Convolutional Neural Networks (CNNs) dominate guided depth map super-resolution (SR). However, inefficient receptive field growth and input-independent convolution limit the generalization of CNNs. Motivated by the vision transformer, this paper proposes an efficient transformer-based backbone, A2GSTran, for guided depth map SR, which resolves the above intrinsic defects of CNNs. In addition, state-of-the-art (SOTA) models only refine depth features with guidance that is implicitly selected without supervision, so there is no explicit guarantee of mitigating the artifacts of texture copying and edge blurring. Accordingly, the proposed A2GSTran simultaneously solves two sub-problems, i.e., guided monocular depth estimation and guided depth SR, in separate branches. Specifically, the explicit supervision on monocular depth estimation improves the efficiency of guidance selection. The feature fusion between branches is designed via bi-directional cross attention. Moreover, since the guidance domain is defined in high resolution (HR), we propose asymmetric cross attention, which maintains the guidance information via pixel unshuffle instead of pooling and therefore has a channel number unequal to that of the depth features. Based on the supervisions of depth reconstruction and guidance selection, the final depth features are refined by fusing the output features of the corresponding branches via channel attention to generate the HR depth map. Extensive experimental results on synthetic and real datasets at multiple scales validate our contributions compared with SOTA models. The code and models are publicly available at https://github.com/alex-cate/Depth_Map_Super-resolution_via_Asymmetric_Attention_with_Guidance_Selection
KW - Asymmetric Cross Attention
KW - Convolutional neural networks
KW - Estimation
KW - Feature extraction
KW - Guidance Selection
KW - Guided Depth Map Super-resolution
KW - Image reconstruction
KW - Self-attention
KW - Solid modeling
KW - Superresolution
KW - Transformers
KW - Vision Transformer
UR - http://www.scopus.com/inward/record.url?scp=85181578402&partnerID=8YFLogxK
U2 - 10.1109/tcsvt.2023.3327766
DO - 10.1109/tcsvt.2023.3327766
M3 - Article
SN - 1051-8215
VL - 34
SP - 4668
EP - 4681
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 6
ER -