
Di Lin1 Ruimao Zhang2 Yuanfeng Ji1 Ping Li3 Hui Huang1*
1Shenzhen University 2Chinese University of Hong Kong 3Macau University
Fig. 1. Correlation observed between depth and object co-existence: near regions, e.g., those highlighted in blue rectangles, commonly have relatively simple object co-existences, while different objects are very likely to coexist densely in far regions, e.g., those highlighted in red rectangles.
Fig. 2. Overview of our switchable context network (SCN). Given an RGB image, we produce the convolutional feature maps layer by layer in a resolution-descending order. Our SCN first produces the local structural feature maps, which are used to compute the context representations via top-down switchable information propagation. The context representations are combined with the convolutional features to form the intermediate feature maps, which are used for the final semantic segmentation.
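For readers who prefer code, the sketch below traces that data flow in PyTorch. It is a minimal illustration under our own assumptions: the backbone choice, the channel widths, and the stand-in modules for the local structural features and the switchable propagation are hypothetical placeholders, not the authors' released implementation.

```python
# Minimal sketch of the SCN data flow in Fig. 2 (hypothetical modules and
# channel widths; the switchable propagation itself is sketched after Fig. 3).
import torch
import torch.nn as nn
import torchvision.models as models

class SCNSketch(nn.Module):
    def __init__(self, num_classes=40):  # e.g., the 40-class NYUDv2 setting
        super().__init__()
        resnet = models.resnet50(weights=None)
        # Backbone: convolutional feature maps in resolution-descending order.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.local_structure = nn.Conv2d(2048, 512, 3, padding=1)  # stand-in
        self.context = nn.Conv2d(512, 512, 3, padding=1)           # stand-in
        self.classifier = nn.Conv2d(2048 + 512, num_classes, 1)

    def forward(self, rgb):
        feats = self.backbone(rgb)           # convolutional feature maps
        local = self.local_structure(feats)  # local structural feature maps
        context = self.context(local)        # context representations (stand-in
                                             # for switchable propagation)
        fused = torch.cat([feats, context], 1)  # intermediate feature maps
        return self.classifier(fused)           # per-pixel class scores

logits = SCNSketch()(torch.randn(1, 3, 480, 640))  # -> (1, 40, 15, 20)
```

In the paper the context representations are additionally upsampled before fusion (see the note at the end of the Fig. 3 caption); the sketch keeps everything at the backbone resolution for brevity.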
Abstract
Fig. 3. The construction of our contextual representation undergoes two information propagations: (a) local structural information propagation. In this stage, each region (a color node of the regular grid in the intermediate feature maps) receives information from the regions located in the same super-pixel. The regions (the enlarged nodes of the regular grid) with richer information constitute the local structural feature maps; (b) top-down switchable information propagation. We compute the average depth value for each super-pixel. In the last column, the super-pixels highlighted in red and blue contain the regions whose information is output by the compression and expansion architectures, respectively. Each region (a color node of the regular grid in the local structural feature maps) receives information from the regions located in the adjacent super-pixels, forming a region (the highlighted red node in the context representations) with accurate contextual information. For illustration, the context representations are shown at the same size as the local structural feature maps; in fact, the context representations have a higher resolution than the local structural feature maps.
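The two stages can be made concrete with a small sketch. The following is our illustrative reading of the caption, not the paper's implementation: stage (a) mean-pools features within each super-pixel, and stage (b) routes each super-pixel through a compression or an expansion branch depending on its average depth. The pooling choice, the fixed threshold, and the branch callables are all assumptions.

```python
# Illustrative sketch of the two propagations in Fig. 3 (assumptions: mean
# pooling within super-pixels; a fixed depth threshold selects the branch).
import torch

def local_structural_propagation(feats, superpixels):
    """(a) Each region receives information from regions in its super-pixel.

    feats: (C, H, W) feature map; superpixels: (H, W) integer label map.
    """
    out = feats.clone()
    for sp in superpixels.unique():
        mask = superpixels == sp
        # Replace every region in the super-pixel by the super-pixel mean.
        out[:, mask] = feats[:, mask].mean(dim=1, keepdim=True)
    return out

def switchable_propagation(feats, superpixels, depth, compress, expand,
                           thresh=0.5):
    """(b) Choose compression/expansion per super-pixel by average depth.

    Following the red (far) / blue (near) pairing in the caption, far
    super-pixels take the compression branch and near ones the expansion
    branch; `compress` and `expand` map (C, H, W) -> (C, H, W).
    """
    out = torch.empty_like(feats)
    for sp in superpixels.unique():
        mask = superpixels == sp
        branch = compress if depth[mask].mean() > thresh else expand
        out[:, mask] = branch(feats)[:, mask]
    return out
```

The sketch only shows the depth-gated switch; in the actual network each region additionally gathers information from adjacent super-pixels, and the result is produced at a higher resolution than the local structural feature maps.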
Fig. 5. A sample comparison between the state-of-the-art model [35] and our SCN. The images are selected from the NYUDv2 [38] dataset.
Fig. 6. A sample comparison between the state-of-the-art model [35] and our SCN. The images are selected from the SUN-RGBD [21] dataset.
Acknowledgements
We thank the reviewers and editors for their valuable comments. This work was supported in part by NSFC (61702338, 61522213, 61761146002, 61861130365), the National 973 Program (2015CB352501), the Guangdong Science and Technology Program (2015A030312015), the Shenzhen Innovation Program (KQJSCX20170727101233642), and the Macau Science and Technology Development Fund under Grant 0027/2018/A1.
Bibtex
@article{SCN20,
title = {SCN: Switchable Context Network for Semantic Segmentation of RGB-D Images},
author = {Di Lin and Ruimao Zhang and Yuanfeng Ji and Ping Li and Hui Huang},
journal = {IEEE Transactions on Cybernetics},
volume = {50},
number = {3},
pages = {1120--1131},
year = {2020},
}