Zig-Zag Network for Semantic Segmentation of RGB-D Images

IEEE Transactions on Pattern Analysis and Machine Intelligence 2020

Di Lin, Hui Huang*

Shenzhen University


Semantic segmentation of images requires an understanding of the appearances of objects and their spatial relationships in scenes. The fully convolutional network (FCN) has been successfully applied to recognize objects’ appearances, which are represented with RGB channels. Images augmented with depth channels provide a better understanding of the geometric information of the scene. In this paper, we present a multiple-branch neural network that uses depth information to assist in the semantic segmentation of images. Our approach splits the image into layers according to the “scene-scale”. We introduce the context-aware receptive field (CARF), which provides better control of the relevant context information of learned features. Each branch of the network is equipped with a CARF to adaptively aggregate the context information of image regions, leading to a more focused domain that is easier to learn. Furthermore, we propose a new zig-zag architecture to exchange information between the feature maps at different levels, augmented by the CARFs of the backbone network and decoder network. With the flexible information propagation allowed by our zig-zag network, we enrich the context information of feature maps for the segmentation. We show that the zig-zag network achieves state-of-the-art performance on several public datasets.

Fig. 2. Overview of our network. Given a color image, we use a CNN to compute the convolutional feature maps. These are passed to the zig-zag architectures, which gradually recover their resolutions. Each zig-zag architecture has multiple branches. The discrete depth image is layered, where each layer represents a scene-scale and is used to match the image regions to the corresponding network branches. Each branch has a context-aware receptive field (CARF), which produces a context feature map to combine with the feature from an adjacent branch. The predictions of all branches are merged to produce the final segmentation result. Please see Fig. 3 for details of the CARF.
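The depth layering and branch merging described in the caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the equal-width binning in `layer_depth`, the mask-based selection in `merge_branches`, and both function names are assumptions made only for illustration. The page does not specify how the depth is discretized or how the branch predictions are merged.

```python
import numpy as np

def layer_depth(depth, num_layers):
    """Assign each pixel to a depth layer in [0, num_layers).

    Assumption: equal-width bins between the min and max depth.
    """
    edges = np.linspace(depth.min(), depth.max(), num_layers + 1)
    # Using only the interior edges yields indices 0 .. num_layers - 1.
    return np.digitize(depth, edges[1:-1])

def merge_branches(branch_scores, layers):
    """Merge per-branch class scores into one label map.

    branch_scores: (B, C, H, W) class scores, one set per branch.
    layers:        (H, W) integer layer index per pixel, in [0, B).
    Assumption: each pixel takes its scores from the branch that
    matches its depth layer.
    """
    num_branches = branch_scores.shape[0]
    masks = layers[None] == np.arange(num_branches)[:, None, None]  # (B, H, W)
    merged = (branch_scores * masks[:, None]).sum(axis=0)           # (C, H, W)
    return merged.argmax(axis=0)                                    # (H, W)

# Toy example: two depth layers, three classes.
depth = np.array([[0.0, 1.0], [2.0, 3.0]])
layers = layer_depth(depth, num_layers=2)
scores = np.zeros((2, 3, 2, 2))
scores[0, 1] = 1.0  # near branch votes for class 1
scores[1, 2] = 1.0  # far branch votes for class 2
labels = merge_branches(scores, layers)
```

In this toy case the near pixels (top row) take their label from the first branch and the far pixels (bottom row) from the second.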

Fig. 3. Two-stage weighting scheme of CARF: (a) image partitioned into super-pixels of different sizes; (b) each neuron of the convolutional feature map is augmented by local weighting, which uses the information of neurons residing in the same super-pixel; (c) after local weighting, the neurons residing in each super-pixel are augmented; (d) each neuron is further augmented by high-order weighting, which uses the content of adjacent super-pixels, to form the context feature map. The two-stage weighting is applied repeatedly to the images partitioned by super-pixels of diverse sizes. Note that the feature map has a smaller resolution than the image due to the down-sampling of the network.
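The two-stage scheme in the caption can be approximated by a small NumPy sketch. It is a simplification under stated assumptions, not the CARF as published: plain averages stand in for the weighting, the super-pixel label map is assumed to be already down-sampled to the feature-map resolution, adjacency is taken as 4-connected, and all function names are hypothetical.

```python
import numpy as np

def superpixel_means(feat, labels, n):
    """Mean feature per super-pixel. feat: (H, W, C); labels: (H, W) ints in [0, n)."""
    flat, lab = feat.reshape(-1, feat.shape[-1]), labels.ravel()
    sums = np.zeros((n, flat.shape[1]))
    np.add.at(sums, lab, flat)  # accumulate features per super-pixel
    counts = np.bincount(lab, minlength=n)[:, None]
    return sums / np.maximum(counts, 1)

def adjacency(labels, n):
    """Boolean (n, n) matrix, True where two super-pixels share a 4-connected border."""
    adj = np.zeros((n, n), dtype=bool)
    pairs = ((labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :]))
    for a, b in pairs:
        diff = a != b
        adj[a[diff], b[diff]] = True
        adj[b[diff], a[diff]] = True
    return adj

def two_stage_weighting(feat, labels):
    """Assumed stand-in for the two-stage weighting, using unweighted means."""
    n = int(labels.max()) + 1
    # Stage 1 (local): add the mean feature of the neuron's own super-pixel.
    local = feat + superpixel_means(feat, labels, n)[labels]
    # Stage 2 (high-order): add the mean of the adjacent super-pixels' features.
    region = superpixel_means(local, labels, n)
    adj = adjacency(labels, n).astype(float)
    neigh = adj @ region / np.maximum(adj.sum(axis=1, keepdims=True), 1)
    return local + neigh[labels]

# Toy example: a 2x2 feature map with one channel and two super-pixels.
feat = np.array([[[1.0], [3.0]], [[1.0], [3.0]]])
labels = np.array([[0, 1], [0, 1]])
context = two_stage_weighting(feat, labels)
```

Running the function once per super-pixel partition and combining the outputs would mirror the caption's note that the weighting is applied across partitions of diverse sizes; how the outputs are combined is not stated on this page.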

Fig. 8. Sample comparison of state-of-the-art models [30], [27] with ours. Scene images are taken from the NYUDv2 dataset [42].

Fig. 9. Sample comparison of state-of-the-art models [30], [27] with ours. Scene images are taken from the SUN-RGBD dataset [44].


We thank the anonymous reviewers and editors for their constructive suggestions. This work was supported in part by NSFC (61702338), National 973 Program (2015CB352501), Guangdong Science and Technology Program (2015A030312015), Shenzhen Innovation Program (KQJSCX20170727101233642), LHTD (20170003), and the National Engineering Laboratory for Big Data System Computing Technology.

title = {Zig-Zag Network for Semantic Segmentation of RGB-D Images},
author = {Di Lin and Hui Huang},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
volume = {42},
number = {10},
pages = {2642--2655},
year = {2020},
