RGB-D Salient Object Detection Using the Siamese Network

kolli suresh kumarreddy · Published in Analytics Vidhya · Aug 10, 2021 · 11 min read
RGB_image, Ground_Truth, Model_predicted

Introduction:

Salient object detection (SOD) is the task of identifying the object in an image that a human would naturally focus on when viewing it. Many models have been developed for SOD using both RGB and depth images, and some of them are state of the art. These existing models process the RGB image and the depth image separately to extract features and then fuse those features for the final prediction. The fusion can be done in three ways: 1. early fusion, 2. late fusion, and 3. middle fusion, as shown in the figure. Because these models extract features from the RGB and depth images independently, they contain a large number of parameters and require a large amount of data to train.

Image from the Research paper

Here, we implement the novel joint learning and densely cooperative fusion (JL-DCF) architecture explained in this paper. Unlike the existing models, this method extracts features from the RGB image and the depth image simultaneously through a Siamese network used as a shared backbone, and it fuses the features with middle fusion. Since a single CNN extracts features from both inputs, the model has fewer parameters and is better than the existing models in terms of memory and computation. The architecture of the model is shown below.

Image From the Research Paper

The framework consists of two modules, joint learning and densely cooperative fusion, as shown above. The joint learning (JL) component takes the RGB image and the depth image as one batch of shape 320 x 320 x 3 x 2 and extracts their features simultaneously through a Siamese network (a shared CNN). These features are fed to the DCF component through compression modules (CP): each CP module receives input from the Siamese network through a side path, compresses the channels, and passes the result to a cross-modal fusion (CM) module before it is concatenated with the output of an FA module.

We can use different networks such as ResNet-101, ResNet-50, or VGG-16 as the backbone of the JL component. In this project we train different models with both ResNet101 and VGG16 as backbone networks and perform the SOD task on the NJU2K dataset. The data was taken from here.

Model Architecture:

Let’s understand the model architecture in depth. The overall framework can be divided into two parts: joint learning (JL) and densely cooperative fusion (DCF).

Joint Learning:

In the joint learning component any CNN can be used as the shared backbone. As shown in the figure above, we stack the RGB image and its corresponding depth image along the 4th (batch) dimension, giving an input of shape 320x320x3x2, and feed it to the CNN. The hierarchical features from the shared CNN backbone are then leveraged in a side-output manner through side paths, as shown in the figure. These side paths use different filter sizes and channel numbers, so their outputs have different numbers of channels; we therefore pass them through compression modules (CP1∼CP6 in Fig. 3, implemented in practice as convolutional layers plus ReLU non-linearities), which compress the features to a common k channels. The outputs of the CP modules are still batches, and we feed them to the cross-modal fusion (CM) modules of the DCF component. The output of the last compression module, CP6, is split into its two parts, passed through a (1x1, 1) convolution layer, and compared against the down-sampled ground truth to compute a loss. This loss is called the global loss Lg; the backbone network learns from it and is updated through back-propagation.
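To make the batching concrete, here is a minimal TensorFlow sketch (the tensor names and the VGG16 choice below are illustrative, not the exact training code) of how the two modalities are stacked along the batch dimension and pushed through one shared backbone:

```python
import tensorflow as tf

# Illustrative sketch: stack the RGB image and its depth map (replicated to
# 3 channels) along the batch dimension so one shared CNN processes both views.
rgb   = tf.random.uniform((1, 320, 320, 3))      # RGB image
depth = tf.random.uniform((1, 320, 320, 1))      # single-channel depth map
depth = tf.repeat(depth, repeats=3, axis=-1)     # match the RGB channel count

joint_batch = tf.concat([rgb, depth], axis=0)    # shape: (2, 320, 320, 3)

# Any CNN can serve as the shared Siamese backbone; both views share the same weights.
backbone = tf.keras.applications.VGG16(include_top=False, input_shape=(320, 320, 3))
features = backbone(joint_batch)                 # shape: (2, 10, 10, 512)
```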

Densely Cooperative Fusion (DCF):

The outputs of the CP modules contain both the RGB and the depth information in batch form. We feed these outputs into the DCF component through the cross-modal fusion (CM) modules. Each CM module first splits the batch and then performs element-wise addition and element-wise multiplication to carry out the feature fusion, which we call cooperative fusion, as shown in the figure. Mathematically, let a batch feature be denoted by {Xrgb, Xd}, where Xrgb and Xd represent the RGB and depth feature tensors, each with k channels. The CM module conducts the fusion as:

CM({Xrgb, Xd}) = Xrgb ⊕ Xd ⊕ (Xrgb ⊗ Xd),

CM Module image from paper

where “⊕” and “⊗” denote element-wise addition and multiplication. Since the fusion uses only element-wise operations, the blended output still has k channels.
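As a minimal TensorFlow sketch of this fusion (the function name and shapes are illustrative, and the batch is assumed to hold the RGB features first and the depth features second, as in the joint batch above):

```python
import tensorflow as tf

def cm_fuse(batch_features):
    """Cross-modal fusion: CM({Xrgb, Xd}) = Xrgb + Xd + Xrgb * Xd."""
    x_rgb, x_d = tf.split(batch_features, num_or_size_splits=2, axis=0)
    return x_rgb + x_d + x_rgb * x_d   # element-wise ops keep the k channels

# Example: a CP output with batch size 2 (RGB view + depth view), k = 64 channels.
cp_out = tf.random.uniform((2, 20, 20, 64))
fused  = cm_fuse(cp_out)               # shape: (1, 20, 20, 64)
```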

We then feed the outputs of the CM modules to the feature aggregation (FA) modules. The FA modules are densely connected, as shown in the figure above. Inside each FA module we perform non-linear aggregation and transformation with different convolution layers and a max-pooling layer, as shown in the figure. The final FA module, FA1, produces the finest features, which we feed to a (1x1, 1) convolution layer to obtain the final output. This final prediction is supervised by the resized ground-truth (GT) map during training, as shown in the figure. We call the loss generated at this stage the final loss and denote it Lf.

FA Module Image from Paper

Loss Function:

The total loss function of our model is composed of the global guidance loss Lg and the final loss Lf. Assume that G denotes supervision from the ground truth, S^c_rgb and S^c_d denote the coarse prediction maps contained in the batch after module CP6, and S^f is the final prediction after module FA1. The total loss function is defined as:

Ltotal = Lf(S^f, G) + λ · Σ_{x ∈ {rgb, d}} Lg(S^c_x, G),    (2)

where λ balances the emphasis on the global guidance; in this project we gave equal weight, i.e. λ = 1. We adopt the widely used binary cross-entropy loss for both Lg and Lf:

L(S, G) = − Σ_i [ Gi · log(Si) + (1 − Gi) · log(1 − Si) ],    (3)

where i denotes the pixel index and S ∈ {S^c_rgb, S^c_d, S^f}.
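A hedged TensorFlow sketch of Eqs. (2) and (3); the function name and the use of tf.keras.losses.BinaryCrossentropy are my own choices rather than the exact training code:

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()   # Eq. (3), averaged over pixels

def total_loss(g_coarse, g_final, s_rgb, s_d, s_final, lam=1.0):
    """Eq. (2): final loss + lambda * global guidance losses.

    g_coarse: ground truth resized to the CP6 output resolution
    g_final : ground truth resized to the final output resolution
    s_rgb, s_d: coarse predictions for the RGB and depth views (after CP6)
    s_final   : final prediction (after FA1)
    """
    l_f = bce(g_final, s_final)
    l_g = bce(g_coarse, s_rgb) + bce(g_coarse, s_d)
    return l_f + lam * l_g
```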

Performance Metrics:

For evaluation we used five metrics, all calculated from the final output of the FA1 module. The metrics are:

  1. Precision: Precision is the ratio of true positives (TP) to the sum of true positives (TP) and false positives (FP). We calculate it by comparing the binarised output of the FA1 module with the resized binary ground truth.
  2. Recall: Recall is the ratio of true positives (TP) to the sum of true positives (TP) and false negatives (FN). We calculate it by comparing the binarised output of the FA1 module with the resized binary ground truth.
  3. F-Beta Measure: The F-measure is defined as

Fβ = (1 + β²) · Precision · Recall / (β² · Precision + Recall),

where β weights precision against recall. We set β² = 0.3 as suggested in the paper.
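For reference, a small NumPy sketch of these three metrics on binarised maps (assuming pred and gt are 0/1 arrays of the same shape):

```python
import numpy as np

def precision_recall_fbeta(pred, gt, beta2=0.3, eps=1e-8):
    """pred, gt: binary (0/1) saliency map and ground truth of the same shape."""
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    precision = tp / (tp + fp + eps)
    recall    = tp / (tp + fn + eps)
    f_beta    = (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
    return precision, recall, f_beta
```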

4. E-Score: We use the enhanced alignment measure (E-measure) to evaluate the performance of the model; this paper helps to understand the metric better. The measure takes both pixel-level and image-level statistics into consideration and is calculated as follows:

First, calculate the bias matrix ϕ, which measures the distance of each pixel value of a binary map (here the predicted saliency map, FM) from its global mean:

Image from the paper

where A is a matrix with all elements equal to 1. Calculate this bias matrix for both the foreground map (FM) and the ground truth (GT), giving ϕFM and ϕGT. Then calculate the alignment matrix ξFM, a correlation-like matrix between ϕFM and ϕGT:

Image From the Paper

where ∘ denotes the Hadamard product, i.e. the element-wise product. Now calculate the enhanced alignment matrix:

φFM = f(ξFM)

where f is the quadratic form function f(x) = (1/4)(1 + x)².

Finally, calculate the enhanced alignment measure (E-measure):

Image From the Paper

where w and h are the width and height of the foreground map, i.e. the predicted map.
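Putting these steps together, a simplified NumPy sketch of the E-measure (the reference implementation in the E-measure paper additionally treats the all-zero and all-one corner cases separately):

```python
import numpy as np

def e_measure(fm, gt, eps=1e-8):
    """fm, gt: binary (0/1) foreground map and ground truth, same shape (h, w)."""
    fm = fm.astype(np.float64)
    gt = gt.astype(np.float64)
    # Bias matrices: distance of each pixel from the map's global mean.
    phi_fm = fm - fm.mean()
    phi_gt = gt - gt.mean()
    # Alignment matrix: correlation between the two bias matrices (Hadamard products).
    xi = 2.0 * phi_fm * phi_gt / (phi_fm ** 2 + phi_gt ** 2 + eps)
    # Enhanced alignment matrix via the quadratic form f(x) = (1/4)(1 + x)^2.
    phi = 0.25 * (1.0 + xi) ** 2
    # E-measure: mean of the enhanced alignment matrix over all w*h pixels.
    return phi.mean()
```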

5. Mean Absolute Error: MAE is defined as

MAE = (1 / (W × H)) · Σ_(x,y) |Smap(x, y) − G(x, y)|,

where Smap(x, y) and G(x, y) are the saliency value and the ground-truth value at pixel location (x, y), and W and H are the width and height of the map.
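A tiny NumPy sketch, assuming both maps are normalised to [0, 1]:

```python
import numpy as np

def mae(smap, gt):
    """Mean absolute error between the predicted saliency map and the ground truth."""
    return np.mean(np.abs(smap.astype(np.float64) - gt.astype(np.float64)))
```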

Model Implementation:

Let’s implement the model step by step: we build the sub-parts of the model separately and then combine them using the Functional API. Here, we build the model with the VGG16 backbone. The whole model was built and executed in TensorFlow.

  1. VGG16 Backbone: We built VGG16 without the top three dense layers using the Functional API. For computational speed we reduced the input shape, i.e. the image dimensions, to 160x160x3. The code for the construction of VGG16 is given below:
Here is the Gist
Here is the Gist
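In case the embedded gists do not render, here is a minimal sketch of the same idea built on the pre-trained Keras VGG16 (the original gists construct the convolutional blocks manually with the Functional API):

```python
import tensorflow as tf

# VGG16 convolutional base only (no top dense layers), with the reduced
# 160x160x3 input used in this project; ImageNet weights as a starting point.
vgg16 = tf.keras.applications.VGG16(include_top=False,
                                    weights="imagenet",
                                    input_shape=(160, 160, 3))

# Layers that the side paths will tap into (Keras names for conv1_2 ... pool5).
tap_names = ["block1_conv2", "block2_conv2", "block3_conv3",
             "block4_conv3", "block5_conv3", "block5_pool"]
taps = [vgg16.get_layer(name).output for name in tap_names]
```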

2. Side Paths: After the backbone we build the side paths. These side paths are simply convolution layers with different parameters, as shown in the figure below. The parameters in the brackets, from left to right, are: kernel size, channel number, stride, dilation rate, and padding.

Image from the Paper

These are the side-path parameters recommended in the paper, and we built the side paths accordingly as a class object. The code is given below:

Here is the Gist
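If the gist does not render, here is a hedged sketch of the idea as a small Keras layer. The numbers used to instantiate it below are placeholders; the real kernel size, channel number, stride, dilation rate, and padding for each of the six side paths come from the table above:

```python
import tensorflow as tf
from tensorflow.keras import layers

class SidePath(layers.Layer):
    """One side path: a convolution whose hyper-parameters differ per level."""

    def __init__(self, kernel_size, filters, stride, dilation, padding, **kwargs):
        super().__init__(**kwargs)
        self.conv = layers.Conv2D(filters, kernel_size,
                                  strides=stride,
                                  dilation_rate=dilation,
                                  padding=padding,
                                  activation="relu")

    def call(self, inputs):
        return self.conv(inputs)

# Placeholder instantiation; replace the numbers with the values from the paper's table.
side_path1 = SidePath(kernel_size=3, filters=128, stride=1, dilation=1, padding="same")
```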

3. Compression Module (CP): The CP modules are simple convolution layers with a kernel size of 3x3 and 64 filters, as suggested in the paper.
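Together with the ReLU mentioned in the joint-learning section, this is essentially a one-liner in Keras (the helper name below is illustrative):

```python
from tensorflow.keras import layers

def cp_module(x):
    """Compression module: 3x3 convolution + ReLU down to k = 64 channels."""
    return layers.Conv2D(64, 3, padding="same", activation="relu")(x)
```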

4. Cross-modal Fusion Module (CM): The CM module was built as shown in the CM module figure above. Here we first split the output of the CP module and then perform element-wise addition and multiplication. The code is given below:

Here is the Gist

5. Feature Aggregation Module(FA): The FA module was built as explained in the FA figure given above. Here is the code:

Here is the Gist
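As a fallback if the gist does not render, here is a simplified sketch of the aggregation idea; the exact sequence of convolutions and the max-pooling branch should follow the FA figure above, whereas this version just concatenates the incoming features and applies two 3x3 conv + ReLU layers:

```python
from tensorflow.keras import layers

def fa_module(inputs, k=64):
    """Simplified feature aggregation: concatenate the densely connected inputs
    (CM output + up-sampled outputs of previous FA modules) and transform them
    non-linearly back to k channels."""
    x = layers.Concatenate()(inputs)
    x = layers.Conv2D(k, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(k, 3, padding="same", activation="relu")(x)
```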

Up to now, we have built all the required sub-modules of the model. Now let’s define the performance metrics. We calculated the metrics using callbacks in model.fit; the code for the callbacks is given below:

E-Score:

Here is the Gist

All the other metrics:

Here is the Gist
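If the gists do not render, here is a hedged skeleton of such a callback. It assumes the e_measure and mae helpers sketched in the metrics section, uses illustrative attribute names, and shows only two metrics for brevity:

```python
import numpy as np
import tensorflow as tf

class SaliencyMetrics(tf.keras.callbacks.Callback):
    """Compute E-score and MAE on a held-out set at the end of every epoch."""

    def __init__(self, val_inputs, val_gt, threshold=0.5):
        super().__init__()
        self.val_inputs = val_inputs    # [rgb_batch, depth_batch]
        self.val_gt = val_gt            # ground-truth maps in [0, 1]
        self.threshold = threshold

    def on_epoch_end(self, epoch, logs=None):
        # The final saliency map is the last model output (after FA1).
        preds = self.model.predict(self.val_inputs)[-1]
        binary = (preds > self.threshold).astype(np.float32)
        e_scores = [e_measure(b.squeeze(), g.squeeze()) for b, g in zip(binary, self.val_gt)]
        maes     = [mae(p.squeeze(), g.squeeze()) for p, g in zip(preds, self.val_gt)]
        print(f"epoch {epoch + 1}: E-score={np.mean(e_scores):.4f}  MAE={np.mean(maes):.4f}")
```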

We have now created all the required sub-modules and performance metrics, so let’s build the final model.

Final Model:

Here, we connect all the sub-modules and build the final model. We need to take both the RGB image and the depth image as inputs, form a batch of 2 from them, and produce three outputs: two from the joint learning component and one final output. For this we use a multiple-input, multiple-output Keras model. The code for the input layers is given below.

Here is the Gist
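A minimal sketch of those input layers (concatenating along the batch axis needs a Lambda layer, since the standard Concatenate layer only joins feature axes; names are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers

rgb_input   = layers.Input(shape=(160, 160, 3), name="rgb_input")
depth_input = layers.Input(shape=(160, 160, 3), name="depth_input")  # depth replicated to 3 channels

# Stack the two views along the batch axis so the shared backbone sees a batch of 2 per sample.
joint_batch = layers.Lambda(lambda t: tf.concat(t, axis=0),
                            name="joint_batch")([rgb_input, depth_input])
```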

Now, we connect the side paths to the VGG16 layers as follows:

side_path1 to conv1_2

side_path2 to conv2_2

side_path3 to conv3_3

side_path4 to conv4_3

side_path5 to conv5_3

side_path6 to pool5 of VGG16 network.

We connect these side paths to the CP modules, and the outputs of the CP modules are fed to the CM modules. To concatenate the outputs of the CM modules with the FA module outputs, we have to up-sample the FA outputs at various rates so that they match the spatial dimensions of the CM module outputs. The code for the final model is given below:

Here is the Gist
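If the gist does not render, here is a condensed, self-contained sketch of the VGG16 variant in TF 2.x. The cp/cm/fa helpers below are simplified stand-ins for the classes in the gists, the side paths are folded into the CP convolutions for brevity, and the coarse outputs are compared against a down-sampled ground truth during training:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

K = 64  # channel number after compression

def cp(x):                               # compression module (stands in for side path + CP)
    return layers.Conv2D(K, 3, padding="same", activation="relu")(x)

def cm(x):                               # cross-modal fusion: split batch, add + multiply
    x_rgb, x_d = tf.split(x, 2, axis=0)
    return x_rgb + x_d + x_rgb * x_d

def fa(inputs):                          # simplified feature aggregation
    x = layers.Concatenate()(inputs)
    x = layers.Conv2D(K, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(K, 3, padding="same", activation="relu")(x)

rgb_in   = layers.Input(shape=(160, 160, 3), name="rgb")
depth_in = layers.Input(shape=(160, 160, 3), name="depth")
joint    = layers.Lambda(lambda t: tf.concat(t, axis=0))([rgb_in, depth_in])

vgg = tf.keras.applications.VGG16(include_top=False, input_shape=(160, 160, 3))
tap_names = ["block1_conv2", "block2_conv2", "block3_conv3",
             "block4_conv3", "block5_conv3", "block5_pool"]
feature_extractor = Model(vgg.input, [vgg.get_layer(n).output for n in tap_names])
feats = feature_extractor(joint)         # 6 levels: 160, 80, 40, 20, 10, 5

cp_feats = [cp(f) for f in feats]
cm_feats = [cm(f) for f in cp_feats]

# Coarse (global) predictions from CP6, one per modality, supervised by Lg.
c_rgb, c_d = tf.split(cp_feats[-1], 2, axis=0)
coarse_rgb = layers.Conv2D(1, 1, activation="sigmoid", name="coarse_rgb")(c_rgb)
coarse_d   = layers.Conv2D(1, 1, activation="sigmoid", name="coarse_d")(c_d)

# Top-down decoder: up-sample and aggregate, coarsest (5x5) to finest (160x160).
x = cm_feats[-1]
for f in reversed(cm_feats[:-1]):
    x = layers.UpSampling2D(size=2, interpolation="bilinear")(x)
    x = fa([x, f])
final = layers.Conv2D(1, 1, activation="sigmoid", name="final")(x)

model = Model([rgb_in, depth_in], [coarse_rgb, coarse_d, final])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy", loss_weights=[1.0, 1.0, 1.0])
```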

Now we compile this model with the Adam optimizer and a learning rate of 0.0001. We tried different learning rates, but 0.0001 gave the best convergence.

In this way we built 4 different models:

  1. JL_DCF with VGG16 and basic Parameters. Here is the Code.
  2. JL_DCF with VGG16 and additional conv layers. Here is the Code.
  3. JL_DCF with ResNet101 and basic Parameters. Here is the Code.
  4. JL_DCF with ResNet101 and additional Conv layers. Here is the Code.

In the models with additional conv layers, we added extra conv layers on top of the existing layers in the FA modules and at the outputs of CP6 and FA1.

For the models with the ResNet101 backbone we only change the backbone network and keep the remaining setup unchanged. As the first convolutional layer of ResNet101 already has a stride of 2, the features from the shallowest level have a spatial size of 80x80 in our case. To obtain full-size (160x160) features without trivial up-sampling, we borrow the conv1_1 and conv1_2 layers from VGG-16 for feature extraction. Side path1∼path6 are then connected to conv1_2 (borrowed from VGG-16) and to conv1, res2c, res3b3, res4b22, and res5c of ResNet101, respectively. We also change the stride of the res5a block from 2 to 1. The full code for these models is linked above.

Post Training Analysis:

After training the four models for 40 epochs, we conducted a post-training analysis on all of them to pick the best one. For this we used the predicted images and their E-scores as references, and based on them we selected JL_DCF with the VGG16 backbone and additional conv layers as our best model. Here is the full code of the post-training analysis notebook.

In the post-training analysis we also checked which types of images the model handles worst, labelling images with an E-score below 0.50 as worst cases. Out of 397 test samples the model produced 14 such cases, i.e. 3.52% of the total. We concluded that the model performs very poorly on images with multiple objects, low lighting, blur, or thin and deep objects. For those images the model detects a different object instead of the object masked in the ground-truth image, as shown below:

Images From the Model Predictions

Building the Final Detection System:

Up to now, we have built different models with different parameters and trained them on the NJU2K dataset. Based on their performance we picked JL_DCF with VGG16 and additional conv layers as our best model. Here, we build the final detection system, in which the model takes the RGB image and the depth image as inputs and detects the salient objects in the image. Here is the final notebook code, which we deployed. The demo video of the deployment is given below.

Future Work:

We trained all the models only on the NJU2K dataset, which contains 1,985 images; training on other RGB-D datasets in addition to NJU2K may improve performance. We can also improve performance with data augmentations such as sharpening and rotation, or by trying the JL_DCF model with other backbones such as VGG19, ResNet152, and DenseNet, together with additional conv layers.

Github:

You can see my Code here

Linkedin Profile:

linkedin.com/in/suresh-kumarreddy-573b25a1

References:

https://www.appliedaicourse.com

Siamese Network for RGB-D Salient Object Detection and Beyond

https://openaccess.thecvf.com/content_ICCV_2017/papers/Fan_Structure-Measure_A_New_ICCV_2017_paper.pdf

http://dpfan.net/wp-content/uploads/IJCV2021_Smeasure-Minor.pdf

https://www.ijcai.org/proceedings/2018/0097.pdf

https://onlinehelp.explorance.com/blueml/Content/articles/getstarted/mlcalculations.htm?TocPath=Get%20started%7C_____3

https://towardsdatascience.com/one-shot-learning-with-siamese-networks-using-keras-17f34e75bb3d

https://towardsdatascience.com/a-friendly-introduction-to-siamese-networks-85ab17522942
