Image Saliency Prediction using ShallowNet & DeepNet

Kelam Goutam
6 min read · Jul 25, 2019

In my previous article I wrote about what image saliency is, what its applications are, and gave a brief background on the evolution of image saliency prediction. In this article I will delve into the first end-to-end, data-driven CNN architectures for saliency prediction: the two networks proposed by Junting Pan et al. in their paper “Shallow and Deep Convolutional Networks for Saliency Prediction”.

“A humble knowledge of thy self is a surer way to God than a deep search after learning.” — Thomas à Kempis

Junting Pan et al. chose to address saliency prediction as a regression problem rather than a classification problem. Classification requires mutually exclusive class labels: a pixel predicted as salient cannot simultaneously be non-salient. Regression instead assigns each pixel a continuous saliency score that measures how strongly the pixel attracts human attention.
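To make the regression framing concrete, here is a minimal PyTorch sketch (my own illustration, not code from the paper) of a pixel-wise L2 regression loss between a predicted saliency map and the ground truth:

```python
import torch
import torch.nn as nn

pred = torch.rand(1, 1, 48, 48)    # predicted per-pixel saliency scores
target = torch.rand(1, 1, 48, 48)  # ground-truth saliency map

# Regression view: every pixel carries a continuous saliency score, so the
# training error is a distance between two maps, not a per-pixel class label.
loss = nn.MSELoss()(pred, target)
print(loss.item())
```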

The Shallow Convnet:-

The Shallow Convnet is a lightweight CNN with a relatively small number of parameters, trained from scratch to predict saliency. The model is trained on the SALICON dataset, which contains 10,000 training images along with their corresponding saliency maps.

  • ShallowNet Architecture:

The ShallowNet consists of five weight layers: the first three are convolutional layers and the last two are fully connected layers. Each convolutional layer is followed by a rectified linear unit (ReLU) and a maxpool unit. The ReLU applies an element-wise non-linearity, while the maxpool progressively reduces the spatial size of the feature maps while retaining important features such as edges.

Image Courtesy: Junting Pan et al.

The ShallowNet architecture has about 64 million learnable parameters. The input to the ShallowNet is an image of size 96×96. After passing through the three convolutional layers and three maxpool layers, the feature maps shrink to 10×10. The features are then flattened and passed through a fully connected layer that outputs 4,608 activations. These 4,608 activations are split into two slices of 2,304 activations each, and Maxout is applied across them: for each position, the maximum of the corresponding pair of activations is kept, which helps reduce overfitting. After the Maxout layer, the remaining 2,304 activations pass through another fully connected layer that outputs the same number of activations. The final 2,304 activations are reshaped into a saliency map of size 48×48.
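Below is a minimal PyTorch sketch of this architecture. The kernel sizes and filter counts (5×5/3×3/3×3 with 32/64/128 channels) are my assumptions, chosen so that the stated shapes work out: 96×96 shrinks to 10×10 after three conv + maxpool stages, and the two fully connected layers account for most of the roughly 64 million parameters.

```python
import torch
import torch.nn as nn

class ShallowNet(nn.Module):
    """A sketch of the ShallowNet; channel counts and kernel sizes
    are assumptions, not the authors' published configuration."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5),    # 96 -> 92
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                    # 92 -> 46
            nn.Conv2d(32, 64, kernel_size=3),   # 46 -> 44
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                    # 44 -> 22
            nn.Conv2d(64, 128, kernel_size=3),  # 22 -> 20
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                    # 20 -> 10
        )
        self.fc1 = nn.Linear(128 * 10 * 10, 4608)  # two maxout slices of 2304
        self.fc2 = nn.Linear(2304, 2304)

    def forward(self, x):
        x = self.features(x).flatten(1)
        a, b = self.fc1(x).chunk(2, dim=1)  # split into two 2304-d slices
        x = torch.max(a, b)                 # Maxout: element-wise maximum
        x = self.fc2(x)
        return x.view(-1, 1, 48, 48)        # reshape 2304 -> 48x48 map
```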

  • Training the ShallowNet:

The ShallowNet model is trained for 1,200 epochs. The initial learning rate is set to 3E-2 and is gradually decreased to 1E-4. In every layer, the weights are initialized from a Gaussian distribution with mean zero and standard deviation 0.01, and the biases are initialized to a constant value of 0.1. The momentum term and the weight decay are set to 0.9 and 5E-4 respectively. Stochastic gradient descent (SGD) with Nesterov momentum is used for training; Nesterov momentum aids faster convergence. Data augmentation consists of random horizontal flipping and normalizing pixel values to the [0, 1] range, applied to both the images and their saliency maps. The loss function is the Euclidean distance, also known as the L2 norm, between the predicted and ground-truth saliency maps.
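A sketch of this training setup in PyTorch, reusing the ShallowNet module sketched above, might look as follows; the exact decay schedule from 3E-2 down to 1E-4 is not specified, so the exponential decay factor below is an assumption:

```python
import torch.nn as nn
import torch.optim as optim

def init_weights(m):
    # Gaussian init (mean 0, std 0.01) for weights, constant 0.1 for biases
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        nn.init.constant_(m.bias, 0.1)

model = ShallowNet()
model.apply(init_weights)

criterion = nn.MSELoss()  # stand-in for the Euclidean (L2) loss
optimizer = optim.SGD(model.parameters(), lr=3e-2,
                      momentum=0.9, weight_decay=5e-4, nesterov=True)
# Assumed schedule: exponential decay taking 3e-2 to roughly 1e-4
# over the 1,200 training epochs (0.9953 ** 1200 * 3e-2 ≈ 1e-4).
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9953)
```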

  • The ShallowNet Results:

My implementation of ShallowNet is not perfect and requires further fine-tuning. The predicted saliency maps show that it is reasonably good at picking out the truly salient objects; however, it also assigns some saliency score to background objects. My results are shown below:

Query image (left), ground truth (centre) and predicted saliency map (right)

The Deep Convnet:-

The DeepNet architecture is designed so that pre-trained parameters from an image classification model can be combined with layers designed specifically for saliency prediction.

  • DeepNet Architecture:

As the name suggests, DeepNet is a deeper architecture than ShallowNet. It consists of 10 weight layers and has a total of 25.8 million parameters. Despite having more layers, DeepNet has about 60% fewer parameters than ShallowNet, because it excludes the fully connected layers.

Image Courtesy: Junting Pan et al.

The input to the DeepNet model is an image of size 320×240. The model is trained using transfer learning: the weights of the first three convolutional layers are initialized from the VGG_CNN_M model, while the weights of the remaining layers are randomly initialized using He initialization. A ReLU layer follows every convolutional layer and applies an element-wise non-linear activation. After the input image passes through all the convolutional layers, the feature shape becomes 37×17×1. Finally, a deconvolution layer upsamples this map into a saliency map of the same size as the input image.
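Below is a hedged PyTorch sketch of such an architecture. The article does not list the exact layer configuration, so the kernel sizes and channel widths here are my assumptions (the first three layers mirror VGG_CNN_M's 7×7, 5×5 and 3×3 filters), and the sketch will not reproduce the exact 25.8 million parameter count or the 37×17 feature size; only the overall shape (nine convolutions ending in a single-channel map, followed by one deconvolution back to the input size, ten weight layers in total) follows the description above.

```python
import torch
import torch.nn as nn

class DeepNet(nn.Module):
    """A sketch of a DeepNet-style fully convolutional saliency model."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            # first three conv layers: weights would be copied from VGG_CNN_M
            nn.Conv2d(3, 96, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(96, 256, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(inplace=True),
            # remaining layers are randomly initialized (He initialization)
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 1),  # collapse features to a single-channel map
        )
        # deconvolution upsamples the coarse map back to the input size
        self.deconv = nn.ConvTranspose2d(1, 1, kernel_size=16, stride=16)
        for m in list(self.features[8:]) + [self.deconv]:
            if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
                nn.init.kaiming_normal_(m.weight)  # He initialization

    def forward(self, x):                     # x: (N, 3, 240, 320)
        return self.deconv(self.features(x))  # -> (N, 1, 240, 320)
```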

  • Training the DeepNet:

The images and their corresponding saliency maps in the SALICON dataset are of size 640×480. These are downscaled by a factor of 2, and the mean value across all channels is computed for each. The images and their saliency maps are zero-centered by subtracting their respective means, and after zero-centering the pixel values are scaled to the [−1, 1] interval. The base learning rate is 0.01. The learning rate of 1.3E-7 is obtained by normalizing the base learning rate by the total number of predictions made per image (a 320×240 output map contains 76,800 predictions, and 0.01 / 76,800 ≈ 1.3E-7). The model is trained for 24,000 iterations (i.e., 5 epochs). The weight decay is set to 5E-4 and the Euclidean (L2) norm is used as the loss function. SGD with Nesterov momentum is used to update the weights.
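Below is a minimal sketch of this preprocessing and learning-rate normalization, assuming the image and saliency map arrive as float tensors of shape (C, 480, 640) in the [0, 255] range; the exact scaling details are my reading of the description above:

```python
import torch
import torch.nn.functional as F

def preprocess(img, sal):
    """Downscale by 2, zero-center, and rescale to [-1, 1]."""
    img = F.interpolate(img[None], scale_factor=0.5, mode='bilinear',
                        align_corners=False)[0]  # 640x480 -> 320x240
    sal = F.interpolate(sal[None], scale_factor=0.5, mode='bilinear',
                        align_corners=False)[0]
    img = img - img.mean()       # zero-center across all channels
    sal = sal - sal.mean()
    img = img / img.abs().max()  # scale into the [-1, 1] interval
    sal = sal / sal.abs().max()
    return img, sal

# Base learning rate normalized by the number of predictions per image:
base_lr = 0.01
lr = base_lr / (320 * 240)  # 0.01 / 76,800 ≈ 1.3e-7
```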

  • The DeepNet Results:

In my implementation of DeepNet I have inverted the color of the salient part in the output (i.e., the black regions correspond to the salient areas). The model is not fully accurate and can still be improved by fine-tuning the parameters. My results are shown below:

Query image (left), ground truth (centre) and predicted saliency map (right)

The Quantitative Comparison:-

The table below quantitatively compares the results of my implementation against those of the authors on three metrics: shuffled Area Under Curve (sAUC), Borji's Area Under Curve (AUC-B) and Pearson's Correlation Coefficient (CC).

Quantitative comparison of my implementation of the models with the authors' implementation.

Although my results are qualitatively close to the ground truth, they differ from those of the authors. The likely reasons for the discrepancy are hyper-parameter tuning and the difference in the frameworks used to implement the networks.

The authors' implementations of ShallowNet and DeepNet are publicly available on GitHub (Author's implementation); they implemented ShallowNet using Lasagne and DeepNet using Caffe. I used the PyTorch framework to implement both models. My implementation is also available on GitHub (My Implementation).

— — — — — — — — — — — — — — — — — — — — — — — — —

Thanks for going through this article. In the next article I will summarize SalGAN. I sincerely hope this one helped you learn something new. Please feel free to leave a message with comments or suggestions.
