Siamese Networks. Line by line explanation for beginners | by Krishna Prasad | Towards Data Science

Siamese Networks

Summary

Siamese Networks are a class of neural networks capable of one-shot learning. This post is aimed at deep learning beginners who are comfortable with Python and the basics of convolutional neural networks. We will go through a line-by-line explanation of how Siamese networks are implemented using Keras in Python. While going through the code, if you feel something could have been explained or done in a better way, feel free to comment.

Image by Gerd Altmann from Pixabay

Introduction

Let us assume we have a company of 1000 employees, and we decide to implement a facial recognition system to record their attendance. If we were to use a traditional neural network, we would face two main problems. The first is the dataset: it would be nearly impossible to assemble a large collection of images of each employee, and we would end up with at most about 5 photos per person. A traditional CNN (Convolutional Neural Network) cannot learn useful features from such a small collection, and we would also end up with 1000 output classes. The second problem appears even if we somehow collected a large dataset for every employee and trained a really good CNN model: what happens when a new employee joins the organization? How can we include that person in the facial recognition system without adding a new output class and retraining the network? Siamese networks overcome all these shortcomings. In this post, we will experiment with one-shot learning using Siamese networks, which learn to measure the difference between two images rather than match an image against a fixed set of classes.

Rather than requiring a large amount of data for each class, a Siamese network calculates a similarity score between two images. The input to the network is a pair of images that belong either to the same class or to different classes, and the output is a floating-point number between 0 and 1. In the implementation described below, a value close to 0 indicates that the two images are of the same class and a value close to 1 indicates that they are from different classes, so the output behaves like a dissimilarity score. Let me start by explaining how this differs from image classification with standard CNN architectures.

Architecture

A CNN model has a series of convolutional and pooling layers followed by some dense layers and an output layer, typically with a softmax activation. The convolutional layers are responsible for extracting features from the image, whereas the softmax layer produces a probability for every class. The predicted class is the one whose output neuron has the highest probability.

Take a look at this great article for more information on how CNNs work.

Traditional CNN Architecture by Sumit Saha

A Siamese network has a similar stack of convolutional and pooling layers, except that there is no softmax layer; the network stops at the dense layers. Since the network takes two images as inputs, we end up with two dense-layer outputs, one per image. We then calculate the difference between these two outputs and feed the result to a single neuron with a sigmoid activation function, which produces a value between 0 and 1. The training data must therefore be structured as pairs of images, each pair with a label of either 0 or 1.

Siamese Networks

Note: There is only one network, and both images are passed through it; it simply has two inputs. Both inputs are therefore processed by the same weight matrices in the convolutional and dense layers.
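To make the weight sharing concrete, here is a framework-free NumPy sketch of the forward pass. It is not the post's Keras model: a single dense layer with ReLU stands in for the convolutional tower, the sizes (100 x 100 x 3 inputs, 64 features) are illustrative assumptions, and taking the absolute value of the difference is an assumption that makes the score independent of input order. All names (embed, siamese_forward, W, w_out) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# ONE set of tower weights, shared by both inputs
# (a dense layer standing in for the conv + dense layers).
W = rng.normal(scale=0.01, size=(100 * 100 * 3, 64))
b = np.zeros(64)
# Weights of the single sigmoid output neuron.
w_out = rng.normal(scale=0.1, size=64)
b_out = 0.0

def embed(images):
    """(N, 100, 100, 3) images -> (N, 64) feature vectors, always using the same W and b."""
    flat = images.reshape(len(images), -1)
    return np.maximum(flat @ W + b, 0.0)  # ReLU

def siamese_forward(left, right):
    """Score each pair: |embed(left) - embed(right)| fed to one sigmoid neuron."""
    diff = np.abs(embed(left) - embed(right))
    return 1.0 / (1.0 + np.exp(-(diff @ w_out + b_out)))  # values in (0, 1)
```

Only one W is ever created, and swapping the two arguments gives the same score, which is what "one network, two inputs" means in practice. Before training, an identical pair scores exactly sigmoid(0) = 0.5; training pushes same-class pairs toward 0 and different-class pairs toward 1.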

If you are still unclear about how this works, refer to this link.

Code

For this post, I have used the Fruits 360 dataset from Kaggle. However, feel free to experiment with other datasets. The code is hosted on Kaggle. If you have any doubts about the code, feel free to fork the notebook below and experiment with it yourself.

https://www.kaggle.com/krishnaprasad96/siamese-network

Importing Libraries

Let us start by importing the libraries we are using. As mentioned before, this code uses Keras for building the model, and NumPy and Pillow for data preprocessing.

Note: Don’t import Keras as “from tensorflow import keras”; import the standalone keras package instead.

Data Preprocessing

  • Line 1: Include the base directory of the dataset
  • Line 2: Indicate the fraction of classes that will be used for training. The remaining classes will be used for testing
  • Line 3: Since Fruits 360 is an image classification dataset, it has many images per category, but a small portion is enough for our experiment
  • Line 6: Get the list of directories in the dataset folder. Each folder corresponds to one class
  • Line 10–13: Declare three empty lists to record X (images), y (labels) and cat_list (the category of each image)
  • Line 16–24: Iterate over the class folders, select ten images from each class, convert them to RGB format and append them to a list. Keep a record of each image's class in cat_list[] for further reference
  • Line 26–28: Convert all the lists to NumPy arrays. As pixel values range from 0 to 255, divide the array X by 255 to scale them to the range 0–1
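The preprocessing steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions rather than the notebook's code: load_rgb and load_dataset are hypothetical names, every image is resized to (100, 100) (the image size used for batches later in this post), and cat_list is assumed to hold, for each class, the indices of that class's images in X. The image loader is a parameter so the function can be exercised without real image files; Pillow is only imported when the default loader is actually used.

```python
import os
import numpy as np

def load_rgb(path, size=(100, 100)):
    """Read one image, force 3-channel RGB and resize it (Pillow is imported lazily)."""
    from PIL import Image
    with Image.open(path) as img:
        return np.asarray(img.convert("RGB").resize(size))

def load_dataset(base_dir, files_per_class=10, loader=load_rgb):
    """Return X (images scaled to [0, 1]), y (class index per image) and
    cat_list (for each class, the indices in X of that class's images)."""
    # Each sub-folder of base_dir is one class.
    classes = sorted(d for d in os.listdir(base_dir)
                     if os.path.isdir(os.path.join(base_dir, d)))
    X, y, cat_list = [], [], []
    for label, class_name in enumerate(classes):
        class_dir = os.path.join(base_dir, class_name)
        files = sorted(os.listdir(class_dir))[:files_per_class]  # keep only a small subset
        cat_list.append(list(range(len(X), len(X) + len(files))))
        for fname in files:
            X.append(loader(os.path.join(class_dir, fname)))
            y.append(label)
    X = np.asarray(X, dtype=np.float32) / 255.0  # pixel values 0-255 -> 0-1
    return X, np.asarray(y), cat_list
```

For example, X, y, cat_list = load_dataset("path/to/dataset", files_per_class=10), where the path is a placeholder for a folder containing one sub-folder per class.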

Train Test Split

  • Line 1: Calculate the number of classes used for training by multiplying the total number of classes by train_test_split
  • Line 2: Subtract train_size from the total number of classes to get test_size
  • Line 4: Multiply train_size by the number of files in each class to get the total number of training files
  • Line 7–15: Use the values calculated above to split X, y and cat_list into training and test subsets
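A class-level split like the one described above might look like the sketch below. It assumes that every class contributes the same number of images, stored contiguously in X, and that cat_list can be viewed as a (num_classes, files_per_class) array of indices into X. The names x_train, cat_train, x_val and cat_test follow the post; split_by_class and train_fraction are hypothetical. Because the split is by class, the validation classes are never seen during training, which is what one-shot validation requires.

```python
import numpy as np

def split_by_class(X, cat_list, train_fraction=0.8):
    """Split images by class so that validation classes are unseen during training."""
    cat_list = np.asarray(cat_list)                 # (num_classes, files_per_class)
    num_classes, files_per_class = cat_list.shape
    train_size = int(num_classes * train_fraction)  # number of training classes
    test_size = num_classes - train_size            # number of unseen validation classes
    train_files = train_size * files_per_class      # number of training images
    x_train, x_val = X[:train_files], X[train_files:]
    cat_train = cat_list[:train_size]
    cat_test = cat_list[train_size:] - train_files  # re-index so entries point into x_val
    return x_train, cat_train, x_val, cat_test, train_size, test_size
```

Re-indexing cat_test is a design choice of this sketch: it lets the validation code index x_val directly instead of the full X.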

Generating Batch

This section generates batches for training. Each batch consists of an X and a Y. In the usual case of image classification, if the batch size is 64 and the image size is (100, 100, 3), X would hold 64 elements, each of size (100, 100, 3).

In our case, since there are two inputs, X is a list (let’s say ‘A’) of length 2, and each element of ‘A’ holds 64 images of size (100, 100, 3); the i-th image in A[0] and the i-th image in A[1] together form the i-th input pair. For training, we generate a batch such that, for half of the input pairs, both images are of the same category; assign the label 0 to these pairs. For the other half, the two images are of different categories; assign the label 1 to these pairs.

  • Line 3–7: Store x_train, cat_train and the start and end indices of the training classes in temporary variables
  • Line 9–11: Set half of the labels in Y to 0 and the other half to 1
  • Line 13: Generate a random list of batch_size classes from the training categories. Also, append two arrays to batch_x, each large enough to hold batch_size images
  • Line 17–25: In each iteration, select an image for batch_x[0] from the category specified in the class list. For batch_x[1], select an image from the same category if y[i] is 0; otherwise, select it from any category other than that one
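The batch generation described above can be sketched as follows. This is a minimal sketch, assuming x_train holds the training images and cat_train[c] lists the indices of class c's images in x_train; the function name get_batch and the rng parameter (added for reproducibility) are hypothetical. X is returned as a list of two arrays, which is the form a two-input Keras model accepts, and the labels use this post's convention of 0 for a same-class pair and 1 for a different-class pair.

```python
import numpy as np

def get_batch(x_train, cat_train, batch_size=64, rng=None):
    """Build one training batch of image pairs.
    Returns ([left_images, right_images], labels), label 0 = same class, 1 = different."""
    rng = np.random.default_rng() if rng is None else rng
    n_classes = len(cat_train)
    batch_y = np.zeros(batch_size)
    batch_y[batch_size // 2:] = 1                        # half same-class, half different
    class_list = rng.integers(0, n_classes, batch_size)  # random class for every pair
    batch_x = [np.zeros((batch_size, *x_train.shape[1:])) for _ in range(2)]
    for i in range(batch_size):
        c = class_list[i]
        batch_x[0][i] = x_train[rng.choice(cat_train[c])]
        if batch_y[i] == 0:
            other = c  # same category (may occasionally be the very same image)
        else:
            other = rng.choice([k for k in range(n_classes) if k != c])  # any other category
        batch_x[1][i] = x_train[rng.choice(cat_train[other])]
    return batch_x, batch_y
```

The labels stay in a fixed order (zeros first), which is harmless because the loss is averaged over the whole batch. The sketch needs at least two training classes.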

Siamese Network

  • Line 1: Declare the shape of the input image.
  • Line 2: Declare two inputs with the shape of the image.
  • Line 6–7: Declare parameters for initializing the weights and biases of the network. The values are chosen as described in the paper.
  • Line 9–20: Declare a Sequential model with four convolutional layers and max-pooling layers, ending with a Flatten layer followed by a dense layer.
  • Line 22–23: Pass both inputs through the same model.
  • Line 25–27: Compute the difference between the dense-layer outputs of the two images and pass it through a single neuron with a sigmoid activation function.
  • Line 29–30: Compile the model with ‘binary_crossentropy’ as the loss and the ‘Adam’ optimizer.
  • Line 32: Calling plot_model on siamese_net produces the following diagram.

Siamese Network
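The binary cross-entropy loss compiled in Line 29–30 can be written out in NumPy to show what it rewards under this post's label convention (0 = same class, 1 = different class). This is an illustrative re-implementation, not Keras' source code; the function name and the clipping constant eps = 1e-7 (used to avoid log(0)) are assumptions.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and sigmoid outputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)  # avoid log(0)
    return float(np.mean(-(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))))
```

For example, a same-class pair (label 0) scored 0.1 costs -log(0.9) ≈ 0.105, while the same score for a different-class pair (label 1) costs -log(0.1) ≈ 2.303, so minimizing the loss pushes same-class pairs toward 0 and different-class pairs toward 1.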

N-way one-shot Learning

N-way one-shot learning is a way of validating a one-shot model: we pick ’n’ input pairs such that only one pair belongs to the same category and all the others are from different categories. Consider a 9-way one-shot validation. Each input to the network requires two images; x[0] remains constant for all 9 pairs, while x[1] belongs to the same category as x[0] in only 1 of the 9 pairs and to a different category in all the others. When all 9 pairs are given to the model, the pair that belongs to the same category is expected to have the lowest value of the 9. If it does, we count it as a successful prediction.

The input parameter n_val refers to the number of validation steps, and n_way refers to the number of pairs in each validation step. Remember that x[0], mentioned above, remains constant within each validation step.

(For a deeper understanding, please fork the notebook from Kaggle and try debugging each line of this function.)

  • Line 3–7: Store x_val and cat_test in temporary variables
  • Line 9: This is the same as Line 13 in the batch generation, except that we create a batch of random categories from the test set
  • Line 11–24: For each validation step, we iterate n_way times. In each iteration, we take the corresponding category from class_list, pick an image from that category and store it in x[0]. For x[1], select an image from the same category in the first iteration and from a different category in all the others. This inner loop is almost the same as the batch_generation() method discussed above.
  • Line 26–31: For each validation step, predict the output using the model. The result array has n_way elements; if result[0] has the minimum value among them, add 1 to n_correct. Repeat this for all the other validation steps.
  • Line 32: Calculate the accuracy using n_correct and the number of validation steps.
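A minimal sketch of the N-way one-shot check is shown below. It assumes x_val holds the validation images, cat_test[c] lists the indices of class c's images in x_val, and predict takes a list [left_images, right_images] and returns one score per pair, where lower means "more likely the same class" (the 0 = same convention used in this post). The name nway_one_shot and the rng parameter are hypothetical. Unlike the description above, this sketch draws n_way distinct classes and two distinct images for the matching pair, so the reference image is never compared with an identical copy of itself; this requires at least n_way test classes with two or more images each.

```python
import numpy as np

def nway_one_shot(predict, x_val, cat_test, n_val=100, n_way=9, rng=None):
    """Return N-way one-shot accuracy in percent.
    predict([left, right]) must return one score per pair; lower = more likely same class."""
    rng = np.random.default_rng() if rng is None else rng
    n_correct = 0
    for _ in range(n_val):
        # n_way distinct classes; classes[0] is the class of the fixed reference image x[0]
        classes = rng.choice(len(cat_test), n_way, replace=False)
        ref_idx, match_idx = rng.choice(cat_test[classes[0]], 2, replace=False)
        left = np.repeat(x_val[ref_idx][None], n_way, axis=0)  # x[0] is constant for all pairs
        right = np.stack([x_val[match_idx]] +                  # pair 0: same class
                         [x_val[rng.choice(cat_test[c])] for c in classes[1:]])  # others differ
        scores = np.asarray(predict([left, right])).reshape(-1)
        if np.argmin(scores) == 0:  # the same-class pair must get the lowest score
            n_correct += 1
    return 100.0 * n_correct / n_val
```

With the Keras model, predict could be siamese_net.predict, which accepts a list of two input arrays and returns an array of shape (n_way, 1); reshape(-1) flattens it to one score per pair.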

Training the Model

The training loop has four hyperparameters: epochs, batch_size, n_val and n_way.

  • Line 6–7: Declare two lists to record the loss and accuracy values for further visualizations.
  • Line 8–20: For each epoch, get a batch of x and y, train the model on those inputs and append the loss to the list. Note that an epoch here is a single batch rather than a full pass over the training data. Every ’n’ epochs (250 in this case), check how the model is performing by running N-way one-shot validation.
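The loop described above can be sketched as follows. The function name train and the parameters train_step, get_batch, evaluate and eval_every are hypothetical; passing the steps in as callables keeps the sketch independent of Keras. With a Keras model, train_step could be the model's train_on_batch method (which returns the batch loss, together with any compiled metrics), get_batch a function returning one batch of image pairs and labels, and evaluate a function returning the N-way one-shot accuracy.

```python
def train(train_step, get_batch, evaluate, epochs=5000, eval_every=250):
    """Minimal training loop: one batch per 'epoch', evaluation every eval_every epochs."""
    loss_list, accuracy_list = [], []
    for epoch in range(1, epochs + 1):
        batch_x, batch_y = get_batch()
        loss_list.append(train_step(batch_x, batch_y))  # record the loss of this batch
        if epoch % eval_every == 0:
            accuracy_list.append(evaluate())            # e.g. N-way one-shot accuracy
    return loss_list, accuracy_list
```

The two returned lists are what the loss and accuracy plots in the results section are drawn from.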

Results and Future works

The model was trained for 5000 epochs (that is, 5000 batches) on Kaggle. Using a GPU would significantly reduce the training time.

Training loss of the model

Accuracy of the model

We were able to achieve an accuracy of 90% on the validation set. To improve the accuracy further, we could try initializing the network with weights from a pretrained model such as VGG-16 or ResNet-50.

Concluding

Let me know if you are facing any issues; I’ll try my best to respond. As this is my first blog post ever, let me know whether it was helpful in your projects or whether I should change the way I explain things.

I’m looking forward to creating more posts on computer vision. Let me know the topics you wish to be covered. Happy hacking!