Siamese Networks in Keras for Image and Text Similarity

1. Introduction

This blog is about the Siamese network, an architecture that works extremely well for checking the similarity between two inputs. It is widely used for problems involving image similarity and text similarity, and in this blog I will demonstrate it in both scenarios. For image similarity, I will use the Kaggle problem "Recognizing Faces in the Wild", in which I participated and secured 28th position on the leaderboard using Siamese networks as baseline models. For text similarity, I will use the Quora Question Pairs dataset. If you are only interested in the code, jump to the references section for the GitHub links.

2. Theory

According to Wikipedia, a Siamese neural network is defined as follows:

A Siamese neural network is an artificial neural network that uses the same weights while working in tandem on two different input vectors to compute comparable output vectors. Often one of the output vectors is precomputed, thus forming a baseline against which the other output vector is compared. This is similar to comparing fingerprints, but can be described more technically as a distance function for locality-sensitive hashing.

For both of the datasets mentioned above, the basic idea of a Siamese network is as follows:

A basic Siamese network — Source

In a Siamese network, the base network that extracts features from the entities (images or text) is kept the same, and the two entities we want to compare are passed through that exact same network. "Exact same network" means that both entities go through the same architecture with the same weights, as shown in the figure. At the end of the common network we get a vector representation of each input, which can then be used to measure or quantify the similarity between them.
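To make the weight sharing concrete, here is a minimal NumPy sketch. The toy one-layer `embed` function, the weight shapes, and the random inputs are all assumptions chosen for illustration; they stand in for whatever common network is actually used.

```python
import numpy as np

rng = np.random.default_rng(0)

# One set of weights, shared by both branches (toy sizes, assumed for illustration).
W = rng.normal(size=(8, 4))
b = np.zeros(4)

def embed(x):
    """The common network: a single dense layer with a tanh activation."""
    return np.tanh(x @ W + b)

def distance(x_a, x_b):
    """Euclidean distance between the embeddings of two inputs."""
    return float(np.linalg.norm(embed(x_a) - embed(x_b)))

x1 = rng.normal(size=8)
x2 = rng.normal(size=8)

same = distance(x1, x1)       # identical inputs give identical embeddings
different = distance(x1, x2)  # distance(x2, x1) gives the same value
```

Because both inputs go through the same `W` and `b`, the comparison is symmetric and an input compared with itself always has distance zero.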

3. Basic Working

To find the similarity between two images, we can first load the images as NumPy arrays, feed them through any architecture (it must be the same for both images), and get an n-dimensional representation at the end of the common network. These n-dimensional representations can then be used as input to a loss function, a similarity metric, or a small neural network. One simple example of a Siamese network is:

A simple Siamese network-source

left_input and right_input in the above image represent the vectorized forms of the two input images whose similarity we wish to compare. These image representations are fed to a common network (a CNN in the above example). The same operations are performed on both image vectors, and finally the distance between them is calculated by subtracting the vectors. The resulting difference vector is connected to a simple sigmoid output layer. Note that the input to this network is a list of two inputs: left_input and right_input.
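The structure described above can be sketched in Keras roughly as follows. This is an illustrative sketch, not the code behind the figure: the input size, the layer sizes, and the helper name `build_simple_siamese` are assumptions.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_simple_siamese(input_shape=(64, 64, 1)):
    # Common network: a single small CNN instance, so both branches share its weights.
    encoder = keras.Sequential(
        [
            keras.Input(shape=input_shape),
            layers.Conv2D(16, 3, activation="relu"),
            layers.MaxPooling2D(),
            layers.Conv2D(32, 3, activation="relu"),
            layers.GlobalAveragePooling2D(),
            layers.Dense(32, activation="relu"),
        ],
        name="shared_cnn",
    )

    left_input = keras.Input(shape=input_shape, name="left_input")
    right_input = keras.Input(shape=input_shape, name="right_input")

    # Calling the same encoder object twice reuses the same weights.
    encoded_left = encoder(left_input)
    encoded_right = encoder(right_input)

    # Element-wise difference of the two feature vectors, then a sigmoid unit.
    difference = layers.Subtract()([encoded_left, encoded_right])
    prediction = layers.Dense(1, activation="sigmoid")(difference)

    return keras.Model(inputs=[left_input, right_input], outputs=prediction)

model = build_simple_siamese()
model.compile(loss="binary_crossentropy", optimizer="adam")

# Two dummy batches stand in for real image pairs.
left = np.zeros((2, 64, 64, 1), dtype="float32")
right = np.ones((2, 64, 64, 1), dtype="float32")
scores = model.predict([left, right], verbose=0)  # one similarity score per pair
```

The key design point is that `encoder` is created once and called on both inputs; creating two separate encoders would give two independent sets of weights and the network would no longer be Siamese.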

4. Siamese network for image similarity

Siamese networks work very well for image similarity tasks. As an example, I am using the dataset from the Kaggle problem "Recognizing Faces in the Wild", in which I recently participated and secured 28th rank just by using and tweaking multiple Siamese networks like the one described below. For the sake of this blog, I will only discuss the Siamese network part of the solution. If you are interested in the other parts, such as pre-processing and data generation, kindly check out the project GitHub link in the references.

Let me now explain the Siamese architecture used for checking whether two people are related by blood, judging by their facial features. To start, the overall model has two inputs: input_1 is the image of person 1 and input_2 is the image of person 2. The task is to identify whether these two people are related by blood.

The base model, i.e. the common network of the Siamese network discussed above, is the pre-trained VGGFace model in this case study (you can use any architecture of your choice here). All of its layers except the last three are frozen by setting their trainable parameter to False. The idea is that the earlier layers are used only for feature extraction, while the later layers, which are closer to the decision making, can be fine-tuned.

x1 and x2 are the features representing the two images. Each of them is sent through global max pooling and global average pooling. The x3 vector is the squared difference of x1 and x2, i.e. (x1 - x2)^2. Similarly, the x4 vector is the difference of the squares of x1 and x2, i.e. x1^2 - x2^2. Further, x5 is the cosine similarity between x1 and x2. Finally, x3, x4, and x5 are concatenated and fed to a dense layer, followed by a dropout layer and then the sigmoid output layer.
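The arithmetic behind x3, x4, and x5 can be checked with a small NumPy sketch. It works on random arrays standing in for the base model's feature maps; concatenating the max-pooled and average-pooled vectors, the array shapes, and the function names are assumptions made for illustration.

```python
import numpy as np

def pooled_features(feature_map):
    """Concatenate global max and global average pooling over height and width.

    feature_map has shape (batch, height, width, channels).
    """
    global_max = feature_map.max(axis=(1, 2))
    global_avg = feature_map.mean(axis=(1, 2))
    return np.concatenate([global_max, global_avg], axis=-1)

def comparison_features(map_1, map_2, eps=1e-8):
    x1 = pooled_features(map_1)
    x2 = pooled_features(map_2)

    x3 = (x1 - x2) ** 2     # squared difference
    x4 = x1 ** 2 - x2 ** 2  # difference of squares
    # Cosine similarity: one value per pair in the batch.
    norms = np.linalg.norm(x1, axis=-1) * np.linalg.norm(x2, axis=-1)
    x5 = np.sum(x1 * x2, axis=-1, keepdims=True) / (norms[:, None] + eps)

    return np.concatenate([x3, x4, x5], axis=-1)

rng = np.random.default_rng(0)
maps_a = rng.random((2, 7, 7, 16))
maps_b = rng.random((2, 7, 7, 16))
features = comparison_features(maps_a, maps_b)  # shape (2, 65): 32 + 32 + 1
```

In a Keras model the same operations would be expressed with layers such as Subtract, Multiply, Dot, and Concatenate, so that gradients can flow back into the common network.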

In simple words, the two images are featurized using a common network. The two resulting feature vectors can then be used directly, or sent through a small decision-making network, to check the similarity between the images.

5. Siamese network for text similarity

Just as we used a Siamese network to check whether two images are similar, the same concept can be used to check whether two pieces of text are similar. As an example, we will consider the Quora Question Pairs dataset. The problem at hand is to check whether a pair of questions posted on the Quora website are similar or not.

To solve this task, we can again use a Siamese network to classify a pair of texts as similar or not. Here, the common network used for featurizing the texts is a simple Embedding layer followed by an LSTM layer.

In this network, input_1 and input_2 are pre-processed, Keras-tokenized text sequences that are to be compared for similar intent. These two text sequences are fed through a common network consisting of a basic Embedding layer and an LSTM layer. Once the feature vectors are obtained from this common network, a series of similarity measures is computed and concatenated, and the result is fed into a Dense layer followed by a sigmoid output unit, which classifies whether the given texts are similar or not.
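A minimal Keras sketch of such a text model is given below. It is not the exact code from the repository: the vocabulary size, sequence length, layer sizes, and the particular similarity measures (squared difference, element-wise product, cosine similarity) are assumptions chosen for illustration.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 5000  # assumed tokenizer vocabulary size
MAX_LEN = 30       # assumed padded sequence length

input_1 = keras.Input(shape=(MAX_LEN,), dtype="int32", name="input_1")
input_2 = keras.Input(shape=(MAX_LEN,), dtype="int32", name="input_2")

# Common network: the same Embedding and LSTM layer objects serve both inputs.
embedding = layers.Embedding(VOCAB_SIZE, 64)
lstm = layers.LSTM(32)

x1 = lstm(embedding(input_1))
x2 = lstm(embedding(input_2))

# A few similarity measures between the two sentence vectors.
diff = layers.Subtract()([x1, x2])
squared_diff = layers.Multiply()([diff, diff])
product = layers.Multiply()([x1, x2])
cosine = layers.Dot(axes=1, normalize=True)([x1, x2])

merged = layers.Concatenate()([squared_diff, product, cosine])
hidden = layers.Dense(64, activation="relu")(merged)
output = layers.Dense(1, activation="sigmoid")(hidden)

text_model = keras.Model(inputs=[input_1, input_2], outputs=output)
text_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Random token ids stand in for tokenized, padded question pairs.
rng = np.random.default_rng(0)
q1 = rng.integers(1, VOCAB_SIZE, size=(4, MAX_LEN)).astype("int32")
q2 = rng.integers(1, VOCAB_SIZE, size=(4, MAX_LEN)).astype("int32")
probs = text_model.predict([q1, q2], verbose=0)  # one probability per question pair
```

With real data, q1 and q2 would come from a Keras tokenizer followed by padding to MAX_LEN, and the labels would be the 0/1 duplicate flags from the dataset.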

6. Siamese in conjunction with triplet loss

Siamese networks are often used in conjunction with triplet loss. Triplet loss pairs well with a Siamese network because it takes three different inputs (images or anything else), each of which is fed through the common Siamese network to get features that can later be used for prediction or classification.

According to Wikipedia:

Triplet loss is a loss function for artificial neural networks where a baseline (anchor) input is compared to a positive (truthy) input and a negative (falsy) input. The distance from the baseline (anchor) input to the positive (truthy) input is minimized, and the distance from the baseline (anchor) input to the negative (falsy) input is maximized.

It is often used for learning similarity for the purpose of learning embeddings, like word embeddings and even thought vectors, and metric learning.

In triplet loss we have three inputs: anchor, positive, and negative. The anchor can be any image of a person, the positive is some other image of the same person, and the negative is an image of a different person. The loss function can be described as: L = max(d(a, p) - d(a, n) + margin, 0), where d(a, p) is the distance between the anchor image and the positive image, and d(a, n) is the distance between the anchor image and the negative image. The margin is the minimum amount by which the negative must be farther from the anchor than the positive before the loss drops to zero.
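To see how the formula behaves, here is a small worked example in plain Python. The two-dimensional points, the Euclidean distance, and the margin of 1.0 are assumptions chosen so that the arithmetic is easy to follow.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """L = max(d(a, p) - d(a, n) + margin, 0) for a single triplet."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)

anchor = [0.0, 0.0]
positive = [3.0, 4.0]       # d(a, p) = 5
near_negative = [0.0, 2.0]  # d(a, n) = 2
far_negative = [6.0, 8.0]   # d(a, n) = 10

hard = triplet_loss(anchor, positive, near_negative)  # 5 - 2 + 1 = 4.0
easy = triplet_loss(anchor, positive, far_negative)   # 5 - 10 + 1 < 0, clipped to 0.0
```

A negative that is already much farther away than the positive contributes nothing to the loss, while a negative that sits closer to the anchor than the positive is penalized.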

As the formula shows, minimizing the triplet loss tries to maximize the distance between the anchor image and the negative image while minimizing the distance between the anchor image and the positive image, so the network learns to tell similar images from dissimilar ones. To learn about the theoretical aspects of triplet loss, please have a look at this video [9]. I will now focus on the practical aspects of using triplet loss with the Kaggle Recognizing Faces in the Wild dataset.

In practical scenarios, the datasets we generally work with have image pairs and output labels. So we first have to generate inputs such that, instead of image pairs, we have triplets of anchor, positive, and negative images. We do not need an output label, so any dummy value will work, because the triplet loss operates on the distances between the images rather than on labels. The triplets for the above-mentioned dataset are produced by a data generator; please refer to the GitHub code in the references to follow the previous steps as well.
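A minimal sketch of such a triplet generator is shown below. It is not the generator from the repository: the function name, the `groups` mapping, and the file names are hypothetical, and it yields item identifiers rather than loaded image arrays. It also assumes that at least two different groups are present, otherwise no valid negative exists.

```python
import random

def triplet_generator(positive_pairs, groups, batch_size=4, seed=0):
    """Yield ([anchors, positives, negatives], dummy_labels) batches forever.

    positive_pairs: list of (anchor_item, positive_item) tuples.
    groups: dict mapping each item to its group id (person or family);
            a negative is any item from a different group than the anchor.
    """
    rng = random.Random(seed)
    items = list(groups)
    while True:
        anchors, positives, negatives = [], [], []
        for anchor, positive in rng.choices(positive_pairs, k=batch_size):
            negative = rng.choice(items)
            while groups[negative] == groups[anchor]:
                negative = rng.choice(items)
            anchors.append(anchor)
            positives.append(positive)
            negatives.append(negative)
        # The labels are never used by the triplet loss, so zeros are enough.
        yield [anchors, positives, negatives], [0] * batch_size

# Hypothetical file names and group ids, for illustration only.
groups = {"a1.jpg": "A", "a2.jpg": "A", "b1.jpg": "B", "b2.jpg": "B", "c1.jpg": "C"}
pairs = [("a1.jpg", "a2.jpg"), ("b1.jpg", "b2.jpg")]
(X1, X2, X3), labels = next(triplet_generator(pairs, groups))
```

In a real pipeline each identifier would be replaced by the loaded and pre-processed image, and the three lists would be stacked into arrays before being yielded.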

The data generator yields three images, X1 (anchor image), X2 (positive image), and X3 (negative image), along with dummy labels (these can be ignored). Now we are ready to feed this data into our Siamese network with triplet loss. The triplet loss itself is written as a custom loss function; see the project code [1].

Further, this loss can be plugged into any custom network you want to work with.

Here, our model accepts input_1 (anchor), input_2 (positive), and input_3 (negative). These image vectors are sent through a common Siamese network, which in this case is the VGGFace model. After passing through max-pooling layers, the three resulting vectors are consumed by the triplet loss function, which works according to the methodology discussed above. Finally, you can call model.fit_generator() (in recent Keras versions, model.fit(), which accepts generators directly) to start training the complete network.
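A compact Keras sketch of a three-input model trained with triplet loss could look like the following. It is an illustration under several assumptions, not the code from the repository: a tiny CNN stands in for VGGFace, the three embeddings are concatenated into one output so that a standard Keras loss function can split them again, and squared Euclidean distance is used for d.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

EMBEDDING_DIM = 32  # assumed embedding size
MARGIN = 1.0        # assumed margin

def triplet_loss(y_true, y_pred):
    """y_pred holds [anchor | positive | negative] side by side; y_true is ignored."""
    anchor = y_pred[:, :EMBEDDING_DIM]
    positive = y_pred[:, EMBEDDING_DIM:2 * EMBEDDING_DIM]
    negative = y_pred[:, 2 * EMBEDDING_DIM:]
    d_ap = tf.reduce_sum(tf.square(anchor - positive), axis=1)  # squared distance
    d_an = tf.reduce_sum(tf.square(anchor - negative), axis=1)
    return tf.maximum(d_ap - d_an + MARGIN, 0.0)  # one loss value per triplet

def build_triplet_model(input_shape=(64, 64, 3)):
    # Stand-in for the common network; a pre-trained face model would go here.
    encoder = keras.Sequential(
        [
            keras.Input(shape=input_shape),
            layers.Conv2D(16, 3, activation="relu"),
            layers.Conv2D(32, 3, activation="relu"),
            layers.GlobalMaxPooling2D(),
            layers.Dense(EMBEDDING_DIM),
        ],
        name="shared_encoder",
    )
    input_1 = keras.Input(shape=input_shape, name="input_1")  # anchor
    input_2 = keras.Input(shape=input_shape, name="input_2")  # positive
    input_3 = keras.Input(shape=input_shape, name="input_3")  # negative
    embeddings = [encoder(x) for x in (input_1, input_2, input_3)]
    merged = layers.Concatenate(axis=-1)(embeddings)
    return keras.Model(inputs=[input_1, input_2, input_3], outputs=merged)

triplet_model = build_triplet_model()
triplet_model.compile(loss=triplet_loss, optimizer="adam")

# Dummy images stand in for a batch of (anchor, positive, negative) triplets.
images = np.zeros((2, 64, 64, 3), dtype="float32")
stacked = triplet_model.predict([images, images, images], verbose=0)
```

Training then only needs batches of three image arrays plus dummy labels of the right batch size, since `triplet_loss` never reads `y_true`.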

Note: To get even better results with triplet loss, the negative sample should not be just any random image. Instead, construct the dataset carefully by choosing a negative image that looks similar to the anchor image but is in fact not a match.
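One common way to pick such negatives is to take the candidate whose embedding is currently closest to the anchor while belonging to a different person or family. The NumPy sketch below illustrates the idea; the function name, the toy embeddings, and the group labels are hypothetical, and it assumes at least one candidate comes from a different group than the anchor.

```python
import numpy as np

def hardest_negative(anchor_embedding, candidate_embeddings, candidate_groups, anchor_group):
    """Return the index of the closest candidate that belongs to a different group."""
    distances = np.linalg.norm(candidate_embeddings - anchor_embedding, axis=1)
    # Candidates from the anchor's own group are not valid negatives.
    distances[np.asarray(candidate_groups) == anchor_group] = np.inf
    return int(np.argmin(distances))

embeddings = np.array([[0.0, 0.1],   # same group as the anchor
                       [0.2, 0.0],   # different group, close to the anchor
                       [5.0, 5.0]])  # different group, far from the anchor
groups = ["A", "B", "C"]
index = hardest_negative(np.array([0.0, 0.0]), embeddings, groups, anchor_group="A")
```

Here the second candidate is selected: it is the nearest embedding that is not from the anchor's own group, which makes it a more informative negative than the distant third one.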

7. Scope

The scope of this blog was to introduce the applications of a very popular architecture, the Siamese network, and to provide quick, easy-to-understand reference code that you can modify and use in your own projects. The full code is given in references [1] and [2]. If you are working on the datasets used as examples in this blog, better results can be obtained with some hyperparameter tuning, by playing around with the number of layers, or by using a different architecture altogether. However, the underlying working of the Siamese network remains the same.

8. References

[1] https://github.com/prabhnoor0212/Kaggle-Recognizing-faces-in-the-wild

[2] https://github.com/prabhnoor0212/Siamese-Network-Text-Similarity

[3] https://www.kaggle.com/c/recognizing-faces-in-the-wild

[4] https://www.kaggle.com/c/quora-question-pairs

[5] https://en.wikipedia.org/wiki/Artificial_neural_network

[6] https://www.mdpi.com/symmetry/symmetry-10-00385/article_deploy/html/images/symmetry-10-00385-g001.png

[7] https://github.com/hlamba28/One-Shot-Learning-with-Siamese-Networks/blob/master/Siamese%20on%20Omniglot%20Dataset.ipynb

[8] https://www.kaggle.com/hsinwenchang/vggface-baseline-197x197

[9] https://www.youtube.com/watch?v=d2XB5-tuCWU

[10] https://github.com/KinWaiCheuk/Triplet-net-keras/blob/master/Triplet%20NN%20Test%20on%20MNIST.ipynb