Similarity learning with Siamese Networks
By now you have probably come across classification and regression problems, but there is a third type of problem, the similarity problem, in which we have to find out whether two objects are similar or not. It has many applications, from face recognition to signature comparison, and the amount of data required to train such networks is not huge. Furthermore, such networks scale well: we do not need to retrain the model when we want it to handle new classes.
What are Classification and Regression?
You might have used machine learning for classification or regression problems. In classification, the network takes an input and outputs a label drawn from a fixed set of values, whereas in regression the network outputs a continuous value.
For example, suppose you want to train a model that can recognise images of dogs, cats and rats. For this, you have to obtain a labelled dataset containing images of dogs, cats and rats. After training, whatever input image you give it, the network can only output one of the labels dog, cat or rat. This is a standard computer vision problem known as image classification. Now let us see how regression is different from classification.
Take an example in which you want to predict the price of a given property. For this, you will have to collect data on property sales in the area you are interested in. After training, the model's output is a continuous value, since the price is a continuous value.
But there is also a third type of problem that we can tackle with deep learning: problems of comparison. We shall look at them in the next section on similarity learning.
What is Similarity learning?
Similarity learning is an area of supervised machine learning in which the goal is to learn a similarity function that measures how similar or related two objects are and returns a similarity value. A higher similarity score is returned when the objects are similar and a lower similarity score is returned when the objects are different. Now let us see some use cases to know why and when similarity learning is used.
Consider a problem in which we have to train a model that recognises all the students in a class in order to mark their attendance. We could use image classification as discussed in the last section: collect images of every student in the class and use them to train the model. After training, the model can recognise each student in the class. But what if a new student enrols? Since the model was not trained on their data, it cannot recognise them. We would have to collect new data and retrain the model, and training is expensive in terms of time and computation. So we can pose this as a similarity learning problem instead of a classification problem and solve it more efficiently.
Now we have a model that returns a similarity score instead of a label. When a student enters, we compare them with their stored photo, and if the similarity score is higher than a certain threshold, we mark them present. If an unknown person does not match any image in the data, the similarity score will be low and they won’t be marked present. Remember that we don’t have to retrain the model to add new students; we just need one image of each new student to compare against.
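A minimal sketch of this attendance check, assuming each photo has already been turned into an embedding vector by some trained network. The names `mark_attendance` and `enrolled`, the three-dimensional vectors and the 0.5 cutoff are all made up for illustration, and the sketch thresholds a distance (smaller means more similar) rather than a similarity score:

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mark_attendance(query, enrolled, threshold=0.5):
    """Return the closest enrolled name, or None if nobody is within `threshold`."""
    best_name, best_dist = None, float("inf")
    for name, reference in enrolled.items():
        dist = euclidean_distance(query, reference)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= threshold else None


# Hypothetical embeddings: one reference photo per enrolled student.
enrolled = {"alice": [0.9, 0.1, 0.0], "bob": [0.0, 0.8, 0.3]}

print(mark_attendance([0.85, 0.15, 0.05], enrolled))  # close to alice's reference
print(mark_attendance([0.4, 0.4, 0.9], enrolled))     # too far from everyone

# Adding a student is a dictionary update, not a training run.
enrolled["carol"] = [0.4, 0.4, 0.9]
print(mark_attendance([0.4, 0.4, 0.9], enrolled))
```

Enrolling the third student only adds one reference vector to the dictionary; the network that produces the embeddings is left untouched.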
Another example of similarity learning is comparing signatures on checks. These networks can compare the signature of the account holder with the signature on the check. If the similarity score is high, the check is accepted, and if the similarity score is low, the signature is most probably forged.
We can also solve NLP problems using similarity learning. One popular problem is recognising duplicate questions on forums such as Quora or StackOverflow, where thousands of questions are posted every hour. You might think that this is not a hard problem, since all we have to do is compare the words in the questions. You may even be right in some cases: the questions “Where do you live?” and “Where are you living?” have almost the same words, so we can say that they are asking the same question. But consider another question, “Where are you going?”. It also looks similar to the last question but has an entirely different meaning. In other cases the words may not match at all even though the questions are the same: “How old are you?” and “What is your age?” ask exactly the same thing but have no words in common. So here we train a network that returns a high similarity score when the questions are similar and a low similarity score when the questions are different.
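To see why comparing words is not enough, here is a small sketch that scores the example questions by word overlap (Jaccard similarity of their word sets). The helper names `tokens` and `word_overlap` are hypothetical, and this is only a naive baseline, not how a Siamese network scores text:

```python
import re


def tokens(text):
    """Lower-case the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def word_overlap(q1, q2):
    """Jaccard similarity: shared words divided by all distinct words."""
    a, b = tokens(q1), tokens(q2)
    return len(a & b) / len(a | b)


same_meaning = word_overlap("Where do you live?", "Where are you living?")
different_meaning = word_overlap("Where are you living?", "Where are you going?")
paraphrase = word_overlap("How old are you?", "What is your age?")

print(same_meaning, different_meaning, paraphrase)
```

With this baseline, the two questions about living and going overlap more than the two questions that actually mean the same thing, and the two age questions do not overlap at all.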
Now how do we train a network to learn similarity? We use Siamese neural networks, which are discussed next.
Siamese Neural Networks
A Siamese neural network (sometimes called a twin neural network) is an artificial neural network that contains two or more identical subnetworks, meaning they have the same configuration with the same parameters and weights. In practice the subnetworks share a single set of weights, so training one of them updates all of them. These networks are used to find the similarity of the inputs by comparing their feature vectors.
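The idea of identical subnetworks can be sketched without any deep learning library: build one encoder and call it on every input. The single-layer tanh encoder, the function name `make_encoder` and the random weights below are assumptions for illustration only:

```python
import math
import random


def make_encoder(in_dim, out_dim, seed=0):
    """Create one set of weights and return a function that encodes vectors with it."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]

    def encode(x):
        # A single dense layer with tanh activation.
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

    return encode


# One encoder, so both "twins" use exactly the same weights.
encode = make_encoder(in_dim=4, out_dim=2)

a = [0.2, 0.7, 0.1, 0.9]
b = [0.2, 0.7, 0.1, 0.9]  # same content as a
c = [0.9, 0.0, 0.8, 0.1]  # different content

f_a, f_b, f_c = encode(a), encode(b), encode(c)
print(f_a == f_b)  # identical inputs give identical feature vectors
print(f_a, f_c)
```

Because both inputs go through the same function, any difference between the feature vectors comes from the inputs and not from the subnetworks.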
Consider the diagram above: the first subnetwork takes an image as input and, after passing it through convolutional layers and fully connected layers, we get a vector representation of the face. The second face is the one we want to compare with the first, so we pass this image through a network that is exactly the same, with the same weights and parameters. Now that we have two encodings F(A) and F(B), we compare them to see how similar the two faces are.
It is important to note that F(A) and F(B) must be quite similar if both inputs are similar, which is the case in this example. If the faces are different, we want F(A) and F(B) to be very different. This is how we are going to train the network.
So how do we compare the vectors F(A) and F(B), and when can we say that they are similar or different? We simply measure the distance between them: if the distance is small, the vectors are similar, and if the distance is large, the vectors are very different from one another. So we can define a distance function d that gives us the distance between two vectors:
$$d(A, B) = \lVert F(A) - F(B) \rVert^2$$
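As a small worked example, assuming the distance is the squared Euclidean norm and using made-up two-dimensional encodings:

```latex
\begin{aligned}
F(A) &= (1,\ 2), \qquad F(B) = (4,\ 6) \\
d(A, B) &= \lVert F(A) - F(B) \rVert^2 = (1 - 4)^2 + (2 - 6)^2 = 9 + 16 = 25
\end{aligned}
```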
So when A and B are the same person, d(A,B) is small, and when A and B are different people, d(A,B) is large. We can build a loss function around this. When A and B are a positive pair, i.e. images of the same person, we can define the loss as the squared L2 distance between F(A) and F(B):
$$L(A, B) = \lVert F(A) - F(B) \rVert^2$$
When we minimise this loss function, we are minimising the distance d(A, B). For negative pairs (when the two images in a pair are of different people), we use a different kind of loss function known as hinge loss. When the two faces in a pair are different, we want F(A) and F(B) to be at a distance greater than m. If a negative pair is already more than m apart, we don’t want to waste effort pushing it further apart. This is the reason we use a hinge loss instead of the L2 loss:
$$L(A, B) = \max\left(0,\; m^2 - \lVert F(A) - F(B) \rVert^2\right)$$
So this value is zero when F(A) and F(B) are already more than m apart.
Now putting both of these losses together, we get a contrastive loss given as:
$$L(A, B) = y\,\lVert F(A) - F(B) \rVert^2 + (1 - y)\max\left(0,\; m^2 - \lVert F(A) - F(B) \rVert^2\right)$$
So when A and B are the same person, the label y is equal to 1, and when A and B are different, y is equal to zero.
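The contrastive loss for a single pair can be written out directly. This is a plain-Python sketch, assuming the distance is the squared Euclidean distance and using made-up two-dimensional encodings and a margin of 1:

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def contrastive_loss(f_a, f_b, y, margin=1.0):
    """y = 1 for a positive pair, y = 0 for a negative pair."""
    d2 = squared_distance(f_a, f_b)
    positive_term = d2                          # pull positive pairs together
    negative_term = max(0.0, margin ** 2 - d2)  # push negatives apart, up to the margin
    return y * positive_term + (1 - y) * negative_term


origin = [0.0, 0.0]
print(contrastive_loss(origin, [0.5, 0.0], y=1))  # positive pair, small distance
print(contrastive_loss(origin, [0.5, 0.0], y=0))  # negative pair that is too close
print(contrastive_loss(origin, [3.0, 4.0], y=0))  # negative pair beyond the margin
```

The last call returns zero: a negative pair that is already further apart than the margin contributes nothing to the loss.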
By using contrastive loss, we bring positive pairs together and push negative pairs apart. But with this loss function we cannot learn a ranking, which means we are not able to say how similar one pair is relative to another. We shall see how to do this in the next section.
Triplet loss
When using contrastive loss we were only able to differentiate between similar and different images but when we use triplet loss we can also find out which image is more similar when compared with other images. In other words, the network learns ranking when trained using triplet loss.
When using triplet loss, we no longer need positive and negative pairs of images. We need triplets of images, each made up of an anchor image, a positive image that is similar to the anchor image and a negative image that is very different from the anchor image.
The Siamese network now has three branches with shared weights, one each for the anchor, the positive and the negative image.
When computing the vectors for these images, we want the vectors of the anchor image and the positive image to come closer together, and we want to increase the distance between the anchor image and the negative image.
The distance between anchor vector and the positive vector is given by:
$$\lVert F(A) - F(P) \rVert^2$$
Whereas the distance between anchor vector and the negative vector is given by:
$$\lVert F(A) - F(N) \rVert^2$$
As mentioned above, we want the anchor image and positive image to have less distance between them as compared to the distance between anchor image and negative image, therefore:
$$\lVert F(A) - F(P) \rVert^2 < \lVert F(A) - F(N) \rVert^2$$
So we can form the loss function as follows:
$$L(A, P, N) = \max\left(0,\; \lVert F(A) - F(P) \rVert^2 - \lVert F(A) - F(N) \rVert^2 + m\right)$$
Here m is a margin, as we also saw in the hinge term of the contrastive loss. If the positive image is already closer to the anchor than the negative image by at least the margin m, the function returns zero and there is no loss. Otherwise, for example when the negative image is closer to the anchor than the positive image, minimising the loss brings the positive image closer to the anchor and pushes the negative image away. Remember that the loss only asks for the anchor-positive distance to be smaller than the anchor-negative distance by the margin m; it does not force the anchor and the positive to coincide.
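A plain-Python sketch of the triplet loss for one triplet, assuming squared Euclidean distances and using made-up two-dimensional encodings and a margin of 0.5:

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(f_a, f_p, f_n, margin=0.5):
    """Zero once the negative is further from the anchor than the positive by `margin`."""
    d_pos = squared_distance(f_a, f_p)
    d_neg = squared_distance(f_a, f_n)
    return max(0.0, d_pos - d_neg + margin)


anchor = [0.0, 0.0]
positive = [0.5, 0.0]

print(triplet_loss(anchor, positive, [2.0, 0.0]))   # negative far away
print(triplet_loss(anchor, positive, [0.5, 0.5]))   # negative inside the margin
print(triplet_loss(anchor, positive, [0.25, 0.0]))  # negative closer than the positive
```

The three calls cover the cases described above: no loss for an easy triplet, a small loss when the ordering is right but the margin is not met, and a larger loss when the negative is closer to the anchor than the positive.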
When training the network, we may face a problem with choosing the triplets. If we choose triplets in which the positive image and the negative image are very different, the distance between the anchor and the positive is already much smaller than the distance between the anchor and the negative. For example, the person in the negative image may have a completely different hairstyle, face structure and so on. In this case the loss is already close to zero, so the network learns little and may not learn to differentiate on the basis of finer features such as eye shape or nose shape. The model may then perform poorly when we compare the faces of two different people who look alike.
To tackle this problem we use a concept called hard negative mining, in which we train the network on hard cases: we pick triplets in which the distance between the positive image and the anchor is roughly equal to the distance between the negative image and the anchor.
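One simple way to pick hard cases, sketched in plain Python: given an anchor, a positive and a pool of candidate negatives, keep the negative closest to the anchor, or keep every negative that still violates the margin. The function names and the toy encodings are hypothetical, and real pipelines usually mine within each training batch:

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def hardest_negative(f_a, negatives):
    """The candidate negative that lies closest to the anchor."""
    return min(negatives, key=lambda f_n: squared_distance(f_a, f_n))


def violating_negatives(f_a, f_p, negatives, margin=0.5):
    """Negatives whose triplet loss with this anchor and positive is still above zero."""
    d_pos = squared_distance(f_a, f_p)
    return [
        f_n for f_n in negatives
        if d_pos - squared_distance(f_a, f_n) + margin > 0
    ]


anchor = [0.0, 0.0]
positive = [0.5, 0.0]
candidates = [[2.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

print(hardest_negative(anchor, candidates))
print(violating_negatives(anchor, positive, candidates))
```

Here only the candidate at `[0.5, 0.5]` produces a non-zero loss, so it is the only one worth training on for this anchor; the other two are easy negatives.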
How to improve similarity learning?
Loss: So far we have only seen two types of loss function, i.e. contrastive loss and triplet loss. Triplet loss is often preferred over contrastive loss because it helps with ranking and tends to give better results. But we can certainly improve the performance of the network if we can find a better loss function. “Deep metric learning with angular loss” and “Correcting the triplet selection bias for triplet loss” are some of the interesting research papers to consult if you want to go further.
Note: Some recent research suggests that a classification loss such as cross-entropy can also be used to train a Siamese network and still give accurate results.
Sampling: We can sample the triplets from the dataset in a way that increases the accuracy of the model. It is better to include hard cases in your triplets, as discussed in the last section.
Ensembles: We can also use several networks and train each of them on different triplets. Usually, we first divide the data into clusters using a clustering algorithm and then train a separate learner for each cluster.
Siamese Network implementation in Keras
Now let us use the concepts we learned above and see how we can build a model based on the Siamese network that can identify when two images are similar. Here I am going to use the MNIST handwritten digits dataset and train a model using Keras. Here is the code:
import random

import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.layers import AveragePooling2D, Conv2D, Dense, Flatten, Input, Lambda
from tensorflow.keras.models import Model

num_classes = 10
epochs = 20


def euclid_dis(vects):
    # Euclidean distance between the two batches of embeddings, one value per pair.
    x, y = vects
    sum_square = tf.reduce_sum(tf.square(x - y), axis=1, keepdims=True)
    # Clamp before the square root so the gradient stays finite at zero distance.
    return tf.sqrt(tf.maximum(sum_square, tf.keras.backend.epsilon()))


def eucl_dist_output_shape(shapes):
    shape1, shape2 = shapes
    return (shape1[0], 1)


def contrastive_loss(y_true, y_pred):
    # y_true is 1 for a matching pair and 0 otherwise; y_pred is the predicted distance.
    y_true = tf.reshape(tf.cast(y_true, y_pred.dtype), tf.shape(y_pred))
    margin = 1.0
    square_pred = tf.square(y_pred)
    margin_square = tf.square(tf.maximum(margin - y_pred, 0.0))
    return tf.reduce_mean(y_true * square_pred + (1 - y_true) * margin_square)


def create_pairs(x, digit_indices):
    # Alternate positive pairs (same digit) and negative pairs (different digits).
    pairs = []
    labels = []
    n = min([len(digit_indices[d]) for d in range(num_classes)]) - 1
    for d in range(num_classes):
        for i in range(n):
            # Positive pair: two consecutive samples of digit d.
            z1, z2 = digit_indices[d][i], digit_indices[d][i + 1]
            pairs += [[x[z1], x[z2]]]
            # Negative pair: digit d against a randomly chosen different digit.
            inc = random.randrange(1, num_classes)
            dn = (d + inc) % num_classes
            z1, z2 = digit_indices[d][i], digit_indices[dn][i]
            pairs += [[x[z1], x[z2]]]
            labels += [1, 0]
    return np.array(pairs), np.array(labels)


def create_base_net(input_shape):
    # Shared subnetwork that maps an image to a 10-dimensional embedding.
    inputs = Input(shape=input_shape)
    x = Conv2D(4, (5, 5), activation='tanh')(inputs)
    x = AveragePooling2D(pool_size=(2, 2))(x)
    x = Conv2D(16, (5, 5), activation='tanh')(x)
    x = AveragePooling2D(pool_size=(2, 2))(x)
    x = Flatten()(x)
    x = Dense(10, activation='tanh')(x)
    model = Model(inputs, x)
    model.summary()
    return model


def compute_accuracy(y_true, y_pred):
    '''Compute classification accuracy with a fixed threshold on distances.
    '''
    pred = y_pred.ravel() < 0.5
    return np.mean(pred == y_true)


def accuracy(y_true, y_pred):
    '''Compute classification accuracy with a fixed threshold on distances.
    '''
    y_true = tf.reshape(y_true, tf.shape(y_pred))
    pred = tf.cast(y_pred < 0.5, y_true.dtype)
    return tf.reduce_mean(tf.cast(tf.equal(y_true, pred), tf.float32))


# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# add a channel axis and scale the pixel values to [0, 1]
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1).astype('float32') / 255
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1).astype('float32') / 255
print(x_train.shape)
input_shape = x_train.shape[1:]  # (28, 28, 1)
# create training and test positive and negative pairs
digit_indices = [np.where(y_train == i)[0] for i in range(num_classes)]
tr_pairs, tr_y = create_pairs(x_train, digit_indices)

digit_indices = [np.where(y_test == i)[0] for i in range(num_classes)]
te_pairs, te_y = create_pairs(x_test, digit_indices)

# network definition
base_network = create_base_net(input_shape)

input_a = Input(shape=input_shape)
input_b = Input(shape=input_shape)

# both inputs go through the same base network, so the weights are shared
processed_a = base_network(input_a)
processed_b = base_network(input_b)

distance = Lambda(euclid_dis, output_shape=eucl_dist_output_shape)([processed_a, processed_b])
model = Model([input_a, input_b], distance)

# train
model.compile(loss=contrastive_loss, optimizer='adam', metrics=[accuracy])
model.fit([tr_pairs[:, 0], tr_pairs[:, 1]], tr_y,
          batch_size=128,
          epochs=epochs,
          validation_data=([te_pairs[:, 0], te_pairs[:, 1]], te_y))
# compute final accuracy on training and test sets
y_pred = model.predict([tr_pairs[:, 0], tr_pairs[:, 1]])
tr_acc = compute_accuracy(tr_y, y_pred)
y_pred = model.predict([te_pairs[:, 0], te_pairs[:, 1]])
te_acc = compute_accuracy(te_y, y_pred)

print('* Accuracy on training set: %0.2f%%' % (100 * tr_acc))
print('* Accuracy on test set: %0.2f%%' % (100 * te_acc))
import matplotlib.pyplot as plt

number_of_items = 15

# Show the first test pairs: first images in one figure, second images in the next.
for side in (0, 1):
    plt.figure(figsize=(20, 10))
    for item in range(number_of_items):
        display = plt.subplot(1, number_of_items, item + 1)
        # drop the channel axis so imshow receives a 28x28 array
        plt.imshow(te_pairs[item, side].squeeze(), cmap="gray")
        display.get_xaxis().set_visible(False)
        display.get_yaxis().set_visible(False)
    plt.show()

# y_pred holds the test-set distances computed above
for i in range(number_of_items):
    print(y_pred[i])
Output:
In the above image, we have plotted some of the pairs from our test set along with the model's predictions. Remember that the model outputs a distance rather than a similarity: a value close to zero means the two images are similar, and a value close to 1 means the digits are different. In the very first example both images show a zero and are therefore quite similar, and the model gives a value close to zero. When the images are quite different, the model returns a number closer to 1.
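The accuracy functions above use a fixed cutoff of 0.5 on the distance. As a sketch of how that cutoff could be chosen from data instead, the following plain-Python helper tries a few candidate thresholds; the distances, labels and candidate values are made up for illustration, and in practice they would come from a held-out set of pairs:

```python
def accuracy_at(distances, labels, threshold):
    """Fraction of pairs classified correctly when 'same' means distance < threshold."""
    correct = sum(int(d < threshold) == y for d, y in zip(distances, labels))
    return correct / len(labels)


def best_threshold(distances, labels, candidates):
    """The candidate threshold with the highest accuracy (first one wins on ties)."""
    return max(candidates, key=lambda t: accuracy_at(distances, labels, t))


# Hypothetical predicted distances and labels (1 = same digit, 0 = different digits).
distances = [0.05, 0.20, 0.45, 0.60, 0.90, 1.10]
labels = [1, 1, 1, 0, 0, 0]
candidates = [0.1, 0.3, 0.5, 0.7]

for t in candidates:
    print(t, accuracy_at(distances, labels, t))
print("best:", best_threshold(distances, labels, candidates))
```

On this toy data a cutoff of 0.5 separates the two groups perfectly, while stricter or looser cutoffs each misclassify at least one pair.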
We can also use Siamese networks for face recognition; see the article “Face Recognition Using Python and OpenCV”, where I have used a pre-trained model based on the same concepts for face recognition.
This brings us to the end of this article, where we learned about Siamese networks and similarity learning.