One Shot Learning (N way K Shot): Siamese Network with Contrastive Loss for Pokémon Classification

Topics

1. Few/One shot learning

2. Contrastive loss

3. About the Dataset

4. Dataset Preprocessing

5. Siamese networks

6. One shot and Few shot learning

7. Limitations and practical notes

8. Keras Code

1. Few shot learning

Few-shot learning can be applied when we only have a tiny dataset. A Siamese network trained with contrastive loss is one of the few-shot learning algorithms.

Before moving on to Siamese networks in detail, let’s first look at how they differ from ordinary neural networks.

Neural Network vs Siamese Network

  • Think of a neural network model as a college student preparing for a closed-book exam. He studies from books containing specific questions and answers, absorbs what he can, and then answers the exam questions from memory. He applies the knowledge gained from the books but does not have them with him during the exam. That is how an ordinary neural network works: everything it knows must be learned during training.

  • A Siamese network is comparable to a neural network, except that the exam is now open-book. You are not handed the exact books you studied from; instead, you are given a book on the same subject by a different author to help you answer the questions.

This is merely an intuitive understanding of the Siamese network; the preprocessing and training will differ slightly from those of neural networks, and I’ll go into more detail about how it functions in a moment.

Deep learning is data-hungry: the more data, the better the performance. Training a neural network typically needs at least a few thousand samples; otherwise the network will overfit, and even with regularization and fine-tuning, poor accuracy is to be expected.

After training in a Small dataset

We often lack access to large datasets and have only small ones, yet we still need good accuracy. Conventional machine learning or deep learning training alone struggles to deliver this, so we turn to few-shot learning.

2. Contrastive loss

  • Forget about the Siamese network for the time being as we examine a fascinating loss function.

Loss function: it takes the true value and the predicted value as inputs and measures the divergence between them.

Yann LeCun and his co-authors introduced contrastive loss in work published in 2005 and 2006; the paper by Hadsell, Chopra, and LeCun listed under Resources gives the formulation used here.

As in the paper, absolute contrastive loss

Dw is the predicted value (y_pred). Why is the predicted value called “Dw”? You can think of Dw as the Euclidean distance between two vectors, as estimated by the model. I’ll explain that in the Siamese section.

Generic contrast loss function

To make things clear, let y_pred = Dw. Only the notation changes, not the formula: the paper writes Dw because the output of a Siamese network is a distance.

  • y_true, the label, can only be 0 or 1.
  • y_pred can range from 0 to 1.
  • m, the margin value, is 1.

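y_true == 0 :
part-1 = some outcome
part-2 = 0

y_true == 1 :
part-1 = 0
part-2 = some outcome

Therefore, depending on y_true, only part-1 or only part-2 contributes to the loss.

Let’s take the loss, where part-1 is the first term and part-2 is the second:

(1-y_true) * y_pred² + y_true * max(0, m-y_pred)²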

y_true = 0

y_pred =0 or 0.10 or 0.25 or …..1

part-1
(1-0) * y_pred²
part-2
0

* This shows that for a similar pair (y_true = 0) the loss is simply y_pred squared, so optimizing the model pushes the predicted distance toward 0.

y_true = 1

y_pred =0 or 0.10 or 0.25 or …..1

part-1
0
part-2
1 * max( 0, m-y_pred )²

# Margin m=1

====> m - y_pred : 1-0.25=0.75

* 0.75 is the gap between the margin and y_pred; because m = 1, it is also the difference between y_true and y_pred.

# max()

====> max(0,0.75) =0.75 then 0.75^2 = 0.5625

* This shows that part-2 also produces a squared term, here the squared gap between the margin and y_pred. Because max() is used, the result becomes 0 once y_pred exceeds the margin, so such pairs no longer affect the loss. For illustration, take y_pred = 1.5:

= max(0,1-1.5)^2
= max(0,-0.5)^2
= 0
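The worked numbers above can be checked with a few lines of plain Python. This is a minimal, framework-free sketch of the generic formula for a single pair; the function name `contrastive_loss` and its `margin` parameter are illustrative names, not taken from the original code.

```python
def contrastive_loss(y_true, y_pred, margin=1.0):
    """Generic contrastive loss for one pair.

    y_true: 0 for a similar pair, 1 for a dissimilar pair.
    y_pred: predicted distance between the two inputs.
    """
    # part-1 is active only when y_true == 0 (similar pair).
    part_1 = (1 - y_true) * y_pred ** 2
    # part-2 is active only when y_true == 1 (dissimilar pair).
    part_2 = y_true * max(0.0, margin - y_pred) ** 2
    return part_1 + part_2


print(contrastive_loss(0, 0.25))  # similar pair: 0.25 squared
print(contrastive_loss(1, 0.25))  # dissimilar pair inside the margin: 0.75 squared
print(contrastive_loss(1, 1.5))   # dissimilar pair beyond the margin: no loss
```

In a real training loop the same expression would be written with tensor operations and averaged over the batch.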

Take a moment to work through how the contrastive loss evaluates these inputs and produces its results.

This may look like cross-entropy, but cross-entropy deals with probabilities, whereas contrastive loss deals with distances. As Yann LeCun and his co-authors describe in Section 2.1 of the paper, the contrastive loss function maps similar input vectors to nearby points on the output manifold and dissimilar input vectors to distant points.

In conclusion, similar vectors are mapped to neighboring points and dissimilar vectors to distant points. If two vectors are similar, then y = 0; if they are dissimilar, then y = 1. Keeping this in mind will help you understand the Siamese network.

Contrastive Loss

3. About the Dataset

For this article we’ll be using this dataset, which contains Pokémon images and their names.

Original dataset with 879 classes

Below, you can see the number of images for each of the 879 classes in this dataset, which consists of 5117 photos.

4. Dataset Preprocessing

The dataset will be divided into three folders.

  • Initially, we have 879 distinct classes. We take 5 of those classes, with their labels and images, to create the support set. The remaining 874 classes of images form the train set.
  • The query set is just input images without labels. For now, we choose five images from each class in the support set (taking only the image, not the label name). They are moved out of the support folder rather than copied, so images in the query set won’t be in the support set. We then refer to those five images as the query set.
  • The query set shares its classes with the support set, while the train set has 874 classes that are distinct from the support set.
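The three-way split described above can be sketched as follows. This is only an illustration under assumptions: the helper `split_dataset`, its parameters, and the dictionary layout are hypothetical, and the number of query images taken per class is left as a parameter rather than fixed.

```python
import random


def split_dataset(images_by_class, n_support_classes=5, n_query_per_class=1, seed=0):
    """Split {class_name: [image, ...]} into train, support, and query sets.

    Hypothetical helper: names and defaults are assumptions for illustration.
    """
    rng = random.Random(seed)
    class_names = sorted(images_by_class)
    support_classes = set(rng.sample(class_names, n_support_classes))

    train, support, query = {}, {}, []
    for name in class_names:
        images = list(images_by_class[name])
        if name not in support_classes:
            train[name] = images  # only these classes are seen in training
            continue
        rng.shuffle(images)
        # Query images are moved out of the support class (not copied)
        # and stored without their label.
        query.extend(images[:n_query_per_class])
        support[name] = images[n_query_per_class:]
    return train, support, query


# Toy data: 8 classes with 4 placeholder "images" each.
toy = {f"class_{i}": [f"class_{i}_img_{j}" for j in range(4)] for i in range(8)}
train, support, query = split_dataset(toy)
print(len(train), len(support), len(query))
```

With the real dataset, `images_by_class` would map each Pokémon name to its image arrays or file paths.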

But why do we need a “support” set and a “query” set, and what is “N way K shot”? The One shot and Few shot learning section explains this step by step.

Keep in mind that the support set is not part of the training data, so support set classes are never exposed to the Siamese model during training. The Siamese model can still make predictions for them; how will become clear once we have explored the Siamese topic in full.

5. Siamese networks

An ordinary neural network takes a single vector as input and attempts to predict its class label. Let’s consider a different approach here.

Take an image as a vector; we are aware that any image can be used in this way.

Image vectors

As the graphic above shows, we compute the distance between a new image vector and the image vectors of three classes; the new vector belongs to only one of those classes.

The distance function, D, tells us how similar or dissimilar two image vectors are. Two image vectors belong to the same class when they are close to one another; otherwise, they belong to different classes.

Instead of building a model that classifies an image by label, why not train a similarity function that tells us how alike two images are? In other words, why not learn the “D” function?

Any similarity function requires two inputs, since similarity is always measured between two things. That is how the Siamese network is trained: on pairs of images.

As a consequence, the Siamese network can only predict whether two inputs are similar (0) or dissimilar (1), so we need to create pairs of images.

The train set consists of 5084 image vectors in 874 classes. Each image vector in the train set is paired with an image vector from the same class (similar) and with an image vector from a different class (dissimilar). The partner vectors, both similar and dissimilar, are picked at random while keeping track of the classes.

If two image vectors are dissimilar, the pair is labeled 1, meaning that they are different and should be far from one another. If they are similar, the pair is labeled 0, meaning that they should be very close points.

For illustration, we start with the first image vector and its class, pick another image vector from that same class at random to form the similar pair, and then pick an image vector from one of the other classes at random to form the dissimilar pair.

Alternatively, you can select N similar and N dissimilar pairs per image, where N is the number of classes, but with 874 classes that would make the training dataset much larger. For simplicity’s sake, I chose to use one similar and one dissimilar image vector for each image vector in the dataset.

Make pair
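A pairing step of this kind might look like the sketch below. It is not the original code from the article; `make_pairs` and its arguments are illustrative names, and it builds exactly one similar and one dissimilar pair per image, as described above.

```python
import random


def make_pairs(images, labels, seed=0):
    """Build one similar (0) and one dissimilar (1) pair for every image."""
    rng = random.Random(seed)

    # Keep track of which image indices belong to each class.
    indices_by_class = {}
    for idx, label in enumerate(labels):
        indices_by_class.setdefault(label, []).append(idx)
    classes = list(indices_by_class)
    if len(classes) < 2:
        raise ValueError("need at least two classes to build dissimilar pairs")

    left, right, pair_labels = [], [], []
    for idx, label in enumerate(labels):
        # Similar partner: random image of the same class
        # (this may be the image itself if the class is tiny).
        same_idx = rng.choice(indices_by_class[label])
        # Dissimilar partner: random image from a random other class.
        other_class = rng.choice([c for c in classes if c != label])
        other_idx = rng.choice(indices_by_class[other_class])

        left += [images[idx], images[idx]]
        right += [images[same_idx], images[other_idx]]
        pair_labels += [0, 1]
    return left, right, pair_labels


# Toy example: six placeholder "images" in three classes.
x1, x2, y = make_pairs(list("abcdef"), [0, 0, 1, 1, 2, 2])
print(list(zip(x1, x2, y)))
```

The two returned lists play the role of the two model inputs, and `pair_labels` is the y_true fed to the contrastive loss.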

Since images are vectors and the paired data is split into two groups (0 and 1), it might seem possible to tell the two groups apart by applying the straightforward Euclidean distance formula to the raw image vectors.

So why is it necessary to train Siamese networks?

What we need is an accurate spatial representation of each image vector. Raw image vectors of the 0 and 1 groups are not separated in space, because they are just representations of images; nothing tells them “hey, you go over there, you are different” or “hey, come closer, we are all the same.”

That is why we train the network: so that similar input vectors map to nearby points on the output manifold and dissimilar input vectors map to distant points (recall the contrastive loss).

This should make the significance of contrastive loss clear. Other binary loss functions can be used, but contrastive loss is designed for exactly this kind of distance mapping.

From the above, we can draw the following conclusions:

  • Siamese networks accept two inputs and categorize them as 0 or 1.
  • We train a Siamese network to establish this similarity(0) and dissimilarity(1) in spatial dimensions
  • Here, distance(Euclidean or other) is used to determine similarities.

Diagram 1 is produced by the previous 3 points.

  • Image vectors 1 and 2 are fed into the same embedding CONV network. Because the same weights are used for both, the two inputs are mapped into the same embedding space. This shared-weights network is also known as the sister network.
  • The embedding CONV network returns an encoded vector for each input. Because it is trained with a contrastive loss function, we can expect the distance mapping to work well.
  • The Euclidean function receives the two encoded vectors and determines the distance between them.
  • The Euclidean distance is then passed through a sigmoid, which yields a score between 0 and 1 (y_pred) that can still be regarded as a distance.
  • Backpropagation using y_pred and the loss function improves the CONV network’s mapping when it creates the encoded vectors.

Now that the paired image vector dataset with labels 0 and 1 is available, let’s go to the next phase.

Let’s build the embedding layer as a convolutional network, as described above.

The embedding layer

We’ll define the Euclidean distance function next.

Euclidean distance formula
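For reference, the Euclidean distance between two embedded vectors p and q is the square root of the sum of squared element-wise differences. The snippet below is a plain-Python sketch of that formula, not the Keras function from the article; inside a model the same computation would be written with tensor operations, often with a small epsilon added before the square root for numerical stability.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two equal-length embedding vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```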

We will call the above Euclidean function via a Lambda layer.

Siamese Model

You can see in the example above that the two image vectors are passed to the same conv layers, so both use the same conv weights. After obtaining the two embedded vectors (p and q) from the conv layers, we pass them to the Lambda layer (Euclidean), apply batch normalization, and then apply a dense layer with a sigmoid activation function.

Fit the model

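# Train on the image pairs; labels are 0 (similar) or 1 (dissimilar).
history = siamese.fit(
    [x_train_1, x_train_2],
    labels_train,
    batch_size=24,
    epochs=100,
)

# Save the trained model in HDF5 format.
siamese.save("pokemon.h5")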

6. One shot and Few shot learning

First of all, one-shot and few-shot learning are the same idea with one subtle difference.

N way K shot learning

N: number of class labels

K: number of samples per class

The name N-way K-shot learning comes from the N and K counts. Distinguishing between N ≤ 10 classes with only K ≤ 10 samples per class is a commonly used size.

One Shot Learning

If K==1 then One Shot Learning

  • Take 5( N ) class labels and 1( K ) image per class from the Support set
  • Take one image from the query set and use it as the input image. This is not why it is called one-shot learning; the name comes from taking K=1 samples per class from the support set.
  • By pairing the query image with each image from the support set, the Siamese model can predict the class of the query image.

5 way 1 shot learning

We run the Siamese prediction on (Query_image, support_data[i]) for i = 1 to 5.

  • You can see that when predicting against the Zoroark class, the unknown query image has the smallest distance (0.02840), indicating that those two points belong to the same class while the others do not. Since the model uses the Lambda layer with the Euclidean function, you can think of the model’s output as a distance.
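The prediction loop for a 5-way 1-shot episode can be sketched as below. The helper `predict_one_shot` is hypothetical, and `distance_fn` stands in for the trained Siamese model: any callable that takes the query image and one support image and returns a distance will do.

```python
def predict_one_shot(query, support, distance_fn):
    """Return (best_label, distances) for an N-way 1-shot episode.

    support: {class_label: support_image}, one image per class.
    distance_fn: callable(query, support_image) -> distance; a stand-in
                 for the trained Siamese model's prediction.
    """
    distances = {
        label: distance_fn(query, image) for label, image in support.items()
    }
    # The predicted class is the one with the smallest distance.
    best_label = min(distances, key=distances.get)
    return best_label, distances


# Toy example: numbers play the role of images, absolute difference
# plays the role of the model.
support = {"class_a": 1.0, "class_b": 5.0, "class_c": 9.0}
label, distances = predict_one_shot(5.4, support, lambda q, s: abs(q - s))
print(label)  # class_b
```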

These classes and images are not included in the train set, so the support and query sets are never exposed to the Siamese model during training. Nevertheless, the Siamese model can still measure the mapping distance for them, which is what lets us predict these unseen classes.

Few shot Learning

If K>1 then Few Shot Learning

  • Take 5( N ) class labels and 2( K ) images per class from the Support set

N way K shot

  • Using more shots helps improve accuracy. In this scenario, the two Zoroark support images give estimated distances of 0.03541 and 0.5547 for the unknown image. With mixed results like these, deciding the class of the unknown image from a single shot can get challenging.
  • There are certain constraints on the N-way K-shot learning; we’ll see what they are in the section below.

7. Limitations and practical notes

Accuracy with number of ways

With many classes in the support set, an unrelated class can produce a distance as low as, or lower than, the query image’s true class. If the true class gives 0.23 and an unrelated class gives 0.22, the prediction fails. This can happen, and I ran into it when I increased the class count, so be aware of it.

Accuracy with number of shots

As the few-shot (5 way 2 shot) example above showed, we can get mixed results within a class, so having more shots (example images) is advantageous. With a sort of majority voting over the shots of each class, it makes sense that accuracy increases as the number of shots rises.
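One possible way to combine several shots per class is sketched below. It is an assumption, not the article's method: each shot votes for its class when its distance falls below a threshold (0.5 here, an arbitrary choice), and ties are broken by the smaller mean distance. The Zoroark distances are the two values quoted earlier; the distances for `other_class` are made up for the example.

```python
from statistics import mean


def predict_few_shot(query, support, distance_fn, threshold=0.5):
    """Return (best_label, scores) for an N-way K-shot episode.

    support: {class_label: [support_image, ...]}, K images per class.
    scores:  {class_label: (votes, mean_distance)}.
    """
    scores = {}
    for label, images in support.items():
        dists = [distance_fn(query, image) for image in images]
        votes = sum(d < threshold for d in dists)  # shots that look similar
        scores[label] = (votes, mean(dists))
    # Most votes wins; ties go to the smaller mean distance.
    best_label = max(scores, key=lambda lb: (scores[lb][0], -scores[lb][1]))
    return best_label, scores


# Toy lookup table standing in for the trained model's outputs.
toy_distances = {"z1": 0.03541, "z2": 0.5547, "o1": 0.61, "o2": 0.72}
support = {"Zoroark": ["z1", "z2"], "other_class": ["o1", "o2"]}
label, scores = predict_few_shot("query", support, lambda q, s: toy_distances[s])
print(label, scores)
```

Averaging alone, or taking the minimum distance per class, are equally simple alternatives worth comparing on your own data.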

  • Since few-shot learning situations use less data, overfitting is more likely to occur, so experiment with regularization and the learning rate.

If you want to predict face images from your support and query sets, make sure your train set contains face images.
After all, the Siamese model here is merely a CNN (learned kernels) plus distance mapping (a similarity function); how can you expect it to perform well on facial images if it was trained on cars, bicycles, and other objects?

With this insight, hopefully you won’t have any trouble setting up an experiment in the few-shot learning area.

8. Keras Code:

Resources

Paper: http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf

Paper: https://arxiv.org/pdf/1909.02729.pdf

Paper: https://www.cs.cmu.edu/~rsalakhu/papers/oneshot1.pdf

Coursera’s deeplearning.ai Specialization by Andrew Ng