Dropout in Neural Networks

In this era of deep learning, almost every data scientist has used the dropout layer at some point while building neural networks. But why is dropout so common? How does the dropout layer work internally? What problem does it solve? Is there any alternative to dropout?

Figure 0: Indian Jharokhe, dropping out some light (Image by Author)

If you have similar questions about dropout layers, then you are in the right place. In this blog, you will discover the intricacies behind the famous dropout layer. After completing this blog, you will be comfortable answering different queries related to dropout, and if you are the innovative kind, you might even come up with a more advanced version of the dropout layer.

Let’s start… 🙂

OVERVIEW

This blog is divided into the following sections:

  1. Introduction: The problem it tries to solve
  2. What is a dropout?
  3. How does it solve the problem?
  4. Dropout Implementation
  5. Dropout during Inference
  6. How it was conceived
  7. TensorFlow Implementation
  8. Conclusion

INTRODUCTION

So before diving deep into its world, let’s address the first question. What is the problem that we are trying to solve?

Deep neural networks come in different architectures, sometimes shallow, sometimes very deep, all trying to generalise on the given dataset. But in this pursuit of learning many different features from the dataset, they sometimes learn the statistical noise in it. This improves the model's performance on the training dataset but makes it fail massively on new data points (the test dataset). This is the problem of overfitting. To tackle it, we have various regularisation techniques that penalise the weights of the network, but that alone isn't enough.

The best way to reduce overfitting, or to regularise a fixed-size model, is to average the predictions from all possible settings of the parameters and aggregate the final output. But this is far too computationally expensive and isn't feasible for real-time inference/prediction.

The other way is inspired by ensemble techniques (such as AdaBoost, XGBoost, and Random Forest), where we use multiple neural networks of different architectures. But this requires multiple models to be trained and stored, which becomes a huge challenge as the networks grow deeper.

So, we have a great solution known as the dropout layer.

Figure 1: Dropout applied to a Standard Neural Network (Image by Nitish)

What is a Dropout?

The term “dropout” refers to dropping out nodes (in the input and hidden layers) of a neural network (as seen in Figure 1). All the forward and backward connections of a dropped node are temporarily removed, thus creating a new network architecture out of the parent network. The nodes are dropped with a dropout probability of p.

Let’s try to understand this with an input x: {1, 2, 3, 4, 5} to a fully connected layer, and a dropout layer with drop probability p = 0.2 (or keep probability = 0.8). During the forward propagation (training), roughly 20% of the nodes of x would be dropped, i.e. x could become {1, 0, 3, 4, 5} or {1, 2, 0, 4, 5}, and so on. The same applies to the hidden layers.
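Here is a minimal NumPy sketch of that masking step (the variable names and random seed are my own, chosen only for illustration). The binary mask is resampled on every forward pass during training; how activations are rescaled at inference time is covered later in the blog.

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.array([1., 2., 3., 4., 5.])  # input to the fully connected layer
p_drop = 0.2                        # drop probability (keep probability = 0.8)

# Each unit is kept independently with probability 1 - p_drop
mask = rng.random(x.shape) > p_drop
x_dropped = x * mask                # dropped units become 0, e.g. [1., 0., 3., 4., 5.]

print(mask)        # e.g. [ True False  True  True  True]
print(x_dropped)   # e.g. [1. 0. 3. 4. 5.]
```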

For instance, if a hidden layer has 1000 neurons (nodes) and dropout is applied with drop probability = 0.5, then on average 500 neurons would be randomly dropped in every iteration (batch).

Generally, for the input layers, the keep probability, i.e. 1 - drop probability, is kept closer to 1, with 0.8 suggested as the best value by the authors. For the hidden layers, the greater the drop probability, the more sparse the model; a value of 0.5 is the most commonly recommended, which means dropping 50% of the nodes.
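As a quick preview of the TensorFlow implementation covered later, here is a hedged sketch of how those recommendations might look in Keras. The layer sizes and input shape are hypothetical; note that the `rate` argument of `tf.keras.layers.Dropout` is the drop probability, not the keep probability.

```python
import tensorflow as tf

# Hypothetical network: low drop rate (0.2) near the input,
# 0.5 between hidden layers, as suggested above.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dropout(0.2),                      # input: keep probability ≈ 0.8
    tf.keras.layers.Dense(1000, activation="relu"),
    tf.keras.layers.Dropout(0.5),                      # hidden: drop 50% of units
    tf.keras.layers.Dense(10, activation="softmax"),
])
```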

So how does dropout solve the problem of overfitting?

How does it solve the Overfitting problem?

In the overfitting problem, the model learns the statistical noise. To be precise, the main objective of training is to decrease the loss function, given all the units (neurons). So, in overfitting, a unit may change in a way that fixes up the mistakes of the other units. This leads to complex co-adaptations, which in turn lead to overfitting, because these co-adaptations fail to generalise to the unseen dataset.

Now, if we use dropout, it prevents these units from fixing up the mistakes of other units, thus preventing co-adaptation, because in every iteration the presence of any given unit is highly unreliable. So, by randomly dropping a few units (nodes), dropout forces each layer to take more or less responsibility for the input in a probabilistic way.

This ensures that the model generalises better and hence reduces the overfitting problem.
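The key point is that a fresh mask is sampled at every iteration, so no unit can rely on a specific other unit always being present. A tiny illustration of this (the unit count, drop probability, and seed are arbitrary, chosen just for the demo):

```python
import numpy as np

rng = np.random.default_rng(42)
n_units, p_drop = 10, 0.5

# A new mask per iteration means the set of active units keeps changing,
# so units cannot co-adapt to each other.
for step in range(3):
    mask = rng.random(n_units) > p_drop
    print(f"iteration {step}: active units = {np.flatnonzero(mask)}")
```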

Figure 2: (a) Hidden layer features without dropout; (b) Hidden layer features with dropout (Image by Nitish)

From Figure 2, we can easily see that the hidden layer with dropout learns more generalised features than the co-adapted ones in the layer without dropout. It is quite apparent that dropout breaks such inter-unit relations and pushes the network towards generalisation.