Dropout in Deep Learning

OVERFITTING

Deep neural networks (deep learning) are artificial neural networks with many layers between the inputs and the outputs (predictions). The figure below shows just two hidden layers, but a network can have many more, which increases its complexity. When the training dataset has very few examples, overfitting is likely: the network can accurately predict the samples in the training data but performs poorly and fails to generalize on the validation and test data.

As the model is trained over several epochs, it begins to learn the patterns in the dataset. When the model overfits, the accuracy on the training data is very high while the accuracy on the validation data is very low, because the model has memorized the patterns in the training data but cannot generalize and make good predictions on data it has not seen before. This defeats the whole purpose of training the model, which is to make predictions on future, unseen data. What we want instead is to reduce overfitting so that the training and validation accuracies are both high and close to each other, which suggests the model will make better predictions on data it has not seen before.

Figure: Overfitting in deep learning

OVERFITTING STRATEGIES

There are several things we can do to reduce overfitting while training our neural network models:

  • We can reduce the complexity of our model (that is, reduce the number of hidden layers),
  • We could augment the data (that is, increase the number of samples in the dataset),
  • We could also apply regularization, that is, add a penalty term to the loss function (L1 or L2 regularization); a minimal sketch of this is shown after the list.
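
As a rough illustration of the regularization point, in PyTorch an L2 penalty can be applied either through the optimizer's weight_decay argument or by adding the penalty to the loss by hand. The model, data shapes, and the 1e-4 penalty strength below are placeholder assumptions, not values from this article:

```python
import torch
import torch.nn as nn

# A small placeholder model; any network would do here.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
criterion = nn.CrossEntropyLoss()

# Option 1: L2 regularization via the optimizer's built-in weight decay.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# Option 2: add the L2 penalty to the loss explicitly.
def loss_with_l2(outputs, targets, l2_lambda=1e-4):
    penalty = sum(p.pow(2).sum() for p in model.parameters())
    return criterion(outputs, targets) + l2_lambda * penalty
```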

These are all feasible ways to reduce overfitting in our deep learning models, but they do not solve the problem completely. As we train the model iteratively, all the weights are learned together, and some neurons adapt better and make better predictions than others, so those neurons end up doing most of the work as the iterations progress. As the network is trained over several epochs, the stronger neurons learn more while the weaker ones are ignored, until only a portion of the neurons is effectively being trained and the weaker ones barely take part. We therefore need a way to handle this, and that is where the concept of Dropout comes in.

DROPOUT

Dropout is a technique that drops neurons from the neural network, or ‘ignores’ them, during training; in other words, different neurons are temporarily removed from the network. During training, dropout changes the goal from learning all the weights in the network to learning only a fraction of them on each pass. From the figure above, it can be seen that during standard training all neurons are involved, while with dropout only a few selected neurons are involved and the rest are ‘turned off’. So after every iteration, a different set of neurons is activated, which prevents some neurons from dominating the process. This helps reduce overfitting, and it allows deeper and bigger network architectures that can make good predictions on data the network has not seen before.
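
To make this concrete, here is a minimal sketch of how dropout is typically inserted between layers in a PyTorch model; the layer sizes and the dropout probability of 0.5 are placeholder assumptions, not values from this article:

```python
import torch.nn as nn

# A small fully connected network with a dropout layer after the hidden layer.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # each hidden activation is zeroed with probability 0.5 during training
    nn.Linear(256, 10),
)

model.train()  # dropout active: a different random subset of neurons is dropped on each forward pass
model.eval()   # dropout disabled: all neurons participate at inference time
```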

INTUITION BEHIND DROPOUT

The concept behind the dropout training process is straightforward: to make the network different on each training iteration, we switch off some of its neurons during training. The way this is done can be seen in the figure below.

The training phase of the standard network without dropout can be represented mathematically as:
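
Following the notation of the original dropout paper (Srivastava et al., 2014), roughly:

z_i^{(l+1)} = w_i^{(l+1)} \cdot y^{(l)} + b_i^{(l+1)}

y_i^{(l+1)} = f(z_i^{(l+1)})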

The dot product of the weights w^{(l+1)} and the input y^{(l)} is added to a bias term b^{(l+1)} and passed through an activation function f, which introduces non-linearity, to give the output y^{(l+1)}, the prediction. In other words, all of the neurons are involved in making the decision.

During dropout, the training is updated to become: 
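
Following the same notation, with r^{(l)} a vector of independent Bernoulli random variables and * denoting element-wise multiplication, this is roughly:

r_j^{(l)} \sim \mathrm{Bernoulli}(p)

\tilde{y}^{(l)} = r^{(l)} * y^{(l)}

z_i^{(l+1)} = w_i^{(l+1)} \cdot \tilde{y}^{(l)} + b_i^{(l+1)}

y_i^{(l+1)} = f(z_i^{(l+1)})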

The training is very similar to that of the standard network, but a new term r is introduced for each neuron, which keeps the neuron active or turns it off by assigning a 1 (the neuron participates in the training) or a 0 (the neuron does not participate, i.e. it is turned off); the training process then continues as before. This way, overfitting is reduced and our model can make accurate predictions on real-world data (data not seen by the model).
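
As a rough from-scratch sketch of the same idea in plain NumPy (the keep probability p = 0.8 and the ReLU activation are illustrative assumptions; the scaling by p at test time follows the original dropout paper):

```python
import numpy as np

def dropout_forward(y, w, b, p=0.8, train=True, rng=np.random.default_rng()):
    """One layer with dropout: y is the previous layer's output, w and b the next layer's parameters."""
    if train:
        r = rng.binomial(1, p, size=y.shape)   # r_j ~ Bernoulli(p): 1 keeps the neuron, 0 turns it off
        y_tilde = r * y                        # temporarily drop the switched-off neurons
    else:
        y_tilde = p * y                        # at test time every neuron is used, scaled by p
    z = w @ y_tilde + b                        # weighted sum plus bias
    return np.maximum(z, 0.0)                  # activation f, here ReLU

# Example: a 256-unit hidden output feeding a 10-unit layer.
y_prev = np.random.rand(256)
w_next = np.random.randn(10, 256) * 0.01
b_next = np.zeros(10)
y_next = dropout_forward(y_prev, w_next, b_next, train=True)
```

Most modern frameworks implement the ‘inverted’ variant, which instead divides the kept activations by p during training so that no scaling is needed at test time.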