How Does Back-Propagation Work in Neural Networks?

Demonstrating how back-propagation works in Neural Networks, using an example

Neural networks learn through iterative tuning of parameters (weights and biases) during the training stage. At the start, the weights are initialized with randomly generated values, and the biases are set to zero. This is followed by a forward pass of the data through the network to get the model output. Lastly, back-propagation is conducted. The model training process typically entails several iterations of a forward pass, back-propagation, and a parameter update.

This article will focus on how back-propagation updates the parameters after a forward pass (we already covered forward propagation in the previous article). We will work on a simple yet detailed example of back-propagation. Before we proceed, let’s see the data and the architecture we will use in this post.

Data and the Architecture

The dataset used in this article contains three features, and the target class has only two values — 1 for pass and 0 for fail. The objective is to classify a data point into either of the two categories — a case of binary classification. To make the example easily understandable, we will use only one training example in this post.

Figure 1: The data and the NN architecture we will use. Our training example is highlighted with its corresponding actual value of 1. This 3–4–1 NN is a densely connected network: each node in a layer is connected to all the neurons in the previous layer (the input layer has no previous layer). However, we have eliminated some connections to make the figure less cluttered. A forward pass yields an output of 0.521 (Source: Author).

Understand: A forward pass allows the information to flow in one direction, from the input to the output layer, whereas back-propagation does the reverse: it passes the error from the output layer backward while updating the parameters (weights and biases).

Definition: Back-propagation is a method for supervised learning used by NN to update parameters to make the network’s predictions more accurate. The parameter optimization process is achieved using an optimization algorithm called gradient descent (this concept will be very clear as you read along).

A forward pass yields a prediction (yhat) of the target (y) at a loss which is captured by a cost function (E) defined as:

Equation 1: Cost function

where m is the number of training examples, and L is the error/loss incurred when the model predicts yhat instead of the actual value y. The objective is to minimize the cost E. This is achieved by differentiating E with respect to (wrt) the parameters (weights and biases) and adjusting the parameters in the opposite direction of the gradient (that is why the optimization algorithm is referred to as gradient descent).

In this post, we consider back-propagation on one training example (m=1). With this consideration, E reduces to

Equation 2: Cost function for one training example.
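Written out, and assuming the cost is the usual average of the per-example losses (which matches the description of m and L above), Equations 1 and 2 would take the following form. The index i over training examples is notation introduced here for illustration.

```latex
\begin{align}
E &= \frac{1}{m} \sum_{i=1}^{m} L\left(y_i, \hat{y}_i\right) \\
E &= L\left(y, \hat{y}\right) \qquad \text{(for } m = 1\text{)}
\end{align}
```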

Choosing the Loss Function, L

The loss function, L, is defined based on the task at hand. For classification problems, cross-entropy (also known as log loss) and hinge loss are suitable loss functions, whereas Mean Squared Error (MSE) and Mean Absolute Error (MAE) are appropriate loss functions for regression tasks.

Binary cross-entropy loss is a function suitable for our binary classification task — the data has two classes, 0 or 1. A binary cross-entropy loss function can be applied to our forward-pass example in Figure 1, as shown below

Equation 3: Binary cross-entropy loss applied to our example.

t=1 is the truth label, yhat=0.521 is the output of the model, and ln is the natural log (log to base e).
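As a quick check of the number this loss produces, here is a minimal sketch that evaluates binary cross-entropy for the example's values (t=1, yhat=0.521). The function name binary_cross_entropy is chosen here for illustration, and the standard formula with the natural logarithm is assumed.

```python
import math

def binary_cross_entropy(t, yhat):
    """Binary cross-entropy loss for one example, using the natural log."""
    return -(t * math.log(yhat) + (1 - t) * math.log(1 - yhat))

t = 1         # truth label from the example
yhat = 0.521  # forward-pass output reported in Figure 1

loss = binary_cross_entropy(t, yhat)
print(round(loss, 3))  # 0.652
```

Because t=1, the second term vanishes and the loss is simply -ln(0.521).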

You can read more about the cross-entropy loss function in the article “Cross-Entropy Loss Function” on towardsdatascience.com.

Since we now understand the NN architecture and the cost function we will use, we can proceed directly to cover the steps for backward propagation.

The Data and the Parameters

The table below shows the data on all the layers of the 3–4–1 NN. At the 3-neuron input layer, the values shown are from the data we provide to the model for training. The second/hidden layer contains the weights (w) and biases (b) we wish to update and the output (f) at each of the 4 neurons during the forward pass. The output layer contains its own parameters (w and b) and the output of the model (yhat); this value is the model prediction at each iteration of model training. After a single forward pass, yhat=0.521.

Figure 2 The data and parameter initialization (Source: Author)

A. Update Equations and the Loss Function

Important: Recall from the previous section that E(θ)=L(y, yhat), where θ stands for our parameters (weights and biases). That is to say, E is a function of y and yhat, and yhat=g(wx+b), so yhat is a function of w and b. Here x is the input data and g is the activation function. Effectively, E is a function of w and b and can therefore be differentiated with respect to these parameters.

The parameters at each layer are updated with the following equations

Equation 4: Update equations

where t is the learning step, ϵ is the learning rate — a hyper-parameter set by the user. It determines the rate at which the weights and biases are updated. We will use ϵ=0.5 (arbitrary choice).

From Equation 4, the update amounts become

Equation 5: Update amounts
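Assuming the standard gradient-descent form, which is what the description of the learning step t and the learning rate ϵ suggests, the update equations and the update amounts can be written as below. The superscript (t) marking the learning step is a notational choice made here.

```latex
\begin{align}
w^{(t+1)} &= w^{(t)} - \epsilon \, \frac{\partial E}{\partial w^{(t)}}, &
b^{(t+1)} &= b^{(t)} - \epsilon \, \frac{\partial E}{\partial b^{(t)}} \\
\Delta w &= -\epsilon \, \frac{\partial E}{\partial w}, &
\Delta b &= -\epsilon \, \frac{\partial E}{\partial b}
\end{align}
```

The minus sign is what moves each parameter against its gradient.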

As said earlier, since we are dealing with binary classification, we will use the binary cross-entropy loss function defined as:

Equation 6: Binary cross entropy loss

We will use Sigmoid activation across all layers

Equation 7: Sigmoid function.

where z=wx+b is the weighted input into the neuron plus bias.

B. Updating Parameters on the Output-Hidden Layer

Unlike the forward pass, back prop works backward from the output layer to layer 1. We need to compute derivatives/gradients with respect to parameters for all layers. To do that, we have to understand the chain rule of differentiation.

Chain rule of differentiation

Let’s work on updating w²₁₁ and b²₁ as examples. We will follow the routes shown below.

Figure 3: Back-propagation information flow (Source: Author).

B1. Calculating Derivatives for the Weights

By chain rule of differentiation, we have

Remember: when evaluating the above derivative with respect to w²₁₁, all the other parameters (w²₁₂, w²₁₃, w²₁₄, and b²₁) are treated as constants. The derivative of a constant is 0, which is why some terms were eliminated in the above derivative.
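For reference, the chain for w²₁₁ can be sketched as follows. The notation is an assumption made for illustration: z²₁ is the weighted input to the output neuron, f₁ to f₄ are the hidden-layer outputs, and g is the sigmoid activation.

```latex
\begin{align}
z^{2}_{1} &= w^{2}_{11} f_{1} + w^{2}_{12} f_{2} + w^{2}_{13} f_{3} + w^{2}_{14} f_{4} + b^{2}_{1},
\qquad \hat{y} = g\left(z^{2}_{1}\right) \\
\frac{\partial E}{\partial w^{2}_{11}}
  &= \frac{\partial E}{\partial \hat{y}} \cdot
     \frac{\partial \hat{y}}{\partial z^{2}_{1}} \cdot
     \frac{\partial z^{2}_{1}}{\partial w^{2}_{11}},
\qquad \frac{\partial z^{2}_{1}}{\partial w^{2}_{11}} = f_{1}
\end{align}
```

Only the term containing w²₁₁ survives the last derivative, which leaves f₁.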

Next is the derivative of Sigmoid function (refer to this article)

Next, the derivative of cross-entropy loss function (reference material)

The derivatives with respect to the other three weights on the output layer are as follows (you can confirm this)
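The pieces of this chain can be evaluated numerically. The sketch below uses the example's t=1 and yhat=0.521; the hidden-layer outputs in f are hypothetical placeholders, since the article's actual values live in Figure 2. It assumes the standard derivatives of binary cross-entropy and of the sigmoid.

```python
t = 1          # truth label
yhat = 0.521   # forward-pass output reported in Figure 1

# Local derivatives at the output neuron.
dE_dyhat = -(t / yhat) + (1 - t) / (1 - yhat)  # derivative of binary cross-entropy wrt yhat
dyhat_dz = yhat * (1 - yhat)                   # derivative of the sigmoid wrt its input z
delta_out = dE_dyhat * dyhat_dz                # simplifies algebraically to yhat - t

# Hypothetical hidden-layer outputs f1..f4 (NOT the article's values).
f = [0.6, 0.7, 0.4, 0.5]

# dz/dw2_1j = f_j, so each output-layer weight gradient is delta_out * f_j.
grad_w2 = [delta_out * fj for fj in f]

print(round(delta_out, 3))  # -0.479
```

The product of the first two factors collapses to yhat - t, which is why sigmoid and binary cross-entropy pair so conveniently.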

B2. Computing Derivatives for the bias

We need to compute

From the previous section, we have already computed the derivative of E with respect to yhat and the derivative of yhat with respect to z. What remains is

We used the same argument as before: all other variables except b²₁ are treated as constants, so they reduce to 0 when differentiated.
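Under the same assumed notation (z²₁ for the weighted input to the output neuron), the bias gradient works out as below. The simplification to yhat − t relies on combining the sigmoid derivative with the binary cross-entropy derivative.

```latex
\frac{\partial z^{2}_{1}}{\partial b^{2}_{1}} = 1
\quad\Longrightarrow\quad
\frac{\partial E}{\partial b^{2}_{1}}
  = \frac{\partial E}{\partial \hat{y}} \cdot
    \frac{\partial \hat{y}}{\partial z^{2}_{1}} \cdot 1
  = \hat{y} - t = 0.521 - 1 = -0.479
```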

So far, we have computed the gradients with respect to all the parameters at the hidden-output layers.

Figure 4: Gradients at hidden-output layers (Source: Author).

At this point, we are ready to update all the weights and biases at the hidden-output layers.

B3. Updating Parameters at the Output-Hidden Layers

Please compute the rest in the same way and confirm them in the table below
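To make the update step concrete, here is a small sketch of one gradient-descent step with ϵ=0.5. The bias update assumes the output bias starts at zero, as the initialization described in the introduction suggests; the weight w2_11_old and the hidden output f1 are hypothetical values, not those in the article's tables. The helper name updated is also made up for this sketch.

```python
EPSILON = 0.5  # learning rate chosen in the article

def updated(param, grad, lr=EPSILON):
    """One gradient-descent step: move the parameter against its gradient."""
    return param - lr * grad

# Output-layer bias: dE/db2_1 = yhat - t for sigmoid + binary cross-entropy.
grad_b2 = 0.521 - 1
b2_old = 0.0                       # assumption: biases are initialized to zero
b2_new = updated(b2_old, grad_b2)

# Output-layer weight w2_11: gradient is (yhat - t) * f1.
w2_11_old = 0.3                    # hypothetical initial weight
f1 = 0.6                           # hypothetical hidden-layer output
w2_11_new = updated(w2_11_old, grad_b2 * f1)

print(round(b2_new, 4))  # 0.2395
```

Because the gradients are negative here, both parameters increase, which pushes the prediction toward the true value of 1.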

Figure 5: Updated parameters at hidden-output layers (Source: Author)

C. Updating Parameters at Hidden-Input Layer

As before, we need derivatives of E with respect to all the weights and biases at these layers. We have a total of 4x3=12 weights and 4 biases to update. As an example, let’s work on w¹₄₃ and b¹₂. See the routes in the figure below.

Figure 6: Back-propagation information to the hidden-input layers (Source: Author).

C1. Gradients of weights

For weights, we need to compute the derivative (follow the route in Figure 6 if the following equation is intimidating)

As we go through each of the above derivatives, note the following important points:

At the model output (when finding the derivative of E with respect to yhat), we are actually differentiating the loss function.
At the layer outputs (f) (where we differentiate wrt z), we find the activation function’s derivative.
In the above two cases, differentiating with respect to weights or biases of a given neuron yields the same results.
The weighted inputs (z) are differentiated with respect to parameters (w or b) that we wish to update. In this case, all parameters are held constant except the parameter of interest.
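The route for w¹₄₃ can be sketched as a five-factor chain. The notation is assumed for illustration: z¹₄ is the weighted input to hidden neuron 4, f₄ is its output, x₃ is the third input feature, and z²₁ is the weighted input to the output neuron.

```latex
\frac{\partial E}{\partial w^{1}_{43}}
  = \frac{\partial E}{\partial \hat{y}} \cdot
    \frac{\partial \hat{y}}{\partial z^{2}_{1}} \cdot
    \frac{\partial z^{2}_{1}}{\partial f_{4}} \cdot
    \frac{\partial f_{4}}{\partial z^{1}_{4}} \cdot
    \frac{\partial z^{1}_{4}}{\partial w^{1}_{43}}
  = \left(\hat{y} - t\right) \, w^{2}_{14} \, f_{4}\left(1 - f_{4}\right) \, x_{3}
```

Each factor corresponds to one of the points in the list above: the loss derivative, two activation derivatives, and two derivatives of weighted inputs.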

Doing the same process as in Section B, we get:

weighted inputs for layer 1

derivative of Sigmoid activation function applied to the first layer

Weighted inputs to the output layer. f-values are the outputs of the hidden layer.

Activation function applied to the output of the last layer.

Derivative of binary cross-entropy loss function wrt yhat.

Then, we can put all of these together as

C2. Gradients of bias

Using the same concepts as before, check that, for b¹₂, we have

All gradient values for the hidden-input are tabulated below
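One way to gain confidence in hand-derived gradients like these is to compare them with a finite-difference estimate. The sketch below does that for w¹₄₃ on a 3–4–1 sigmoid network. All inputs and parameters are hypothetical (they are not the values in the article's tables), and the array layout, with W1[j, i] holding the weight from input i+1 to hidden neuron j+1, is an assumption of this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W1, b1, W2, b2, x, t):
    """Binary cross-entropy of a 3-4-1 sigmoid network on one example."""
    f = sigmoid(W1 @ x + b1)
    yhat = sigmoid(W2 @ f + b2)
    return -(t * np.log(yhat) + (1 - t) * np.log(1 - yhat))

# Hypothetical data and parameters (NOT the article's values).
x = np.array([0.5, 0.1, 0.9])
t = 1.0
W1 = np.array([[0.2, -0.1, 0.4],
               [0.3, 0.5, -0.2],
               [-0.4, 0.1, 0.3],
               [0.1, 0.2, 0.6]])
b1 = np.zeros(4)
W2 = np.array([0.3, -0.2, 0.5, 0.1])
b2 = 0.0

# Forward pass.
f = sigmoid(W1 @ x + b1)
yhat = sigmoid(W2 @ f + b2)

# Backward pass via the chain rule.
delta_out = yhat - t                         # dE/dz at the output neuron
delta_hidden = delta_out * W2 * f * (1 - f)  # dE/dz at each hidden neuron
grad_W1 = np.outer(delta_hidden, x)          # dE/dw1_ji = delta_hidden[j] * x[i]
grad_b1 = delta_hidden                       # dz/db = 1, so the bias gradient is the delta

# Central finite-difference estimate for w1_43 (index [3, 2] when counting from 0).
h = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[3, 2] += h
W1_minus[3, 2] -= h
numeric = (loss(W1_plus, b1, W2, b2, x, t) - loss(W1_minus, b1, W2, b2, x, t)) / (2 * h)

print(np.isclose(numeric, grad_W1[3, 2]))
```

If the chain rule was applied correctly, the analytic and numeric values agree to several decimal places.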

Figure 7: Gradients at hidden-input layers (Source: Author).

At this point, we are ready to compute the updated parameters at the hidden-input.

C3. Updating Parameters at the Hidden-Input

Let’s go back to the update equations and work on updating w¹₁₃ and b¹₃

So, how many parameters do we have to update?

We have 4x3=12 weights and 4x1=4 biases at the hidden-input layers, 4x1=4 weights, and 1 bias at the output-hidden layers. That is a total of 21 parameters. They are called trainable parameters.
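The same count can be reproduced for any fully connected architecture from its layer sizes. This is a small illustrative sketch; the variable names are chosen here.

```python
layer_sizes = [3, 4, 1]  # input, hidden, output

# One weight per connection between consecutive layers.
weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
# One bias per neuron in every layer except the input layer.
biases = sum(layer_sizes[1:])

print(weights + biases)  # 21
```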

All the updated parameters for hidden-input layers are shown below

Figure 8: Updated parameters for hidden-input layers (Source: Author).

We now have the updated parameters for all the layers in Figure 8 and Figure 5 using back-propagation of error. Running a forward pass with these updated parameters yields a model prediction, yhat, of 0.648, up from 0.521. This means that the model is learning: the prediction moves closer to the true value of 1 after two iterations of training. Further iterations yield 0.758, 0.836, 0.881, 0.908, 0.925, … (In the next article, we will implement back-propagation and forward pass for many training examples and iterations, and you will get to see this).
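The repeated cycle of forward pass, back-propagation, and update can be sketched in a few lines. The input and the initial parameters below are hypothetical, so the printed predictions will not match the article's 0.521, 0.648, … sequence; the point is only to show the prediction climbing toward the target of 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical example and initial parameters (NOT the article's values).
x = np.array([0.5, 0.1, 0.9])
t = 1.0
eps = 0.5                                    # learning rate used in the article
W1 = np.linspace(0.05, 0.6, 12).reshape(4, 3)
b1 = np.zeros(4)
W2 = np.array([0.1, 0.2, 0.3, 0.4])
b2 = 0.0

history = []
for _ in range(6):
    # Forward pass.
    f = sigmoid(W1 @ x + b1)
    yhat = sigmoid(W2 @ f + b2)
    history.append(float(yhat))

    # Back-propagation (gradients use the parameters from before the update).
    delta_out = yhat - t
    delta_hidden = delta_out * W2 * f * (1 - f)

    # Gradient-descent update of all 21 parameters.
    W2 = W2 - eps * delta_out * f
    b2 = b2 - eps * delta_out
    W1 = W1 - eps * np.outer(delta_hidden, x)
    b1 = b1 - eps * delta_hidden

print([round(p, 3) for p in history])
```

Each entry in history is the prediction at the start of an iteration, so the list shows how yhat changes from one forward pass to the next.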

Definitions

Epoch: One epoch is when the entire dataset is passed through the network once. In our single-example case, this comprises one forward pass and one back-propagation.
Batch size: The number of training examples passed through the network simultaneously. In our case, we have one training example. In cases where we have a large dataset, the data can be passed through the network in batches.
Number of iterations: One iteration equals one pass over a number of training examples equal to the batch size. One pass is a forward pass and a back-propagation.

Example:
If we have 2000 training examples and set a batch size of 20, then it takes 100 iterations to complete 1 epoch.
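The arithmetic in this example generalizes to a one-line helper. The function name iterations_per_epoch is made up for this sketch, and rounding up is an assumption for the case where the batch size does not divide the dataset evenly (the last batch is then smaller).

```python
import math

def iterations_per_epoch(num_examples, batch_size):
    """Number of batches needed to pass the whole dataset through the network once."""
    return math.ceil(num_examples / batch_size)

print(iterations_per_epoch(2000, 20))  # 100
```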

Conclusion

In this article, we have discussed back-propagation by working through an example. We have seen how the chain rule of differentiation is used to get the gradients of different equations: the loss function, the activation function, the weighting equations, and the layer output equations. We have also discussed how the derivatives of the loss function with respect to the parameters can be used to update the parameters at each layer. In the next article, we implement the concepts learnt here in Python.