How Does Back-Propagation Work in Neural Networks?
Demonstrating how back-propagation works in Neural Networks, using an example
Neural Networks learn through iterative tuning of parameters (weights and biases) during the training stage. At the start, the weights are initialized with randomly generated values and the biases are set to zero. This is followed by a forward pass of the data through the network to get the model output. Lastly, back-propagation is conducted. The model training process typically entails several iterations of a forward pass, back-propagation, and a parameter update.
This article will focus on how back-propagation updates the parameters after a forward pass (we already covered forward propagation in the previous article). We will work on a simple yet detailed example of back-propagation. Before we proceed, let’s see the data and the architecture we will use in this post.
Data and the Architecture
The dataset used in this article contains three features, and the target class has only two values — 1 for pass and 0 for fail. The objective is to classify a data point into either of the two categories — a case of binary classification. To make the example easily understandable, we will use only one training example in this post.
Figure 1: The data and the NN architecture we will use. Our training example is highlighted with its corresponding actual value of 1. This 3–4–1 NN is a densely connected network: each node in a layer is connected to all the neurons in the previous layer (the input layer has no previous layer). However, we have eliminated some connections to make the Figure less cluttered. A forward pass yields an output of 0.521 (Source: Author).
Understand: A forward pass allows the information to flow in one direction — from input to the output layer, whereas the back-propagation does the reverse — allowing data to flow from output backward while updating the parameters (weights and biases).
Definition: Back-propagation is a method for supervised learning used by NN to update parameters to make the network’s predictions more accurate. The parameter optimization process is achieved using an optimization algorithm called gradient descent (this concept will be very clear as you read along).
A forward pass yields a prediction (yhat) of the target (y) at a loss which is captured by a cost function (E) defined as:
Equation 1: Cost function
where m is the number of training examples, and L is the error/loss incurred when the model predicts yhat instead of the actual value y. The objective is to minimize the cost E. This is achieved by differentiating E with respect to (wrt) the parameters (weights and biases) and adjusting the parameters in the opposite direction of the gradient (that is why the optimization algorithm is referred to as gradient descent).
In this post, we consider back-propagation on 1 training example (m=1). With this consideration, E reduces to
Equation 2: Cost function for one training example.
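Written out from the definitions in the text, the two cost functions plausibly take the following form. The superscript (i), indexing the training examples, is notation assumed here for illustration.

```latex
\begin{align}
E &= \frac{1}{m}\sum_{i=1}^{m} L\big(y^{(i)}, \hat{y}^{(i)}\big)
  && \text{(cost over } m \text{ training examples)} \\
E &= L(y, \hat{y})
  && \text{(special case } m = 1\text{)}
\end{align}
```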
Choosing the Loss Function, L
The loss function, L, is defined based on the task at hand. For classification problems, cross-entropy (also known as log loss) and hinge loss are suitable loss functions, whereas Mean Squared Error (MSE) and Mean Absolute Error (MAE) are appropriate loss functions for regression tasks.
Binary cross-entropy loss is a function suitable for our binary classification task — the data has two classes, 0 or 1. A binary cross-entropy loss function can be applied to our forward-pass example in Figure 1, as shown below
Equation 3: Binary cross-entropy loss applied to our example.
t=1 is the truth label, yhat=0.521 is the output of the model, and ln is the natural log (log to base e).
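As a quick check of the loss value, the binary cross-entropy for this single example can be evaluated in a few lines of Python. This is a minimal sketch; the function name `binary_cross_entropy` is chosen here for illustration and does not come from the article.

```python
import math

def binary_cross_entropy(t: float, yhat: float) -> float:
    """Binary cross-entropy for one example, using the natural log."""
    return -(t * math.log(yhat) + (1 - t) * math.log(1 - yhat))

# Values from the forward pass in Figure 1: truth label 1, prediction 0.521.
loss = binary_cross_entropy(t=1, yhat=0.521)
print(round(loss, 3))  # 0.652
```

Because t=1, the second term vanishes and the loss is simply -ln(0.521).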
You can read more about the cross-entropy loss function in the article "Cross-Entropy Loss Function" on towardsdatascience.com.
Since we now understand the NN architecture and the cost function we will use, we can proceed directly to cover the steps for backward propagation.
The Data and the Parameters
The table below shows the data on all the layers of the 3–4–1 NN. At the 3-neuron input, the values shown are from the data we provide to the model for training. The second/hidden layer contains the weights (w) and biases (b) we wish to update and the output (f) at each of the 4 neurons during the forward pass. The output layer contains its parameters (w and b) and the output of the model (yhat); this value is the model prediction at each iteration of model training. After a single forward pass, yhat=0.521.
Figure 2: The data and parameter initialization (Source: Author)
A. Update Equations and the Loss Function
Important: Recall from the previous section that E(θ)=L(y, yhat), where θ denotes our parameters, the weights and biases. That is to say, E is a function of y and yhat, and since yhat=g(wx+b), yhat is a function of w and b. Here x is the input data and g is the activation function. Effectively, E is a function of w and b and can therefore be differentiated with respect to these parameters.
The parameters at each layer are updated with the following equations
Equation 4: Update equations
where t is the learning step, ϵ is the learning rate — a hyper-parameter set by the user. It determines the rate at which the weights and biases are updated. We will use ϵ=0.5 (arbitrary choice).
From Equations 4, the update amounts become
Equation 5: Update amounts
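The update rule described above (new parameter = old parameter minus the learning rate times the gradient) can be sketched as a small helper. The function name `gradient_step` and the example numbers are hypothetical; only the learning rate of 0.5 comes from the text.

```python
def gradient_step(param: float, grad: float, lr: float = 0.5) -> float:
    """Return the parameter after one gradient-descent update.

    Implements param(t+1) = param(t) - lr * dE/dparam.
    """
    return param - lr * grad

# Hypothetical numbers, chosen only to illustrate the sign convention:
# a negative gradient moves the parameter upward.
w_new = gradient_step(param=0.2, grad=-0.4, lr=0.5)
print(w_new)
```

The same helper applies unchanged to weights and biases, since both are updated by the same rule.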
As said earlier, since we are dealing with binary classification, we will use the binary cross-entropy loss function defined as:
Equation 6: Binary cross entropy loss
We will use Sigmoid activation across all layers
Equation 7: Sigmoid function.
where z=wx+b is the weighted input into the neuron plus bias.
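A minimal Python sketch of the Sigmoid activation and of a single neuron's output follows. The helper name `neuron_output` is an assumption made for this illustration.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(weights, inputs, bias):
    """Weighted input z = w.x + b followed by the sigmoid activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

print(sigmoid(0.0))  # 0.5
```

A weighted input of zero sits exactly at the midpoint of the Sigmoid curve, which is why the output there is 0.5.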
B. Updating Parameters on the Output-Hidden Layer
Unlike the forward pass, back prop works backward from the output layer to layer 1. We need to compute derivatives/gradients with respect to parameters for all layers. To do that, we have to understand the chain rule of differentiation.
Chain rule of differentiation
Let’s work on updating w²₁₁ and b²₁ as examples. We will follow the routes shown below.
Figure 3: Back-propagation information flow (Source: Author).
B1. Calculating Derivatives for the Weights
By chain rule of differentiation, we have
Remember: when evaluating the above derivatives with respect to w²₁₁, all the other parameters, that is, w²₁₂, w²₁₃, w²₁₄, and b²₁, are treated as constants. The derivative of a constant is 0, which is why some terms were eliminated in the above derivative.
Next is the derivative of Sigmoid function (refer to this article)
Next, the derivative of cross-entropy loss function (reference material)
The derivatives with respect to the other three weights on the output layer are as follows (you can confirm this)
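The chain-rule product for the output-layer weights can be written out as follows. This is a sketch under assumed notation: z²₁ denotes the weighted input of the output neuron, f_j the output of hidden neuron j, and t the truth label.

```latex
\begin{align}
\frac{\partial E}{\partial w^{2}_{1j}}
  &= \frac{\partial E}{\partial \hat{y}}\cdot
     \frac{\partial \hat{y}}{\partial z^{2}_{1}}\cdot
     \frac{\partial z^{2}_{1}}{\partial w^{2}_{1j}},
     \qquad j = 1,\dots,4 \\[4pt]
\frac{\partial E}{\partial \hat{y}}
  &= -\frac{t}{\hat{y}} + \frac{1-t}{1-\hat{y}}
   = \frac{\hat{y}-t}{\hat{y}\,(1-\hat{y})} \\[4pt]
\frac{\partial \hat{y}}{\partial z^{2}_{1}}
  &= \hat{y}\,(1-\hat{y}),
  \qquad
\frac{\partial z^{2}_{1}}{\partial w^{2}_{1j}} = f_j \\[4pt]
\Rightarrow\quad
\frac{\partial E}{\partial w^{2}_{1j}}
  &= (\hat{y}-t)\,f_j
\end{align}
```

With Sigmoid output and binary cross-entropy loss, the factor ŷ(1−ŷ) cancels, so each weight gradient is the prediction error times the hidden output feeding that weight.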
B2. Computing Derivatives for the bias
We need to compute
From the previous sections, we have already computed the first two factors, ∂E/∂yhat and ∂yhat/∂z; what remains is the derivative of the weighted input z with respect to b²₁:
We used the same argument as before: all variables except b²₁ are treated as constants, so their derivatives reduce to 0.
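The bias gradient can be checked numerically. Since the derivative of the weighted input with respect to the bias is 1, the gradient reduces to yhat − t. The sketch below compares that value with a finite-difference estimate; recovering z from yhat=0.521 and the helper name `loss_from_z` are assumptions made for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_from_z(z, t=1.0):
    """Binary cross-entropy of sigmoid(z) against the label t."""
    yhat = sigmoid(z)
    return -(t * math.log(yhat) + (1 - t) * math.log(1 - yhat))

# Recover the output neuron's weighted input from yhat = 0.521 (the logit).
yhat, t = 0.521, 1.0
z = math.log(yhat / (1 - yhat))

analytic = yhat - t   # dE/db = (yhat - t) * 1
h = 1e-6              # shifting the bias by h shifts z by the same amount
numeric = (loss_from_z(z + h, t) - loss_from_z(z - h, t)) / (2 * h)

print(round(analytic, 3), round(numeric, 3))  # -0.479 -0.479
```

The negative sign means that increasing the bias lowers the loss, so gradient descent will push the bias upward.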
So far, we have computed the gradients with respect to all the parameters at the output-hidden layers.
Figure 4: Gradients at output-hidden layers (Source: Author).
At this point we are ready to update all the weights and biases at the output-hidden layers.
B3. Updating Parameters at the Output-Hidden Layers
Please compute the rest in the same way and confirm them in the table below
Figure 5: Updated parameters at hidden-output layers (Source: Author)
C. Updating Parameters at Hidden-Input Layer
As before, we need the derivatives of E with respect to all the weights and biases at these layers. We have a total of 4x3=12 weights and 4 biases to update. As an example, let's work on w¹₄₃ and b¹₂. See the routes in the Figure below.
Figure 6: Back-propagation information to the hidden-input layers (Source: Author).
C1. Gradients of weights
For weights, we need to compute the derivative (follow the route in Figure 6 if the following equation is intimidating)
As we go through each of the above derivatives, note the following important points:
- At the model output (when finding derivative of E with respect to the yhat), we are actually differentiating the loss function.
- At the layer outputs (f) (where we differentiate wrt z), we find the activation function’s derivative.
- In the above two cases, differentiating with respect to weights or biases of a given neuron yields the same results.
- The weighted inputs (z) are differentiated with respect to parameters (w or b) that we wish to update. In this case, all parameters are held constant except the parameter of interest.
Doing the same process as in Section B, we get:
- Weighted inputs for layer 1.
- Derivative of the Sigmoid activation function applied to the first layer.
- Weighted inputs to the output layer. The f-values are the outputs of the hidden layer.
- Activation function applied to the output of the last layer.
- Derivative of the binary cross-entropy loss function wrt yhat.
Then, we can put together all those as
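Multiplying the five factors gives the gradient for w¹₄₃. The notation is assumed for illustration: z¹₄ and f₄ are the weighted input and output of hidden neuron 4, z²₁ is the weighted input of the output neuron, x₃ is the third input feature, and t is the truth label.

```latex
\begin{align}
\frac{\partial E}{\partial w^{1}_{43}}
  &= \frac{\partial E}{\partial \hat{y}}\cdot
     \frac{\partial \hat{y}}{\partial z^{2}_{1}}\cdot
     \frac{\partial z^{2}_{1}}{\partial f_{4}}\cdot
     \frac{\partial f_{4}}{\partial z^{1}_{4}}\cdot
     \frac{\partial z^{1}_{4}}{\partial w^{1}_{43}} \\[4pt]
  &= \underbrace{(\hat{y}-t)}_{\text{loss and output Sigmoid}}
     \cdot \underbrace{w^{2}_{14}}_{\partial z^{2}_{1}/\partial f_{4}}
     \cdot \underbrace{f_{4}\,(1-f_{4})}_{\text{hidden Sigmoid}}
     \cdot \underbrace{x_{3}}_{\partial z^{1}_{4}/\partial w^{1}_{43}}
\end{align}
```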
C2. Gradients of bias
Using the same concepts as before, check that, for b¹₂, we have
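Under the same assumed notation (z¹₂ and f₂ for the weighted input and output of hidden neuron 2, z²₁ for the weighted input of the output neuron), the bias gradient differs from the weight case only in its last factor, which equals 1:

```latex
\begin{align}
\frac{\partial E}{\partial b^{1}_{2}}
  &= \frac{\partial E}{\partial \hat{y}}\cdot
     \frac{\partial \hat{y}}{\partial z^{2}_{1}}\cdot
     \frac{\partial z^{2}_{1}}{\partial f_{2}}\cdot
     \frac{\partial f_{2}}{\partial z^{1}_{2}}\cdot
     \frac{\partial z^{1}_{2}}{\partial b^{1}_{2}} \\[4pt]
  &= (\hat{y}-t)\cdot w^{2}_{12}\cdot f_{2}\,(1-f_{2})\cdot 1
\end{align}
```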
All gradient values for the hidden-input are tabulated below
Figure 7: Gradients at hidden-input layers (Source: Author).
At this point, we are ready to compute the updated parameters at the hidden-input.
C3. Updating Parameters at the Hidden-Input
Let's go back to the update equations and work on updating w¹₁₃ and b¹₃
So, how many parameters do we have to update?
We have 4x3=12 weights and 4x1=4 biases at the hidden-input layers, 4x1=4 weights, and 1 bias at the output-hidden layers. That is a total of 21 parameters. They are called trainable parameters.
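The same count can be reproduced for any fully connected architecture. The helper below is an illustrative sketch; the name `count_parameters` is not from the article.

```python
def count_parameters(layer_sizes):
    """Count weights plus biases in a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weights between layers + biases
    return total

print(count_parameters([3, 4, 1]))  # 21
```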
All the updated parameters for hidden-input layers are shown below
Figure 8: Updated parameters for hidden-input layers (Source: Author).
We now have the updated parameters for all the layers, shown in Figure 8 and Figure 5, obtained using back-propagation of error. Running a forward pass with these updated parameters yields a model prediction, yhat, of 0.648, up from 0.521. This means that the model is learning: it is moving closer to the true value of 1 after two iterations of training. Further iterations yield 0.758, 0.836, 0.881, 0.908, 0.925, … (In the next article, we will implement back-propagation and the forward pass for many training examples and iterations, and you will get to see this.)
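The whole procedure (forward pass, gradients, update) can be sketched for a 3–4–1 network in plain Python. The input and initial weights below are hypothetical, not the values from Figure 2, so the printed predictions will differ from 0.521, 0.648, and so on; the point is only that the prediction climbs toward the target of 1. The function names are chosen for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass of a 3-4-1 network; returns hidden outputs f and yhat."""
    f = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    yhat = sigmoid(sum(w * fj for w, fj in zip(W2, f)) + b2)
    return f, yhat

def train_step(x, t, W1, b1, W2, b2, lr=0.5):
    """One forward pass, back-propagation, and parameter update."""
    f, yhat = forward(x, W1, b1, W2, b2)
    # dE/dz at the output neuron (binary cross-entropy + sigmoid).
    delta2 = yhat - t
    # dE/dz at hidden neuron j: delta2 * w2_j * f_j * (1 - f_j),
    # computed with the OLD output weights.
    delta1 = [delta2 * w * fj * (1 - fj) for w, fj in zip(W2, f)]
    new_W2 = [w - lr * delta2 * fj for w, fj in zip(W2, f)]
    new_b2 = b2 - lr * delta2
    new_W1 = [[w - lr * d * xi for w, xi in zip(row, x)]
              for row, d in zip(W1, delta1)]
    new_b1 = [b - lr * d for b, d in zip(b1, delta1)]
    # yhat is the prediction made BEFORE this update.
    return new_W1, new_b1, new_W2, new_b2, yhat

# Hypothetical input and initial weights (NOT the values from Figure 2).
x, t = [1.0, 0.5, -0.5], 1.0
W1 = [[0.1, -0.2, 0.3], [0.2, 0.1, -0.1], [-0.3, 0.2, 0.1], [0.05, -0.1, 0.2]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [0.1, -0.1, 0.2, 0.05]
b2 = 0.0

predictions = []
for _ in range(5):
    W1, b1, W2, b2, yhat = train_step(x, t, W1, b1, W2, b2)
    predictions.append(yhat)
print([round(p, 3) for p in predictions])
```

All gradients are computed from the parameters as they were during the forward pass, and only then are the parameters overwritten; updating the output weights first and reusing them for the hidden-layer gradients would give different results.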
Definitions
- Epoch: One epoch is when the entire dataset is passed through the network once. This comprises one instance of a forward pass and back-propagation.
- Batch size: The number of training examples passed through the network simultaneously. In our case, we have one training example. In cases where we have a large dataset, the data can be passed through the network in batches.
- Number of iterations: One iteration equals one pass over a number of training examples equal to the batch size. One pass is a forward pass and a back-propagation.
Example:
If we have 2000 training examples and set batch size of 20, then it takes 100 iterations to complete 1 epoch.
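The arithmetic in the example can be expressed as a one-line helper. Rounding up with `math.ceil` for a final partial batch is an assumption added here; the article's example divides evenly.

```python
import math

def iterations_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of batches needed to pass the whole dataset through once."""
    return math.ceil(num_examples / batch_size)

print(iterations_per_epoch(2000, 20))  # 100
```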
Conclusion
In this article, we have discussed back-propagation by working on an example. We have seen how the chain rule of differentiation is used to get the gradients of different equations: the loss function, activation function, weighting equations and layer output equations. We have also discussed how the derivative of the loss function can be used to update the parameters at each layer. In the next article (attached below), we implement the concepts learnt here in Python.
How Back-Propagation Works — A Python Implementation?
Implementing Back-Propagation in Python
towardsdatascience.com

















