Backpropagation

The derivation of the backpropagation algorithm is fairly straightforward. It follows from the use of the chain rule and product rule in differential calculus. Applying these rules depends on being able to differentiate the activation function, which is one of the reasons the Heaviside step function is not used (being discontinuous and thus non-differentiable).

Preliminaries

For the rest of this section, the derivative of a function \(f(x)\) will be denoted \(f^{\prime}(x)\), so that the sigmoid function’s derivative is \(\sigma^{\prime}(x)\).
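Since the sigmoid function is the activation used throughout this derivation, it is worth recalling that its derivative can be expressed in terms of the function's own output,

\[\sigma^{\prime}(x) = \sigma(x)\big(1 - \sigma(x)\big),\]

which makes the derivative terms appearing below cheap to evaluate once the forward outputs are known.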

To simplify the mathematics further, the bias \(b_i^k\) for node \(i\) in layer \(k\) will be incorporated into the weights as \(w_{0i}^k\) with a fixed output of \(o_0^{k-1} = 1\) for node \(0\) in layer \(k-1\). Thus,

\[w_{0i}^k = b_i^k.\]

To see that this is equivalent to the original formulation, note that

\[a_i^k = b_i^k + \sum_{j = 1}^{r_{k-1}} w_{ji}^k o_j^{k-1} = \sum_{j = 0}^{r_{k-1}} w_{ji}^k o_j^{k-1},\]

where the left side is the original formulation and the right side is the new formulation.
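As a quick sanity check of this reindexing, the following NumPy sketch (variable names are illustrative, not taken from this article) confirms that folding the bias into the weights, with a fixed extra input of \(1\), reproduces the original activation:

```python
import numpy as np

# Check that folding the bias into the weights (with a fixed input of 1)
# gives the same activation as the original formulation.
rng = np.random.default_rng(0)

o_prev = rng.normal(size=4)   # outputs o_j^{k-1} of layer k-1, j = 1..4
w = rng.normal(size=4)        # weights w_{ji}^k into node i of layer k
b = rng.normal()              # bias b_i^k of node i

a_original = b + w @ o_prev                                            # b_i^k + sum_j w_{ji}^k o_j^{k-1}
a_folded = np.concatenate(([b], w)) @ np.concatenate(([1.0], o_prev))  # with w_{0i}^k = b_i^k, o_0^{k-1} = 1

assert np.isclose(a_original, a_folded)
```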

Using the notation above, backpropagation attempts to minimize the following error function with respect to the neural network’s weights:

\[E(X, \theta) = \frac{1}{2N}\sum_{i=1}^N\left( \hat{y_i} - y_i\right)^{2}\]

by calculating, for each weight \(w_{ij}^k,\) the value of \(\frac{\partial E}{\partial w_{ij}^k}\). Since the error function can be decomposed into a sum over individual error terms for each individual input-output pair, the derivative can be calculated with respect to each input-output pair individually and then combined at the end (since the derivative of a sum of functions is the sum of the derivatives of each function):

\[\frac{\partial E(X, \theta)}{\partial w_{ij}^k} = \frac{1}{N}\sum_{d=1}^N\frac{\partial}{\partial w_{ij}^k}\left(\frac{1}{2}\left(\hat{y_d} - y_d\right)^{2}\right) = \frac{1}{N}\sum_{d=1}^N\frac{\partial E_d}{\partial w_{ij}^k}.\]

Thus, for the purposes of derivation, the backpropagation algorithm will concern itself with only one input-output pair. Once this is derived, the general form for all input-output pairs in \(X\) can be generated by combining the individual gradients. Thus, the error function in question for derivation is

\[E = \frac{1}{2}\left( \hat{y} - y\right)^{2},\]

where the subscript \(d\) in \(E_d\), \(\hat{y_d}\), and \(y_d\) is omitted for simplification.
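Computationally, this decomposition just says that the full gradient is the average of per-pair gradients. A minimal sketch, assuming a hypothetical `grad_single` function that returns \(\frac{\partial E_d}{\partial w_{ij}^k}\) for a single input-output pair:

```python
# The gradient of the averaged error is the average of the per-pair gradients.
# `grad_single` is a hypothetical function returning the gradient of
# (1/2)(y_hat - y)^2 for one input-output pair under parameters `theta`.
def batch_gradient(pairs, theta, grad_single):
    grads = [grad_single(x, y, theta) for x, y in pairs]
    return sum(grads) / len(grads)
```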

Error Function Derivatives

The derivation of the backpropagation algorithm begins by applying the chain rule to the error function partial derivative

\[\frac{\partial E}{\partial w_{ij}^k} = \frac{\partial E}{\partial a_j^k}\frac{\partial a_j^k}{\partial w_{ij}^k},\]

where \(a_j^k\) is the activation (product-sum plus bias) of node \(j\) in layer \(k\) before it is passed to the nonlinear activation function (in this case, the sigmoid function) to generate the output. This decomposition of the partial derivative says that the change in the error function due to a weight is the product of the change in the error function \(E\) due to the activation \(a_j^k\) and the change in the activation \(a_j^k\) due to the weight \(w_{ij}^k\).

The first term is usually called the error, for reasons discussed below. It is denoted

\[\delta_j^k \equiv \frac{\partial E}{\partial a_j^k}.\]

The second term can be calculated from the equation for \(a_j^k\) above; since only the \(l = i\) term of the sum depends on \(w_{ij}^k\), every other term differentiates to zero:

\[\frac{\partial a_j^k}{\partial w_{ij}^k} = \frac{\partial}{\partial w_{ij}^k} \left(\sum_{l = 0}^{r_{k-1}} w_{lj}^k o_l^{k-1}\right) = o_i^{k-1}.\]

Thus, the partial derivative of the error function \(E\) with respect to a weight \(w_{ij}^k\) is

\[\frac{\partial E}{\partial w_{ij}^k} = \delta_j^k o_i^{k-1}.\]

Thus, the partial derivative of the error function with respect to a weight is the product of the error term \(\delta_j^k\) at node \(j\) in layer \(k\) and the output \(o_i^{k-1}\) of node \(i\) in layer \(k-1\). This makes intuitive sense, since the weight \(w_{ij}^k\) connects the output of node \(i\) in layer \(k-1\) to the input of node \(j\) in layer \(k\) in the computation graph.
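In vectorized form, this says that the gradient for an entire layer is the outer product of the previous layer's outputs and the current layer's error terms. A small NumPy sketch with illustrative numbers:

```python
import numpy as np

# With delta_k the error terms of layer k and o_prev the outputs of layer k-1
# (including o_0^{k-1} = 1 for the bias), each partial derivative
# dE/dw_{ij}^k equals delta_j^k * o_i^{k-1}, so the gradient is an outer product.
delta_k = np.array([0.2, -0.1, 0.05])        # error terms delta_j^k, j = 1..3
o_prev = np.array([1.0, 0.7, -0.3, 0.9])     # outputs o_i^{k-1}, i = 0..3 (o_0 = 1)

grad_w = np.outer(o_prev, delta_k)           # grad_w[i, j] = o_i^{k-1} * delta_j^k
print(grad_w.shape)                          # (4, 3): one entry per weight w_{ij}^k
```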

It is important to note that the above partial derivatives have all been calculated without any consideration of a particular error function or activation function. However, since the error term \(\delta_j^k\) still needs to be calculated, and is dependent on the error function \(E\), at this point it is necessary to introduce specific functions for both of these. As mentioned previously, classic backpropagation uses the mean squared error function (which is the squared error function for the single input-output pair case) and the sigmoid activation function.

The calculation of the error \(\delta_j^{k}\) will be shown to be dependent on the values of error terms in the next layer. Thus, computation of the error terms will proceed backwards from the output layer down to the input layer. This is where backpropagation, or backwards propagation of errors, gets its name.

The Output Layer

Starting from the final layer, backpropagation attempts to define the value \(\delta_1^m\), where \(m\) is the final layer \((\)the subscript is \(1\) and not \(j\) because this derivation concerns a one-output neural network, so there is only one output node \(j = 1).\) For example, a four-layer neural network will have \(m=3\) for the final layer, \(m=2\) for the second to last layer, and so on. Expressing the error function \(E\) in terms of the value \(a_1^m\) \(\big(\)since \(\delta_1^m\) is a partial derivative with respect to \(a_1^m\big)\) gives

\[E = \frac{1}{2}\left( \hat{y} - y\right)^{2} = \frac{1}{2}\big(g_o(a_1^m) - y\big)^{2},\]

where \(g_o(x)\) is the activation function for the output layer.

Thus, applying the partial derivative and using the chain rule gives

\[\delta_1^m = \left(g_o(a_1^m) - y\right)g_o^{\prime}(a_1^m) = \left(\hat{y}-y\right)g_o^{\prime}(a_1^m).\]

Putting it all together, the partial derivative of the error function \(E\) with respect to a weight in the final layer \(w_{i1}^m\) is

\[\frac{\partial E}{\partial w_{i1}^m}= \delta_1^m o_i^{m-1} = \left(\hat{y}-y\right)g_o^{\prime}(a_1^m)\ o_i^{m-1}.\]
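As a sketch (assuming, as in the rest of this derivation, that the output activation \(g_o\) is the sigmoid, whose derivative is \(\sigma(x)\big(1-\sigma(x)\big)\)), the output error term can be computed directly from forward-pass values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Error term of the single output node: (y_hat - y) * g_o'(a_1^m),
# using g_o'(a) = sigmoid(a) * (1 - sigmoid(a)) for a sigmoid output.
def output_delta(a_out, y):
    y_hat = sigmoid(a_out)                      # y_hat = g_o(a_1^m)
    return (y_hat - y) * y_hat * (1.0 - y_hat)

# The gradient for a final-layer weight w_{i1}^m is then
# output_delta(a_out, y) * o_i^{m-1}.
```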

The Hidden Layers

Now the question arises of how to calculate the partial derivatives of layers other than the output layer. Luckily, the chain rule for multivariate functions comes to the rescue again. Observe the following equation for the error term \(\delta_j^k\) in layer \(1 \le k \lt m:\)

\[\delta_j^k = \frac{\partial E}{\partial a_j^k}= \sum_{l=1}^{r^{k+1}}\frac{\partial E}{\partial a_l^{k+1}}\frac{\partial a_l^{k+1}}{\partial a_j^k},\]

where \(l\) ranges from \(1\) to \(r^{k+1}\) (the number of nodes in the next layer). Note that, because the bias output \(o_0^{k+1}\) of the next layer is fixed at \(1\), it does not depend on the activations of the previous layers, and thus \(l\) does not take on the value \(0\).

Plugging in the error term \(\delta_l^{k+1}\) gives the following equation:

\[\delta_j^k = \sum_{l=1}^{r^{k+1}}\delta_l^{k+1}\frac{\partial a_l^{k+1}}{\partial a_j^k}.\]

Remembering the definition of \(a_l^{k+1}\)

\[a_l^{k+1} = w_{0l}^{k+1} + \sum_{j=1}^{r^k}w_{jl}^{k+1}g\big(a_j^k\big),\]

where \(g(x)\) is the activation function for the hidden layers and the first term is the constant bias contribution \(w_{0l}^{k+1}o_0^k\) with \(o_0^k = 1\), which does not depend on \(a_j^k\), it follows that

\[\frac{\partial a_l^{k+1}}{\partial a_j^k} = w_{jl}^{k+1}g^{\prime}\big(a_j^k\big).\]

Plugging this into the above equation yields a final equation for the error term \(\delta_j^k\) in the hidden layers, called the backpropagation formula:

\[\delta_j^k = \sum_{l=1}^{r^{k+1}}\delta_l^{k+1}w_{jl}^{k+1}g^{\prime}\big(a_j^k\big) = g^{\prime}\big(a_j^k\big)\sum_{l=1}^{r^{k+1}}w_{jl}^{k+1}\delta_l^{k+1}.\]

Putting it all together, the partial derivative of the error function \(E\) with respect to a weight in the hidden layers \(w_{ij}^k\) for \(1 \le k \lt m\) is

\[\frac{\partial E}{\partial w_{ij}^k} = \delta_j^k o_i^{k-1} = g^{\prime}\big(a_j^k\big)o_i^{k-1}\sum_{l=1}^{r^{k+1}}w_{jl}^{k+1}\delta_l^{k+1}.\]
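In code, the backpropagation formula is a matrix-vector product followed by an elementwise scaling by \(g^{\prime}\). The sketch below assumes sigmoid hidden activations and stores \(w_{jl}^{k+1}\) in a matrix `W_next` indexed as `W_next[j, l]` (illustrative names, not from this article); the bias weights \(w_{0l}^{k+1}\) are not needed, since \(\delta_j^k\) is only computed for \(j \ge 1\):

```python
import numpy as np

def sigmoid_prime(a):
    # Derivative of the sigmoid, expressed through the sigmoid itself.
    s = 1.0 / (1.0 + np.exp(-a))
    return s * (1.0 - s)

# Backpropagation formula: delta_j^k = g'(a_j^k) * sum_l w_{jl}^{k+1} delta_l^{k+1}.
# a_k[j] holds a_j^k, W_next[j, l] holds w_{jl}^{k+1}, delta_next[l] holds delta_l^{k+1}.
def hidden_delta(a_k, W_next, delta_next):
    return sigmoid_prime(a_k) * (W_next @ delta_next)
```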

Backpropagation as Backwards Computation

This equation is where backpropagation gets its name. Namely, the error \(\delta_j^k\) at layer \(k\) is dependent on the errors \(\delta_l^{k+1}\) at the next layer \(k+1\). Thus, errors flow backward, from the last layer to the first layer. All that is needed is to compute the first error terms based on the computed output \(\hat{y} = g_o(a_1^m)\) and target output \(y\). Then, the error terms for the previous layer are computed by performing a product sum \(\big(\)weighted by \(w_{jl}^{k+1}\big)\) of the error terms for the next layer and scaling it by \(g^{\prime}\big(a_j^k\big)\), repeated until the input layer is reached.

This backwards propagation of errors is very similar to the forward computation that calculates the neural network’s output. Thus, calculating the output is often called the forward phase while calculating the error terms and derivatives is often called the backward phase. While going in the forward direction, the inputs are repeatedly recombined from the first layer to the last by product sums dependent on the weights \(w_{ij}^k\) and transformed by nonlinear activation functions \(g(x)\) and \(g_o(x)\). In the backward direction, the “inputs” are the final layer’s error terms, which are repeatedly recombined from the last layer to the first by product sums dependent on the weights \(w_{jl}^{k+1}\) and transformed by nonlinear scaling factors \(g_o^{\prime}\big(a_j^m\big)\) and \(g^{\prime}\big(a_j^k\big)\).

Furthermore, because the computations of the backward phase depend on the activations \(a_j^k\) and outputs \(o_j^k\) of nodes in the previous layer \((\)for the output factor \(o_i^{k-1}\) in every layer\()\) and on the error terms of the next layer \((\)for the error terms of the hidden layers\()\), all of these values must be computed before the backward phase can commence. Thus, the forward phase precedes the backward phase in every iteration of gradient descent. In the forward phase, the activations \(a_j^k\) and outputs \(o_j^k\) are stored for use in the backward phase. Once the backward phase is completed and the partial derivatives are known, the weights \(\big(\)and associated biases \(b_j^k = w_{0j}^k\big)\) can be updated by gradient descent. This process is repeated until a local minimum is found or a convergence criterion is met.
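Putting the two phases together, the following is a minimal sketch of one gradient-descent iteration for a network with a single hidden layer and a single output node, with sigmoid activations throughout and biases folded into the weight matrices as in the preliminaries (the variable names and layout are illustrative, not a reference implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(x, y, W1, W2, learning_rate=0.1):
    # Forward phase: remember activations and outputs for the backward phase.
    o0 = np.concatenate(([1.0], x))              # input with o_0 = 1 for the bias
    a1 = W1 @ o0                                 # hidden activations a_j^1
    o1 = np.concatenate(([1.0], sigmoid(a1)))    # hidden outputs, with o_0^1 = 1
    a2 = W2 @ o1                                 # output activation a_1^2 (length-1 array)
    y_hat = sigmoid(a2)[0]                       # network output

    # Backward phase: error terms, from the output layer back to the hidden layer.
    delta2 = (y_hat - y) * y_hat * (1.0 - y_hat) # (y_hat - y) * g_o'(a_1^2)
    g_prime = sigmoid(a1) * (1.0 - sigmoid(a1))  # g'(a_j^1)
    delta1 = g_prime * (W2[0, 1:] * delta2)      # backpropagation formula (bias weight skipped)

    # Gradients dE/dw_{ij}^k = delta_j^k * o_i^{k-1}, then a gradient-descent update.
    W2 -= learning_rate * delta2 * o1[np.newaxis, :]
    W1 -= learning_rate * np.outer(delta1, o0)
    return y_hat
```

Here `W1` has one row per hidden node and one column per input plus the bias column, while `W2` is a single row with one column per hidden node plus the bias column. Repeating `train_step` over the input-output pairs implements a stochastic version of the procedure described above, updating after each pair rather than after averaging the gradients over the whole set.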