Fully Connected Neural Network Algorithms – Andrew Gibiansky

In order to tell how well the neural network is doing, we compute the error $E(y^L)$. This error function can be a number of different things, such as binary cross-entropy or sum of squared residuals. However, we require that the derivative $\frac{dE(y^L)}{dy_i^L}$ depends only on $y_i^L$. This is the case for the functions listed previously, and effectively means that our error must be computed per output and summed. It rules out more complex error functions whose derivatives couple several output activations.
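For concreteness, here is a minimal sketch of two such per-output error functions and their elementwise derivatives, written in Python with NumPy (a choice of language, and of the $\frac{1}{2}$ convention for the squared error, not made in the text above). The point is that each component of the gradient depends only on the corresponding output activation:

```python
import numpy as np

# Sum of squared residuals: E(y) = 1/2 * sum_i (y_i - t_i)^2
def squared_error(y, t):
    return 0.5 * np.sum((y - t) ** 2)

def squared_error_grad(y, t):
    # dE/dy_i = y_i - t_i, which depends only on the i-th output
    return y - t

# Binary cross-entropy, summed over outputs
def cross_entropy(y, t):
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def cross_entropy_grad(y, t):
    # dE/dy_i = -t_i / y_i + (1 - t_i) / (1 - y_i), again elementwise
    return -t / y + (1 - t) / (1 - y)
```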

Backpropagation

The purpose of being able to compute the error, of course, is to be able to optimize the weights to minimize the error; that is, the process of learning. We learn via an algorithm known as backpropagation, which we can derive in a similar manner to forward propagation. In order to use gradient descent (or another algorithm) to train our network, we need to compute the derivative of the error with respect to each weight. Using the chain rule, we get that
$$\frac{\partial E}{\partial w_{ij}^\ell} = \frac{\partial E}{\partial x_j^{\ell + 1}}\frac{\partial x_j^{\ell + 1}}{\partial w_{ij}^\ell}$$
Note that we only get a contribution from $x_j^{\ell + 1}$ since that weight appears nowhere else. Looking at the equation for forward propagation $\left(x_i^\ell = \sum_j w_{ji}^{\ell - 1} y_j^{\ell - 1}\right)$, we see that the partial with respect to any given weight is just the activation from its origin neuron. Thus, the chain rule above becomes
$$\frac{\partial E}{\partial w_{ij}^\ell} = y_i^\ell \frac{\partial E}{\partial x_j^{\ell+1}}$$
We already know all the values of $y$, so we just need to compute the partials with respect to the inputs $x_j^\ell$. However, we know that $y_i^\ell = \sigma(x_i^\ell) + I_i^\ell$, so we can once more use the chain rule to write
$$\frac{\partial E}{\partial x_j^\ell} = \frac{\partial E}{\partial y_j^\ell}\frac{\partial y_j^\ell}{\partial x_j^\ell}= \frac{\partial E}{\partial y_j^\ell}\frac{\partial}{\partial x_j^\ell}\left(\sigma(x_j^\ell) + I_j^\ell\right) = \frac{\partial E}{\partial y_j^\ell} \sigma'(x_j^\ell)$$

Again, $y_j^\ell$ is the only expression in which we ever see an $x_j^\ell$ term, so it is the only contribution to the chain rule. The only bit we have yet to derive is the derivative with respect to the activation $y_i^\ell$. If $\ell = L$ (that is, we’re looking at the output layer), then we know that the partial is just the derivative of the error function, which is directly a function of those activations:
$$\frac{\partial E}{\partial y_i^L} = \frac{d}{d y_i^L} E(y^L).$$
As we discussed earlier, we require that this derivative is just a function of $y_i^L$ and none of the other activations in the output layer.
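As a concrete example (assuming the sum-of-squared-residuals error mentioned earlier, with targets $t_i$ that are not part of the notation above, and the usual factor of $\frac{1}{2}$), the output-layer errors are
$$\frac{\partial E}{\partial y_i^L} = \frac{d}{d y_i^L}\, \frac{1}{2} \sum_k \left(y_k^L - t_k\right)^2 = y_i^L - t_i,$$
which indeed depends only on $y_i^L$.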

Finally, if we are not looking at the output layer, then we simply use the chain rule once more:
$$\frac{\partial E}{\partial y_i^\ell} = \sum_j \frac{\partial E}{\partial x_j^{\ell + 1}} \frac{\partial x_j^{\ell + 1}}{\partial y_i^\ell} = \sum_j \frac{\partial E}{\partial x_j^{\ell + 1}} w_{ij}^\ell.$$
Unlike in the previous two applications of the chain rule, $y_i^\ell$ appears in many expressions throughout the neural network (every input in the next layer). Applying the chain rule, we sum over all of these contributions, and find that what we get is the derivatives of the inputs to the next layer, weighted by how much $y_i^\ell$ matters to each of those inputs. Intuitively speaking, this means that the error at a particular node in layer $\ell$ is the combination of the errors at the nodes in the next layer (layer $\ell + 1$), weighted by the size of the contribution the node in layer $\ell$ makes to each of those nodes in layer $\ell + 1$.
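In vectorized form this sum is just a matrix-vector product. Here is a one-line sketch, assuming (as a notational choice not made in the text) that the weights of layer $\ell$ are stored in a matrix `W_l` with `W_l[i, j]` $= w_{ij}^\ell$ and that the deltas of layer $\ell + 1$ are stored in a vector `delta_next`:

```python
import numpy as np

def backprop_errors(W_l, delta_next):
    """dE/dy_i^l = sum_j w_ij^l * dE/dx_j^(l+1), i.e. a matrix-vector product."""
    return W_l @ delta_next
```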

These equations are complete, and allow us to compute the gradient of the error (the partial derivatives with respect to all of the weights). The full algorithm follows.

Backpropagation:

  1. Compute errors at the output layer $L$:
    $$\frac{\partial E}{\partial y_i^L} = \frac{d}{d y_i^L} E(y^L)$$

  2. Compute the partial derivatives of the error with respect to the neuron inputs (sometimes known as “deltas”) at the earliest layer $\ell$ whose errors $\frac{\partial E}{\partial y_j^\ell}$ are known:
    $$\frac{\partial E}{\partial x_j^\ell} = \sigma'(x_j^\ell)\frac{\partial E}{\partial y_j^\ell}$$

  3. Compute errors at the previous layer (backpropagate errors):
    $$\frac{\partial E}{\partial y_i^\ell} = \sum_j w_{ij}^\ell \frac{\partial E}{\partial x_j^{\ell + 1}}$$

  4. Repeat steps 2 and 3 until deltas are known at all but the input layer.

  5. Compute the gradient of the error (derivative with respect to weights):
    $$\frac{\partial E}{\partial w_{ij}^\ell} = y_i^\ell \frac{\partial E}{\partial x_j^{\ell+1}}$$
    Note that in order to compute derivatives with respect to weights in a given layer, we use the activations in that layer and the deltas for the next layer. Thus, we never need to compute deltas for the input layer.
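Putting the five steps together, the following is a minimal sketch of the whole algorithm in Python with NumPy. It assumes a sigmoid nonlinearity, the squared-error function with targets `t`, no external inputs $I^\ell$ beyond the input layer, weight matrices `W[l]` of shape (size of layer $\ell$, size of layer $\ell + 1$), and lists `x` and `y` of per-layer inputs and activations saved during forward propagation; none of these choices are fixed by the derivation above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def backprop(W, x, y, t):
    """W[l]: weights from layer l to layer l + 1, shape (n_l, n_{l+1}).
    x[l], y[l]: inputs and activations at layer l from the forward pass.
    t: targets for the output layer.  Returns dE/dW for every layer."""
    L = len(W)               # index of the output layer
    grads = [None] * L

    # Step 1: errors at the output layer (squared error assumed).
    dE_dy = y[L] - t

    # Steps 2-5, walking backwards through the layers.
    for l in range(L, 0, -1):
        # Step 2: deltas at layer l.
        delta = sigmoid_prime(x[l]) * dE_dy
        # Step 5: gradient for the weights feeding layer l
        # (activations at layer l - 1 times deltas at layer l).
        grads[l - 1] = np.outer(y[l - 1], delta)
        # Step 3: backpropagate the errors to layer l - 1.
        dE_dy = W[l - 1] @ delta

    return grads
```

Note that `np.outer(y[l - 1], delta)` is exactly step 5: the gradient for $w_{ij}^\ell$ is the activation $y_i^\ell$ from the weight's origin neuron times the delta $\frac{\partial E}{\partial x_j^{\ell + 1}}$ at its destination.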