Backpropagation in CNN


A convolution employs a weight-sharing principle which complicates the mathematics significantly, but let's work through it. I am drawing most of my explanation from this source.

Forward pass

As you observed, the forward pass of the convolutional layer can be expressed as

$x_{i, j}^l = \sum_m \sum_n w_{m,n}^l o_{i+m, j+n}^{l-1} + b_{i, j}^l$

where $k_1$ and $k_2$ are the dimensions of the kernel, in our case $k_1 = k_2 = 2$, and $m$ and $n$ iterate across those dimensions. This is how you obtained, for example, the output $x^1_{0,0} = 0.25$.
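
To make this concrete, here is a minimal sketch of the forward pass for a single channel using the input from the example. Note that the formula above is a valid cross-correlation (the kernel is not flipped). The kernel entries other than $w^1_{0,0} = -0.13$ and the bias are hypothetical placeholders, since the full kernel is not reproduced here; with the actual values this would give outputs such as $x^1_{0,0} = 0.25$.

import numpy as np
from scipy import signal

# output of the previous layer, o^{l-1} (the 5x5 input from the example)
o = np.array([(0.51, 0.9, 0.88, 0.84, 0.05),
              (0.4, 0.62, 0.22, 0.59, 0.1),
              (0.11, 0.2, 0.74, 0.33, 0.14),
              (0.47, 0.01, 0.85, 0.7, 0.09),
              (0.76, 0.19, 0.72, 0.17, 0.57)])

# hypothetical 2x2 kernel: only w[0,0] = -0.13 is given in the example
w = np.array([(-0.13, 0.1),
              (0.2, -0.05)])
b = 0.0  # bias, assumed zero here

# x[i,j] = sum_m sum_n w[m,n] * o[i+m, j+n] + b  (valid cross-correlation)
x = signal.correlate2d(o, w, 'valid') + b
print(x.shape)  # (4, 4), i.e. (H - k1 + 1) x (W - k2 + 1)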

Backpropagation

Assuming you are using the mean squared error (MSE) defined as

$E = \frac{1}{2}\sum_p (t_p - y_p)^2$,

we want to determine

$\frac{\partial E}{\partial w^l_{m', n'}}$ in order to update the weights. Here $m'$ and $n'$ are fixed indices into the kernel matrix, not to be confused with the iterators $m$ and $n$; in our example, $w^1_{0,0} = -0.13$. We can also see that for an input image of size $H \times W$ the output dimension after the convolutional layer will be

$(H - k_1 + 1) \times (W - k_2 + 1)$.

In our case that would be $4 \times 4$, as you showed. Let's calculate the error term. Every element of the output space has been influenced by the kernel weights: the kernel weight $w^1_{0,0} = -0.13$ contributed to the output $x^1_{0,0} = 0.25$ and to every other output. Thus we express its contribution to the total error as

$\frac{\partial E}{\partial w^l_{m', n'}} = \sum_{i=0}^{H-k_1} \sum_{j=0}^{W-k_2} \frac{\partial E}{\partial x^l_{i, j}} \frac{\partial x^l_{i, j}}{\partial w^l_{m', n'}}$.

This iterates across the entire output space, determines the error each output contributes, and then determines how much the kernel weight contributed to that output.

For simplicity, and to keep track of the backpropagated error, let us call the contribution to the error from each element of the output space delta,

$\frac{\partial E}{\partial x^l_{i, j}} = \delta^l_{i,j}$.

The contribution from the weights

The convolution is defined as

$x_{i, j}^l = \sum_m \sum_n w_{m,n}^l o_{i+m, j+n}^{l-1} + b_{i, j}^l$,

thus,

$\frac{\partial x^l_{i, j}}{\partial w^l_{m', n'}} = \frac{\partial}{\partial w^l_{m', n'}} (\sum_m \sum_n w_{m,n}^l o_{i+m, j+n}^{l-1} + b_{i, j}^l)$.

By expanding the summation, we observe that the derivative is non-zero only when $m = m'$ and $n = n'$. We then get

$\frac{\partial x^l_{i, j}}{\partial w^l_{m', n'}} = o^{l-1}_{i+m', j+n'}$.
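
For our $2 \times 2$ kernel this is easy to see by writing out the four terms explicitly, e.g. for $m' = n' = 0$:

$x^l_{i, j} = w^l_{0,0} o^{l-1}_{i, j} + w^l_{0,1} o^{l-1}_{i, j+1} + w^l_{1,0} o^{l-1}_{i+1, j} + w^l_{1,1} o^{l-1}_{i+1, j+1} + b^l_{i, j}$,

and only the first term contains $w^l_{0,0}$, so $\frac{\partial x^l_{i, j}}{\partial w^l_{0,0}} = o^{l-1}_{i, j}$.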

Substituting this back into our error term gives

$\frac{\partial E}{\partial w^l_{m', n'}} = \sum_{i=0}^{H-k_1} \sum_{j=0}^{W-k_2} \delta_{i,j}^l o^{l-1}_{i+m', j+n'}$.

Stochastic gradient descent

$w^{l(t+1)}_{m', n'} = w^{l(t)}_{m', n'} - \eta \frac{\partial E}{\partial w^l_{m', n'}}$

Let's calculate some of them. Note that the sum above is the valid cross-correlation of $o^{l-1}$ with $\delta^l$, which can equivalently be computed as a convolution with $\delta^l$ rotated by 180°; that is what the np.rot90(d, 2) below does.

import numpy as np
from scipy import signal

# output of the previous layer, o^{l-1} (5x5)
o = np.array([(0.51, 0.9, 0.88, 0.84, 0.05),
              (0.4, 0.62, 0.22, 0.59, 0.1),
              (0.11, 0.2, 0.74, 0.33, 0.14),
              (0.47, 0.01, 0.85, 0.7, 0.09),
              (0.76, 0.19, 0.72, 0.17, 0.57)])

# backpropagated error (deltas) for this layer, delta^l (4x4)
d = np.array([(0, 0, 0.0686, 0),
              (0, 0.0364, 0, 0),
              (0, 0.0467, 0, 0),
              (0, 0, 0, -0.0681)])

# dE/dw is the valid cross-correlation of o with d, computed here as a
# convolution with d rotated by 180 degrees
gradient = signal.convolve2d(np.rot90(d, 2), o, 'valid')

array([[ 0.044606,  0.094061],
       [ 0.011262,  0.068288]])
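
As a sanity check, the same numbers come out of evaluating the double sum directly, which confirms that the rotated convolution above implements $\sum_{i} \sum_{j} \delta^l_{i,j} o^{l-1}_{i+m', j+n'}$ (this reuses o and d from the snippet above):

# direct evaluation of dE/dw[m', n'] = sum_{i,j} d[i, j] * o[i + m', j + n']
grad_check = np.zeros((2, 2))
for m in range(2):
    for n in range(2):
        grad_check[m, n] = np.sum(d * o[m:m + 4, n:n + 4])

print(np.allclose(grad_check, gradient))  # True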

Now you can put that into the SGD equation in place of $\frac{\partial E}{\partial w}$.
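
For completeness, here is a minimal sketch of that update step, continuing from the snippet above. The learning rate $\eta$ and every kernel entry other than $w^1_{0,0} = -0.13$ are hypothetical placeholders:

eta = 0.01  # hypothetical learning rate

# hypothetical current kernel; only w[0,0] = -0.13 is given in the example
w = np.array([(-0.13, 0.1),
              (0.2, -0.05)])

# w^(t+1) = w^(t) - eta * dE/dw
w_updated = w - eta * gradient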

Please let me know if there are errors in the derivation.

Update: Corrected code