Backpropagation in a convolutional layer
Introduction
Motivation
The aim of this post is to detail how gradient backpropagation works in a convolutional layer of a neural network. Typically, the output of this layer is the input of a chosen activation function (relu for instance). We assume that we are given the gradient dy backpropagated from this activation function. As I was unable to find on the web a complete, detailed and “simple” explanation of how it works, I decided to do the math, trying to understand step by step how it works on simple examples before generalizing. Before reading further, you should be familiar with neural networks, in particular the forward pass, backpropagation of gradients through a computational graph, and basic linear algebra with tensors.
Convolution layer — Forward pass & BP
Notations
* will refer to the convolution of 2 tensors in the case of a neural network (an input x and a filter w).
- When x and w are matrices:
  - if x and w share the same shape, x*w will be a scalar equal to the sum of the element-wise multiplication between the two arrays.
  - if w is smaller than x, we will obtain an activation map y where each value is the convolution defined above applied to a sub-region of x with the size of w. This sub-region activated by the filter slides all across the input array x.
- if x and w have more than 2 dimensions, we consider the last 3 for the convolution and the last 2 for the sliding area (we just add a depth to our matrices).
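To make this notation concrete, here is a minimal NumPy sketch of the two matrix cases above (the array values and variable names are arbitrary illustrations):

```python
import numpy as np

x = np.array([[1., 2., 0., 1.],
              [3., 1., 1., 0.],
              [0., 2., 2., 1.],
              [1., 0., 1., 3.]])
w = np.array([[1., 0.],
              [0., -1.]])

# Same shape: the convolution is a scalar, the sum of the element-wise product.
patch = x[:2, :2]
scalar = np.sum(patch * w)

# w smaller than x: slide w over x to build the activation map y.
H, W = x.shape
HH, WW = w.shape
y = np.zeros((H - HH + 1, W - WW + 1))
for k in range(H - HH + 1):
    for l in range(W - WW + 1):
        y[k, l] = np.sum(x[k:k + HH, l:l + WW] * w)

print(scalar)  # convolution of two arrays of the same shape
print(y)       # activation map for stride 1 and no padding
```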
Notations and variables are the same as the ones used in the excellent Stanford course on convolutional neural networks for visual recognition, in particular those of assignment 2. Details on the convolutional layer and the forward pass can be found in this video, and a naive implementation of the forward pass in this post.
Convolution layer notations
Goal
Our goal is to find out how the gradient propagates backwards in a convolutional layer. The forward pass is defined like this:
The input consists of N data points, each with C channels, height H and width W. We convolve each input with F different filters, where each filter spans all C channels and has height HH and width WW.
Input:
- x: Input data of shape (N, C, H, W)
- w: Filter weights of shape (F, C, HH, WW)
- b: Biases, of shape (F,)
- conv_param: A dictionary with the following keys:
- ‘stride’: The number of pixels between adjacent receptive fields in the horizontal and vertical directions.
- ‘pad’: The number of pixels that will be used to zero-pad the input.
During padding, ‘pad’ zeros should be placed symmetrically (i.e. equally on both sides) along the height and width axes of the input.
Returns a tuple of:
- out: Output data, of shape (N, F, H’, W’) where H’ and W’ are given by
H’ = 1 + (H + 2 * pad - HH) / stride
W’ = 1 + (W + 2 * pad - WW) / stride
- cache: (x, w, b, conv_param)
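As a worked example of these formulas, with the sizes of the 2D example further down in this post (a 4×4 input, a 2×2 filter, pad = 0, stride = 1):

$$H' = 1 + \frac{4 + 2 \times 0 - 2}{1} = 3, \qquad W' = 1 + \frac{4 + 2 \times 0 - 2}{1} = 3$$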
Forward pass
Generic case (simplified with N=1, C=1, F=1)
N=1: one input, C=1: one channel, F=1: one filter.
Convolution 2D with:
- x: input of shape H×W
- x′: x with padding
- w: filter of shape HH×WW
- b: bias (a scalar)
- y: output of shape H′×W′
- s: stride
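With these notations, and using 1-based indices, the forward pass can be sketched as:

$$y_{k,l} = b + \sum_{i=1}^{HH} \sum_{j=1}^{WW} w_{i,j} \, x'_{(k-1)s+i,\;(l-1)s+j}$$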
Specific case: stride=1, pad=0, and no bias.
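In this simplified setting (x′ = x), the sketch above reduces to:

$$y_{k,l} = \sum_{i=1}^{HH} \sum_{j=1}^{WW} w_{i,j} \, x_{k+i-1,\;l+j-1}$$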
Backpropagation
We know dy, the gradient of our cost function L with respect to the output y. We want to compute dx, dw and db, the partial derivatives of L with respect to the input x, the weights w and the bias b. We suppose that the gradient of L has been backpropagated down to y.
Trivial case: input x is a vector (1 dimension)
We are looking for an intuition of how it works on an easy setup and later on we will try to generalize.
Input
Output
Forward pass — convolution with one filter w, stride = 1, padding = 0
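Assuming, for illustration, an input with four components x = (x_1, x_2, x_3, x_4), a filter with two weights w = (w_1, w_2) and a bias b, this forward pass gives:

$$y_1 = w_1 x_1 + w_2 x_2 + b, \qquad y_2 = w_1 x_2 + w_2 x_3 + b, \qquad y_3 = w_1 x_3 + w_2 x_4 + b$$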
Backpropagation
We know the gradient of our cost function L with respect to y:
This can be written with the Jacobian notation:
dy and y share the same shape:
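With the sizes assumed above, this can be sketched as:

$$dy = \frac{\partial L}{\partial y} = \left( \frac{\partial L}{\partial y_1},\; \frac{\partial L}{\partial y_2},\; \frac{\partial L}{\partial y_3} \right) = (dy_1,\; dy_2,\; dy_3)$$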
We are looking for
db
Using the chain rule and the forward pass formula (1), we can write:
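A sketch of the result, with the sizes assumed above (each y_k depends on b with a coefficient of 1):

$$db = \frac{\partial L}{\partial b} = \sum_{k=1}^{3} \frac{\partial L}{\partial y_k} \frac{\partial y_k}{\partial b} = dy_1 + dy_2 + dy_3$$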
dw
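Under the same assumptions, the chain rule gives the following sketch:

$$dw_1 = x_1\,dy_1 + x_2\,dy_2 + x_3\,dy_3, \qquad dw_2 = x_2\,dy_1 + x_3\,dy_2 + x_4\,dy_3$$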
We can notice that dw is a convolution of the input x with a filter dy. Let's see if it's still valid with an added dimension.
dx
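Under the same assumptions, a sketch of the result is:

$$dx_1 = w_1\,dy_1, \qquad dx_2 = w_2\,dy_1 + w_1\,dy_2, \qquad dx_3 = w_2\,dy_2 + w_1\,dy_3, \qquad dx_4 = w_2\,dy_3$$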
Once again, we have a convolution. A little bit more complex this time. We should consider an input dy with a 0-padding of size 1 convolved with an “inverted” filter w like (w2,w1)
The next step will be to have a look at how it works on small matrices.
Input x is a matrix (2 dimensions)
Input
Output
Once again, we will choose the easiest case: stride = 1 and no padding. With a 4×4 input x and a 2×2 filter w, the shape of y will be (3, 3).
Forward pass
We will have:
Written with subscripts:
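A sketch of this formula, using the 4×4 input x, the 2×2 filter w and a bias b (stride 1, no padding):

$$y_{k,l} = \sum_{i=1}^{2} \sum_{j=1}^{2} w_{i,j}\, x_{k+i-1,\;l+j-1} + b, \qquad k,l \in \{1,2,3\}$$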
Backpropagation
We know the gradient dy of our cost function L with respect to y; it has the same (3, 3) shape as y.
db
Using the Einstein convention to lighten the formulas (when an index variable appears twice in a term, it implies summation over all the values of that index):
Summation on i and j. And we have:
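Written with explicit sums, a sketch of the result is:

$$db = \frac{\partial L}{\partial b} = \sum_{i=1}^{3} \sum_{j=1}^{3} \frac{\partial L}{\partial y_{i,j}} \frac{\partial y_{i,j}}{\partial b} = \sum_{i=1}^{3} \sum_{j=1}^{3} dy_{i,j}$$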
dw
We are looking for
Using the formula (4) we have:
All the terms ∂w_{k,l}/∂w_{m,n} are 0, except for (k,l) = (m,n) where the derivative is 1, a case occurring just once in the double sum. Hence:
Using formula (3) we now have:
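A sketch of the resulting expression:

$$dw_{m,n} = \sum_{i=1}^{3} \sum_{j=1}^{3} dy_{i,j}\, x_{i+m-1,\;j+n-1}, \qquad m,n \in \{1,2\}$$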
If we compare this equation with formula (1) giving the result of a convolution, we can distinguish a similar pattern where dy is a filter applied on an input x.
dx
Using the chain rule as we did for (5), we have:
This time, we are looking for
Using equation (4):
We now have:
In our example, range sets for indices are:
When we set k = m − i + 1, we are going to be out of the defined boundaries: (m − i + 1) ∈ [−1, 4]
In order to keep the formula above valid, we choose to extend the definition of the matrix w with 0 values as soon as the indices go out of the defined range.
Once again, only one partial derivative of x in the double sum equals 1. So:
where w is our 0-extended initial filter, thus:
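A sketch of the result, with w zero-extended outside its original 2×2 range:

$$dx_{m,n} = \sum_{i=1}^{3} \sum_{j=1}^{3} dy_{i,j}\, w_{m-i+1,\;n-j+1}, \qquad m,n \in \{1,\dots,4\}$$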
Let's visualize it for several chosen values of the indices.
Using ∗ notation for convolution, we have:
As dy remains the same, we will only look at the indices of w. For dx_{2,2}, the indices of w are (3−i, 3−j).
We now have a convolution between dy and a w’ matrix defined by:
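A sketch of this w′ (padding w with zeros wherever 3−i or 3−j falls outside {1, 2}):

$$w' = \begin{pmatrix} w_{2,2} & w_{2,1} & 0 \\ w_{1,2} & w_{1,1} & 0 \\ 0 & 0 & 0 \end{pmatrix}, \qquad dx_{2,2} = \sum_{i,j} dy_{i,j}\, w'_{i,j}$$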
Another example, in order to see what's happening: for dx_{4,3}, the indices of w are (4−i, 3−j).
Last one: dx_{4,4}.
We do see an “inverted” filter w′ popping up. This time we have a convolution between an input dy with a 0-padding border of size 1 and a filter w′ sliding with a stride of 1.
Summary of backprop equations
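A sketch of the three results in this simple setting (stride 1, no padding; w is zero-extended when its indices fall outside its range):

$$db = \sum_{i,j} dy_{i,j}$$
$$dw_{m,n} = \sum_{i,j} dy_{i,j}\, x_{i+m-1,\;j+n-1}$$
$$dx_{m,n} = \sum_{i,j} dy_{i,j}\, w_{m-i+1,\;n-j+1}$$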
Taking depths into account
Things become slightly more complex when we take depth into account (C channels for the input x, and F distinct filters for w).
Inputs:
- x: shape (C, H, W)
- w: filter’s weights shape (F, C, HH, WW)
- b: shape (F,)
Outputs:
- y: shape (F, H’, W’)
The math formulas now involve many indices, which makes them harder to read. The forward pass formula in our example becomes:
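A sketch of this formula, keeping stride 1 and no padding:

$$y_{f,k,l} = b_f + \sum_{c=1}^{C} \sum_{i=1}^{HH} \sum_{j=1}^{WW} w_{f,c,i,j}\, x_{c,\,k+i-1,\;l+j-1}$$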
db
db computation remains easy as each b_f is related to an activation map y_f:
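A sketch of the result:

$$db_f = \sum_{i,j} dy_{f,i,j}$$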
dw
Using the forward pass formula, as the double sum does not use dy indices, we can write:
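A sketch of the result (same index conventions as before):

$$dw_{f,c,m,n} = \sum_{i,j} dy_{f,i,j}\, x_{c,\,i+m-1,\;j+n-1}$$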
Algorithm
Now that we have the intuition of how it works, we choose not to write the entire set of equations (which can be pretty tedious); instead we will reuse what has been coded for the forward pass and, playing with the dimensions, code the backprop for each gradient. Fortunately, we can compute a numerical value of the gradient to check our implementation. This implementation is only valid for stride = 1; things become slightly more complex with a different stride, and another approach is needed. Maybe for another post!
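As a sketch of that approach, here is a naive NumPy implementation of the backward pass for stride = 1 (the function name conv_backward_naive_sketch and its internals are illustrative, not necessarily the exact code of the assignment):

```python
import numpy as np

def conv_backward_naive_sketch(dout, cache):
    """Naive backward pass for a convolutional layer (stride = 1 only).

    dout: upstream gradient dy, of shape (N, F, H', W')
    cache: (x, w, b, conv_param) as stored by the forward pass
    Returns dx, dw, db with the same shapes as x, w and b.
    """
    x, w, b, conv_param = cache
    pad = conv_param['pad']
    N, C, H, W = x.shape
    F, _, HH, WW = w.shape
    _, _, H_out, W_out = dout.shape

    # Work on the padded input, then crop the padding at the very end.
    x_pad = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode='constant')
    dx_pad = np.zeros_like(x_pad)
    dw = np.zeros_like(w)

    # db: each bias b_f receives the sum of the gradient of its activation map.
    db = dout.sum(axis=(0, 2, 3))

    for n in range(N):              # over data points
        for f in range(F):          # over filters
            for k in range(H_out):  # over output rows
                for l in range(W_out):  # over output columns
                    window = x_pad[n, :, k:k + HH, l:l + WW]
                    # dw: the input window seen by this output, scaled by dy.
                    dw[f] += window * dout[n, f, k, l]
                    # dx: the filter weights, scaled by dy, accumulated
                    # at the location of the window.
                    dx_pad[n, :, k:k + HH, l:l + WW] += w[f] * dout[n, f, k, l]

    # Remove the padding to recover dx with the shape of x.
    dx = dx_pad[:, :, pad:pad + H, pad:pad + W]
    return dx, dw, db
```

Each output position contributes its upstream gradient both to the filter (through the input window it saw) and to the input (through the filter weights), which is why both accumulations live in the same loop.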
Gradient numerical check
Testing conv_backward_naive function
dx error: 7.489787768926947e-09
dw error: 1.381022780971562e-10
db error: 1.1299800330640326e-10
Almost 0 each time, everything seems to be OK! 🙂
References
Comments are welcome to improve this post, feel free to contact me!