Does a CNN Have Back-Propagation? – My Next Interview Question (Leaked)

While I have been travelling here and there on my deep learning journey, I had to stop midway because a question popped into my mind.

Does a CNN have back-propagation?

Thinking back, I never really gave much thought to back-propagation after Andrew Ng’s Coursera course taught what it is. After seeing a few more videos about updating weights, I thought, “Alright, it’s simple: find the error, differentiate at every layer starting from the output layer, and update each weight by subtracting the learning rate times its gradient.”
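
In code, that mental model is just plain gradient descent. Here is a minimal sketch; every name and number below is made up purely for illustration:

```python
# Plain gradient descent on a single weight, as a rough sketch of the idea above.
learning_rate = 0.01
weight = 0.5
grad = 0.2                                # dL/dw for this weight, obtained via back-propagation

weight = weight - learning_rate * grad    # the "update weights" step
```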

Well guys, that is indeed the case for a simple ANN. But what about more complex structures like CNNs and RNNs? If you have not given it much thought in your learning, blame it on Keras… that’s what I do. (Keras is making my machine intelligent and me dumber by abstracting everything.) Anyways…

The answer is YES! A CNN does use back-propagation.

Here is how you could have arrived at that answer by applying some logic: a basic ANN uses weights as its learnable parameters. What does a CNN use?

Its convolution filters, or masks as some people call them.

A CNN updates the values of those filters during back-propagation, correcting them so that they become better at producing the desired output, and that is how it learns.
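
If you want to see this in Keras terms, here is a small sketch (assuming TensorFlow/Keras is available; the layer sizes and input shape are arbitrary) showing that the trainable parameters of a convolutional layer are exactly its filters (the kernel) plus a bias, which the optimizer updates through back-propagation:

```python
import tensorflow as tf

# Build a standalone Conv2D layer so we can inspect its trainable parameters.
layer = tf.keras.layers.Conv2D(filters=8, kernel_size=3)
layer.build(input_shape=(None, 28, 28, 1))   # arbitrary input shape

for w in layer.trainable_weights:
    print(w.name, w.shape)
# kernel: shape (3, 3, 1, 8) -> eight 3x3x1 filters, updated by back-propagation
# bias:   shape (8,)
```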

For the folks who are keen on learning more, here is how it works.

We can imagine a CNN as a massive computational graph. Let us say we have a gate f in that computational graph, with inputs x and y, which outputs z.


We can easily compute the local gradients by differentiating z with respect to x and y: ∂z/∂x and ∂z/∂y.

For the forward pass, we move across the CNN layer by layer and at the end obtain the loss using the loss function. When we start to work the loss backwards, layer by layer, this gate receives the gradient of the loss with respect to its output, ∂L/∂z, from the layer ahead of it (the one closer to the output). In order for the loss to be propagated to the other gates, we need to find ∂L/∂x and ∂L/∂y.


Using the chain rule, we can calculate ∂L/∂x and ∂L/∂y, which feed the gates further back in the computational graph:

∂L/∂x = ∂L/∂z · ∂z/∂x
∂L/∂y = ∂L/∂z · ∂z/∂y
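
Here is a minimal Python sketch of that single gate, assuming (purely for illustration) that f is a multiply gate, z = x · y, and that the incoming loss gradient ∂L/∂z is just a made-up number:

```python
# Forward pass through one gate f (here: z = x * y)
x, y = 3.0, -4.0
z = x * y

# Backward pass: the loss gradient dL/dz arrives from the layer ahead
dL_dz = 2.0                 # made-up value for illustration

# Local gradients of this gate: dz/dx = y, dz/dy = x
dL_dx = dL_dz * y           # chain rule: dL/dx = dL/dz * dz/dx
dL_dy = dL_dz * x           # chain rule: dL/dy = dL/dz * dz/dy

print(dL_dx, dL_dy)         # these flow on to the gates further back in the graph
```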

That’s really it. Let’s apply it to a CNN.

Now, let’s assume the function f is a convolution between an Input X and a Filter F, where Input X is a 3×3 matrix and Filter F is a 2×2 matrix, as shown below:

X = | X11  X12  X13 |        F = | F11  F12 |
    | X21  X22  X23 |            | F21  F22 |
    | X31  X32  X33 |

Convolution between Input X and Filter F gives us a 2×2 output O. This can be represented as:

O11 = X11·F11 + X12·F12 + X21·F21 + X22·F22
O12 = X12·F11 + X13·F12 + X22·F21 + X23·F22
O21 = X21·F11 + X22·F12 + X31·F21 + X32·F22
O22 = X22·F11 + X23·F12 + X32·F21 + X33·F22
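
A quick NumPy sketch of this forward pass, implemented as CNNs actually do it (a "valid" convolution, i.e. a cross-correlation without flipping the filter); the example values are made up:

```python
import numpy as np

X = np.arange(1.0, 10.0).reshape(3, 3)     # 3x3 input (made-up values)
F = np.array([[1.0, 0.5],
              [-0.5, 2.0]])                # 2x2 filter (made-up values)

O = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        O[i, j] = np.sum(X[i:i+2, j:j+2] * F)   # O_ij = sum of (2x2 patch of X) * F
print(O)
```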

This gives us the forward pass! Let’s get to the backward pass. As mentioned earlier, during the backward pass we receive the gradient of the loss with respect to the Output O, ∂L/∂O, from the layer ahead of us. Combining this with the chain rule, we get:

∂L/∂X = ∂L/∂O · ∂O/∂X
∂L/∂F = ∂L/∂O · ∂O/∂F

As seen above, we can find the local gradients of the Output O, ∂O/∂X and ∂O/∂F. With the incoming loss gradient ∂L/∂O and the chain rule, we can then calculate ∂L/∂X and ∂L/∂F.

And why do we need to find ∂L/∂X and ∂L/∂F?

∂L/∂F tells us how to update the filter values, and ∂L/∂X is passed back to the previous layer as its own loss gradient, so that back-propagation can continue through the rest of the network.

So let’s find the gradients for X and F: ∂L/∂X and ∂L/∂F.

Finding ∂L/∂F

Step 1: Finding the local gradient — ∂O/∂F:

This means we have to differentiate the Output matrix O with respect to the Filter F. From our convolution operation:

∂O11/∂F11 = X11    ∂O11/∂F12 = X12    ∂O11/∂F21 = X21    ∂O11/∂F22 = X22
∂O12/∂F11 = X12    ∂O12/∂F12 = X13    ∂O12/∂F21 = X22    ∂O12/∂F22 = X23
∂O21/∂F11 = X21    ∂O21/∂F12 = X22    ∂O21/∂F21 = X31    ∂O21/∂F22 = X32
∂O22/∂F11 = X22    ∂O22/∂F12 = X23    ∂O22/∂F21 = X32    ∂O22/∂F22 = X33    (Equation A)

Step 2: Using the Chain rule:

As described in our earlier example, we need to find ∂L/∂F as:

∂L/∂F11 = ∂L/∂O11 · ∂O11/∂F11 + ∂L/∂O12 · ∂O12/∂F11 + ∂L/∂O21 · ∂O21/∂F11 + ∂L/∂O22 · ∂O22/∂F11
∂L/∂F12 = ∂L/∂O11 · ∂O11/∂F12 + ∂L/∂O12 · ∂O12/∂F12 + ∂L/∂O21 · ∂O21/∂F12 + ∂L/∂O22 · ∂O22/∂F12
∂L/∂F21 = ∂L/∂O11 · ∂O11/∂F21 + ∂L/∂O12 · ∂O12/∂F21 + ∂L/∂O21 · ∂O21/∂F21 + ∂L/∂O22 · ∂O22/∂F21
∂L/∂F22 = ∂L/∂O11 · ∂O11/∂F22 + ∂L/∂O12 · ∂O12/∂F22 + ∂L/∂O21 · ∂O21/∂F22 + ∂L/∂O22 · ∂O22/∂F22

Substituting the values of the local gradients ∂O/∂F from Equation A, we get:

∂L/∂F11 = ∂L/∂O11 · X11 + ∂L/∂O12 · X12 + ∂L/∂O21 · X21 + ∂L/∂O22 · X22
∂L/∂F12 = ∂L/∂O11 · X12 + ∂L/∂O12 · X13 + ∂L/∂O21 · X22 + ∂L/∂O22 · X23
∂L/∂F21 = ∂L/∂O11 · X21 + ∂L/∂O12 · X22 + ∂L/∂O21 · X31 + ∂L/∂O22 · X32
∂L/∂F22 = ∂L/∂O11 · X22 + ∂L/∂O12 · X23 + ∂L/∂O21 · X32 + ∂L/∂O22 · X33

If you look at it closely, this represents an operation we are quite familiar with: a convolution between the input X and the loss gradient ∂L/∂O, as shown below:

∂L/∂F = Convolution(Input X, Loss gradient ∂L/∂O)
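
To convince yourself, here is a small NumPy check (all values made up): it computes ∂L/∂F with the convolution formula above and compares it against a finite-difference estimate, using a loss that is linear in O, L = Σ (∂L/∂O · O), so that ∂L/∂O is exactly the matrix we choose:

```python
import numpy as np

def conv_valid(a, f):
    """'Valid' convolution (cross-correlation) of a with a smaller filter f."""
    h = a.shape[0] - f.shape[0] + 1
    w = a.shape[1] - f.shape[1] + 1
    return np.array([[np.sum(a[i:i+f.shape[0], j:j+f.shape[1]] * f)
                      for j in range(w)] for i in range(h)])

X = np.arange(1.0, 10.0).reshape(3, 3)
F = np.array([[1.0, 0.5], [-0.5, 2.0]])
dL_dO = np.array([[0.1, -0.2], [0.3, 0.4]])

dL_dF = conv_valid(X, dL_dO)      # the claimed formula: convolution of X with dL/dO

# Finite-difference check of dL/dF, with L = sum(dL_dO * O)
eps = 1e-6
numeric = np.zeros_like(F)
for k in range(2):
    for l in range(2):
        Fp, Fm = F.copy(), F.copy()
        Fp[k, l] += eps
        Fm[k, l] -= eps
        numeric[k, l] = (np.sum(dL_dO * conv_valid(X, Fp)) -
                         np.sum(dL_dO * conv_valid(X, Fm))) / (2 * eps)

print(np.allclose(dL_dF, numeric))   # True
```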

Finding ∂L/∂X:

Step 1: Finding the local gradient — ∂O/∂X:

Similar to how we found the local gradients earlier, we can find ∂O/∂X as:

∂O11/∂X11 = F11    ∂O11/∂X12 = F12    ∂O11/∂X21 = F21    ∂O11/∂X22 = F22
∂O12/∂X12 = F11    ∂O12/∂X13 = F12    ∂O12/∂X22 = F21    ∂O12/∂X23 = F22
∂O21/∂X21 = F11    ∂O21/∂X22 = F12    ∂O21/∂X31 = F21    ∂O21/∂X32 = F22
∂O22/∂X22 = F11    ∂O22/∂X23 = F12    ∂O22/∂X32 = F21    ∂O22/∂X33 = F22
(all other partial derivatives are zero)    (Equation B)

Step 2: Using the Chain rule:

∂L/∂Xij = ∂L/∂O11 · ∂O11/∂Xij + ∂L/∂O12 · ∂O12/∂Xij + ∂L/∂O21 · ∂O21/∂Xij + ∂L/∂O22 · ∂O22/∂Xij    (for every element Xij of the input)

Expanding this and substituting from Equation B, we get:

∂L/∂X11 = ∂L/∂O11 · F11
∂L/∂X12 = ∂L/∂O11 · F12 + ∂L/∂O12 · F11
∂L/∂X13 = ∂L/∂O12 · F12
∂L/∂X21 = ∂L/∂O11 · F21 + ∂L/∂O21 · F11
∂L/∂X22 = ∂L/∂O11 · F22 + ∂L/∂O12 · F21 + ∂L/∂O21 · F12 + ∂L/∂O22 · F11
∂L/∂X23 = ∂L/∂O12 · F22 + ∂L/∂O22 · F12
∂L/∂X31 = ∂L/∂O21 · F21
∂L/∂X32 = ∂L/∂O21 · F22 + ∂L/∂O22 · F21
∂L/∂X33 = ∂L/∂O22 · F22

Even this can be represented as a convolution operation: ∂L/∂X can be written as a ‘full’ convolution between a 180-degree rotated Filter F and the loss gradient ∂L/∂O.

Flipping F by 180°:

F = | F11  F12 |        F rotated 180° = | F22  F21 |
    | F21  F22 |                         | F12  F11 |
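
In NumPy terms, this 180-degree flip is just reversing both axes of the filter, for example:

```python
import numpy as np

F = np.array([[1, 2],
              [3, 4]])
print(np.flip(F))   # [[4 3]
                    #  [2 1]]  -> the filter rotated by 180 degrees
```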

Now, let us do a ‘full’ convolution between this flipped Filter F and ∂L/∂O, which can be visualized as below: (It is like sliding one matrix over another from right to left, bottom to top)

[Figure: the flipped 2×2 Filter F sliding over the zero-padded loss gradient ∂L/∂O in a ‘full’ convolution, producing a 3×3 result]

The full convolution above generates the values of ∂L/∂X, and hence we can represent ∂L/∂X as:

∂L/∂X = Full Convolution(180° rotated Filter F, Loss gradient ∂L/∂O)
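
And a matching NumPy check for this result (again with made-up values and the same linear loss as before): it computes ∂L/∂X as a ‘full’ convolution between the 180-degree rotated filter and ∂L/∂O, and compares it against a finite-difference estimate:

```python
import numpy as np

def conv_valid(a, f):
    h = a.shape[0] - f.shape[0] + 1
    w = a.shape[1] - f.shape[1] + 1
    return np.array([[np.sum(a[i:i+f.shape[0], j:j+f.shape[1]] * f)
                      for j in range(w)] for i in range(h)])

def conv_full(a, f):
    """'Full' convolution: zero-pad a by (filter size - 1) on each side, then slide f."""
    padded = np.pad(a, ((f.shape[0] - 1,) * 2, (f.shape[1] - 1,) * 2))
    return conv_valid(padded, f)

X = np.arange(1.0, 10.0).reshape(3, 3)
F = np.array([[1.0, 0.5], [-0.5, 2.0]])
dL_dO = np.array([[0.1, -0.2], [0.3, 0.4]])

dL_dX = conv_full(dL_dO, np.flip(F))   # full convolution with the 180-degree rotated filter

# Finite-difference check of dL/dX, with L = sum(dL_dO * O)
eps = 1e-6
numeric = np.zeros_like(X)
for m in range(3):
    for n in range(3):
        Xp, Xm = X.copy(), X.copy()
        Xp[m, n] += eps
        Xm[m, n] -= eps
        numeric[m, n] = (np.sum(dL_dO * conv_valid(Xp, F)) -
                         np.sum(dL_dO * conv_valid(Xm, F))) / (2 * eps)

print(np.allclose(dL_dX, numeric))     # True
```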

Well, now that we have found ∂L/∂X and ∂L/∂F, we can come to this conclusion:

Both the Forward pass and the Backpropagation of a Convolutional layer are Convolutions

Summing it up:

A CNN uses back-propagation, and that back-propagation is not just a set of simple weight derivatives as in a plain ANN; the gradients themselves turn out to be convolution operations, as given below.

∂L/∂F = Convolution(Input X, Loss gradient ∂L/∂O)
∂L/∂X = Full Convolution(180° rotated Filter F, Loss gradient ∂L/∂O)

As far as the interview is concerned…

I would still give you full marks if you say that the “filters” are updated using back-propagation, instead of calling them weights.

And I would put in a word for you for the next round if you rightly answered that the back-propagation itself is also carried out through convolution operations.