Does CNN Have Back-Propagation – My Next Interview Question (leaked)
While I have been travelling here and there on my deep learning journey, I had to stop midway because a question popped up in my mind.
Does CNN have Back-propagation?
Thinking back, I never really gave much thought to back-propagation since Andrew Ng’s Coursera course taught what it is, and after seeing a few more videos about updating weights I thought, “Alright, it’s simple: find the error, differentiate at every layer starting from the output layer, and update each weight by subtracting the learning rate times its gradient.”
Well guys, that is indeed the case for a simple ANN, but what about complex structures like CNNs and RNNs? If you have not given it much thought in your learning, blame it on Keras… that’s what I do. (Keras is making my machine intelligent and me dumber by abstracting everything away.) Anyways…
The answer is YES!!!! CNNs do use back-propagation.
Here is how you could have arrived at that answer just by applying logic: a basic ANN uses weights as its learnable parameters. What does a CNN use?
It’s convolution filters, or masks as some people call them.
A CNN updates the values of those filters during back-propagation, correcting them so the network gets better at matching the expected output, and that is how it learns.
For the folks who are keen to learn more… here is how it works.
We can imagine a CNN as a massive computational graph. Let us say we have a gate f in that computational graph with inputs x and y, which outputs z.
We can easily compute the local gradients, the derivatives of z with respect to x and y: ∂z/∂x and ∂z/∂y.
For the forward pass, we move through the CNN layer by layer and at the end obtain the loss, using the loss function. When we start to work the loss backwards, layer by layer, the gate receives the gradient of the loss with respect to its output, ∂L/∂z, from the next layer (the one closer to the loss). For the loss to be propagated to the other gates, we need to find ∂L/∂x and ∂L/∂y.
Using the chain rule we can calculate ∂L/∂x and ∂L/∂y, which then feed the other gates in the extended computational graph.
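Written out, the chain rule at this single gate is just two multiplications (a minimal sketch, using the same symbols as above):

```latex
\frac{\partial L}{\partial x}
  = \frac{\partial L}{\partial z}\cdot\frac{\partial z}{\partial x},
\qquad
\frac{\partial L}{\partial y}
  = \frac{\partial L}{\partial z}\cdot\frac{\partial z}{\partial y}
```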
That’s really it – let’s apply it to a CNN.
Now, let’s assume the function f is a convolution between an input X and a filter F, where X is a 3×3 matrix with entries X11 … X33 and F is a 2×2 matrix with entries F11, F12, F21 and F22.
Convolving the input X with the filter F gives us a 2×2 output O. Entry by entry, this can be represented as:
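Writing the output entries out explicitly (a sketch using the no-flip, cross-correlation form of “convolution” that deep learning libraries actually compute in the forward pass):

```latex
\begin{aligned}
O_{11} &= X_{11}F_{11} + X_{12}F_{12} + X_{21}F_{21} + X_{22}F_{22}\\
O_{12} &= X_{12}F_{11} + X_{13}F_{12} + X_{22}F_{21} + X_{23}F_{22}\\
O_{21} &= X_{21}F_{11} + X_{22}F_{12} + X_{31}F_{21} + X_{32}F_{22}\\
O_{22} &= X_{22}F_{11} + X_{23}F_{12} + X_{32}F_{21} + X_{33}F_{22}
\end{aligned}
```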
That gives us the forward pass! Let’s get to the backward pass. As mentioned earlier, during the backward pass we receive the loss gradient with respect to the output O, ∂L/∂O, from the next layer. We then combine it with the chain rule, exactly as in the gate example.
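In symbols (here ∂O/∂F and ∂O/∂X are the local gradients of the convolution itself):

```latex
\frac{\partial L}{\partial F}
  = \frac{\partial L}{\partial O}\cdot\frac{\partial O}{\partial F},
\qquad
\frac{\partial L}{\partial X}
  = \frac{\partial L}{\partial O}\cdot\frac{\partial O}{\partial X}
```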
So we need two local gradients of the output O: ∂O/∂X and ∂O/∂F. Combined with the loss gradient ∂L/∂O coming in from the next layer, the chain rule gives us ∂L/∂X and ∂L/∂F.
And why do we need to find ∂L/∂X and ∂L/∂F? ∂L/∂F tells us how to update the filter, and ∂L/∂X is what gets passed back as the loss gradient to the layer before this one.
So let’s find the gradients for X and F: ∂L/∂X and ∂L/∂F.
Finding ∂L/∂F
Step 1: Finding the local gradient — ∂O/∂F:
This means we have to differentiate the output matrix O with respect to the filter F. From our convolution equations above, we get:
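Each output entry is a linear function of the filter entries, so the local gradients are just the corresponding input values (a sketch of the full table; call it Equation A, since the steps below refer back to it):

```latex
\begin{aligned}
\frac{\partial O_{11}}{\partial F_{11}} &= X_{11}, &
\frac{\partial O_{11}}{\partial F_{12}} &= X_{12}, &
\frac{\partial O_{11}}{\partial F_{21}} &= X_{21}, &
\frac{\partial O_{11}}{\partial F_{22}} &= X_{22}\\
\frac{\partial O_{12}}{\partial F_{11}} &= X_{12}, &
\frac{\partial O_{12}}{\partial F_{12}} &= X_{13}, &
\frac{\partial O_{12}}{\partial F_{21}} &= X_{22}, &
\frac{\partial O_{12}}{\partial F_{22}} &= X_{23}\\
\frac{\partial O_{21}}{\partial F_{11}} &= X_{21}, &
\frac{\partial O_{21}}{\partial F_{12}} &= X_{22}, &
\frac{\partial O_{21}}{\partial F_{21}} &= X_{31}, &
\frac{\partial O_{21}}{\partial F_{22}} &= X_{32}\\
\frac{\partial O_{22}}{\partial F_{11}} &= X_{22}, &
\frac{\partial O_{22}}{\partial F_{12}} &= X_{23}, &
\frac{\partial O_{22}}{\partial F_{21}} &= X_{32}, &
\frac{\partial O_{22}}{\partial F_{22}} &= X_{33}
\end{aligned}
```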
Step 2: Using the Chain rule:
As described in the gate example above, ∂L/∂F is obtained by summing, over every output element, the incoming gradient ∂L/∂O times the corresponding local gradient ∂O/∂F.
Substituting the values of the local gradients from Equation A, we get:
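Concretely (a sketch, using the indexing introduced above):

```latex
\frac{\partial L}{\partial F_{mn}}
  = \sum_{i,j}\frac{\partial L}{\partial O_{ij}}\,\frac{\partial O_{ij}}{\partial F_{mn}}
```

which expands to

```latex
\begin{aligned}
\frac{\partial L}{\partial F_{11}} &= \frac{\partial L}{\partial O_{11}}X_{11} + \frac{\partial L}{\partial O_{12}}X_{12} + \frac{\partial L}{\partial O_{21}}X_{21} + \frac{\partial L}{\partial O_{22}}X_{22}\\
\frac{\partial L}{\partial F_{12}} &= \frac{\partial L}{\partial O_{11}}X_{12} + \frac{\partial L}{\partial O_{12}}X_{13} + \frac{\partial L}{\partial O_{21}}X_{22} + \frac{\partial L}{\partial O_{22}}X_{23}\\
\frac{\partial L}{\partial F_{21}} &= \frac{\partial L}{\partial O_{11}}X_{21} + \frac{\partial L}{\partial O_{12}}X_{22} + \frac{\partial L}{\partial O_{21}}X_{31} + \frac{\partial L}{\partial O_{22}}X_{32}\\
\frac{\partial L}{\partial F_{22}} &= \frac{\partial L}{\partial O_{11}}X_{22} + \frac{\partial L}{\partial O_{12}}X_{23} + \frac{\partial L}{\partial O_{21}}X_{32} + \frac{\partial L}{\partial O_{22}}X_{33}
\end{aligned}
```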
If you look at it closely, this represents an operation we are quite familiar with: it is a convolution between the input X and the loss gradient ∂L/∂O, as shown below.
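In compact form (with ∗ meaning the same “valid”, no-flip convolution as the forward pass):

```latex
\frac{\partial L}{\partial F} \;=\; X \ast \frac{\partial L}{\partial O}
```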
Finding ∂L/∂X:
Step 1: Finding the local gradient — ∂O/∂X:
Similar to how we found the local gradients earlier, we can find ∂O/∂X as:
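Each entry of ∂O/∂X is simply the filter weight that multiplies that input pixel, and it is zero when the pixel does not appear in that output at all. A few representative entries (call this Equation B, since the next step substitutes it):

```latex
\begin{aligned}
&\frac{\partial O_{11}}{\partial X_{11}} = F_{11}
  && \text{(a corner pixel appears in one output)}\\
&\frac{\partial O_{11}}{\partial X_{12}} = F_{12}, \quad
 \frac{\partial O_{12}}{\partial X_{12}} = F_{11}
  && \text{(an edge pixel appears in two)}\\
&\frac{\partial O_{11}}{\partial X_{22}} = F_{22}, \quad
 \frac{\partial O_{12}}{\partial X_{22}} = F_{21}, \quad
 \frac{\partial O_{21}}{\partial X_{22}} = F_{12}, \quad
 \frac{\partial O_{22}}{\partial X_{22}} = F_{11}
  && \text{(the centre pixel appears in all four)}
\end{aligned}
```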
Step 2: Using the Chain rule:
Expanding the chain rule and substituting from Equation B, we get:
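For example, for a corner pixel, an edge pixel and the centre pixel (the remaining entries follow the same pattern):

```latex
\frac{\partial L}{\partial X_{mn}}
  = \sum_{i,j}\frac{\partial L}{\partial O_{ij}}\,\frac{\partial O_{ij}}{\partial X_{mn}}
```

```latex
\begin{aligned}
\frac{\partial L}{\partial X_{11}} &= \frac{\partial L}{\partial O_{11}}F_{11}\\
\frac{\partial L}{\partial X_{12}} &= \frac{\partial L}{\partial O_{11}}F_{12} + \frac{\partial L}{\partial O_{12}}F_{11}\\
\frac{\partial L}{\partial X_{22}} &= \frac{\partial L}{\partial O_{11}}F_{22} + \frac{\partial L}{\partial O_{12}}F_{21} + \frac{\partial L}{\partial O_{21}}F_{12} + \frac{\partial L}{\partial O_{22}}F_{11}
\end{aligned}
```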
Even this can be represented as a convolution operation.
∂L/∂X can be represented as a ‘full’ convolution between a 180-degree rotated filter F and the loss gradient ∂L/∂O.
Flipping the filter F by 180°:
Now, let us do a ‘full’ convolution between this flipped filter F and ∂L/∂O: zero-pad ∂L/∂O and slide the flipped filter over it, including the positions where the two only partially overlap. (It is like sliding one matrix over the other from right to left, bottom to top.)
This full convolution generates exactly the values of ∂L/∂X listed above, and hence we can represent ∂L/∂X as:
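In compact form (writing rot180(F) for the flipped filter and ∗full for the zero-padded, ‘full’ convolution):

```latex
\frac{\partial L}{\partial X}
  \;=\; \operatorname{rot180}(F) \;\ast_{\text{full}}\; \frac{\partial L}{\partial O}
```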
Well, now that we have found both ∂L/∂X and ∂L/∂F, we can come to this conclusion:
Both the Forward pass and the Backpropagation of a Convolutional layer are Convolutions
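If you want to convince yourself numerically, here is a minimal NumPy sketch (the helpers `cross_correlate_valid` and `cross_correlate_full` are my own names, not library functions) that runs the 3×3 input / 2×2 filter example, computes both gradients with the two convolution formulas derived above, and checks them against finite differences:

```python
import numpy as np

def cross_correlate_valid(X, F):
    """'Valid' cross-correlation: slide F over X wherever it fits entirely.
    This is what CNN layers actually compute in the forward pass."""
    h = X.shape[0] - F.shape[0] + 1
    w = X.shape[1] - F.shape[1] + 1
    O = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            O[i, j] = np.sum(X[i:i + F.shape[0], j:j + F.shape[1]] * F)
    return O

def cross_correlate_full(X, F):
    """'Full' cross-correlation: zero-pad X so F covers every partial overlap too."""
    ph, pw = F.shape[0] - 1, F.shape[1] - 1
    return cross_correlate_valid(np.pad(X, ((ph, ph), (pw, pw))), F)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 3))       # 3x3 input, as in the example above
F = rng.standard_normal((2, 2))       # 2x2 filter
O = cross_correlate_valid(X, F)       # 2x2 forward-pass output

dL_dO = rng.standard_normal(O.shape)  # loss gradient arriving from the next layer

# Backward pass, using the two results derived above:
dL_dF = cross_correlate_valid(X, dL_dO)              # dL/dF = X convolved with dL/dO
dL_dX = cross_correlate_full(dL_dO, np.rot90(F, 2))  # dL/dX = full conv with 180-rotated F

# Check both against finite differences on L = sum(dL_dO * O)
eps = 1e-6

num_dF = np.zeros_like(F)
for i in range(F.shape[0]):
    for j in range(F.shape[1]):
        Fp = F.copy(); Fp[i, j] += eps
        num_dF[i, j] = np.sum(dL_dO * (cross_correlate_valid(X, Fp) - O)) / eps

num_dX = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        Xp = X.copy(); Xp[i, j] += eps
        num_dX[i, j] = np.sum(dL_dO * (cross_correlate_valid(Xp, F) - O)) / eps

print(np.allclose(dL_dF, num_dF, atol=1e-4))
print(np.allclose(dL_dX, num_dX, atol=1e-4))
```

Both checks should print `True`, which is a quick way to see that the backward pass of a convolutional layer really is just more convolutions.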
Summing it up:
A CNN does use back-propagation, and its backward pass is not just the plain layer-by-layer derivative of a simple ANN; it is itself a convolution operation, as given below.
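In symbols, the two results we derived:

```latex
\frac{\partial L}{\partial F} = X \ast \frac{\partial L}{\partial O},
\qquad
\frac{\partial L}{\partial X} = \operatorname{rot180}(F) \ast_{\text{full}} \frac{\partial L}{\partial O}
```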
As far as the interview is concerned…
I would still give you full marks if you said that the “filters” are updated using back-propagation, instead of calling them weights.
And I would put in a word for you for the next round if you rightly answered that the back-propagation itself is also done through a convolution.