Backpropagation in a Convolutional Neural Network

Demystifying Deep Learning: Part 9

September 10, 2018

Introduction

In this post, we will derive the backprop equations for Convolutional Neural Networks. Again, there is a Jupyter notebook accompanying this post, containing the code for classifying handwritten digits using a CNN written from scratch.

In a feedforward neural network we only had one type of layer (the fully-connected layer) to consider; in a CNN, however, we need to consider each type of layer separately.

The different layers to consider are:

  • Convolution Layer
  • ReLU Layer
  • Pooling Layer
  • Fully-Connected Layer
  • Softmax (Output) Layer

If you are not already comfortable with backpropagation in a feedforward neural network, I’d suggest looking at the earlier post on Backpropagation which contains some useful intuition and general principles on how to derive the algorithm.

I’ll restate the general principles here for convenience:

  • Partial Derivative Intuition: Think of $\frac{\partial y}{\partial x}$ loosely as quantifying how much $y$ would change if you gave the value of $x$ a little “nudge” at that point.

  • Breaking down computations – we can use the chain rule to aid us in our computation – rather than trying to compute the derivative in one fell swoop, we break up the computation into smaller intermediate steps.

  • Computing the chain rule – when thinking about which intermediate values to include in our chain rule expression, think about the immediate outputs of equations involving $x$ – which other values get directly affected when we slightly nudge $x$?

  • One element at a time – rather than worrying about the entire matrix $A$, we’ll instead look at an element $A_{ij}$. One equation we will refer to time and time again is:

    $$C_{ij} = \sum_k A_{ik}B_{kj} \iff C = A.B$$

    A useful tip when trying to go from one element to a matrix is to look for summations over repeated indices (here it was k) – this suggests a matrix multiplication.

    Another useful equation is the element-wise product of two matrices:

    $$C_{ij} = A_{ij}B_{ij} \iff C = A*B$$

    (Both identities are checked numerically in the short snippet just after this list.)

  • Sanity check the dimensions – check that the dimensions of the matrices all match (the derivative matrix should have the same dimensions as the original matrix, and all matrices being multiplied together should have dimensions that align).
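To make the two matrix identities above concrete, here is a quick numpy check (the matrices are just random throwaway examples, not anything from the network):

import numpy as np

# Element-wise definition of a matrix product vs. numpy's dot product.
A = np.random.randn(3, 4)
B = np.random.randn(4, 5)
C = np.array([[sum(A[i, k] * B[k, j] for k in range(4)) for j in range(5)] for i in range(3)])
assert np.allclose(C, A.dot(B))   # C_ij = sum_k A_ik B_kj  <=>  C = A.B

# Element-wise product vs. numpy's * operator.
D = np.random.randn(3, 4)
E = np.array([[A[i, j] * D[i, j] for j in range(4)] for i in range(3)])
assert np.allclose(E, A * D)      # C_ij = A_ij B_ij  <=>  C = A*B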

Convolution Layer

Recall that the forward pass’ equation for position $(i,j)$ in the $k^{th}$ activation map in the convolution layer is:

$$Z^{(m_i)}_{i,j,k} = \sum_a \sum_b \sum_c X^{(m_i)}_{i+a,j+b,c} * W_{a,b,c,k} + b_k$$

Since a convolution isn’t naturally expressed as a single matrix multiplication, we will just consider a single neuron/weight at a time.

Just like with a standard feedforward neural net, a nudge in the weights results in a nudge in $Z$ proportional to the magnitude of the input $X$ it is connected to. A nudge in the bias produces a nudge of the same magnitude in $Z$.

So considering our single neuron $Z^{(m_i)}_{i,j,k}$ and a weight $W_{a,b,c,k}$, the corresponding input is $X^{(m_i)}_{i+a,j+b,c}$ from the forward prop equation above. So:

$$\frac{\partial Z^{(m_i)}_{i,j,k}}{\partial W_{a,b,c,k}} = X^{(m_i)}_{i+a,j+b,c}$$

$$\frac{\partial Z^{(m_i)}_{i,j,k}}{\partial b_k} = 1$$

It helps to refer back to the equivalent-neurons representation of a convolution for the $k^{th}$ filter. Since the weights/bias are shared, a nudge in them affects the outputs of all neurons across the width and height of the activation map, so we sum partial derivatives over all of those neurons.

We also average the gradient across the batch of training examples.

So, using the chain rule, the backprop equations for the weights and bias are:

$$\frac{\partial J}{\partial W_{a,b,c,k}} = \frac{1}{m} \sum_{m_i} \sum_i \sum_j \frac{\partial J}{\partial Z^{(m_i)}_{i,j,k}} * X^{(m_i)}_{i+a,j+b,c}$$

$$\frac{\partial J}{\partial b_k} = \frac{1}{m} \sum_{m_i} \sum_i \sum_j \frac{\partial J}{\partial Z^{(m_i)}_{i,j,k}}$$

So the equations for the partial derivatives with respect to the weights and biases are fairly similar to that of a feedforward neural net, just with parameter sharing.
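Before vectorising, it can help to see these two equations transcribed literally as loops. This is only an illustrative sketch (the function name and the assumed array layout of (batch, height, width, channels/filters), matching the forward-pass equation, are mine), not the implementation used later:

import numpy as np

def conv_param_grads_naive(dZ, x_padded, f):
    """Literal transcription of the dW and db equations above (illustration only).

    dZ:       (m, H_out, W_out, K) gradient of the cost w.r.t. the conv layer's output.
    x_padded: (m, H, W, C) inputs, already zero-padded if 'same' padding was used.
    f:        filter size.
    """
    m, H_out, W_out, K = dZ.shape
    C = x_padded.shape[3]
    dW = np.zeros((f, f, C, K))
    for a in range(f):
        for b in range(f):
            for c in range(C):
                for k in range(K):
                    # sum over the batch and over every (i, j) position of the activation map
                    dW[a, b, c, k] = np.sum(dZ[:, :, :, k] *
                                            x_padded[:, a:a + H_out, b:b + W_out, c]) / m
    db = np.sum(dZ, axis=(0, 1, 2)) / m   # one scalar per filter
    return dW, db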

Now we need to compute the partial derivative with respect to the input XXX so we can propagate the gradient back to the previous layer. This is a little more involved.

Firstly, note that a nudge in the input affects all of the activation maps, so we sum across the activation maps. So now let’s consider the $k^{th}$ activation map.
(Figure: the convolution as a sliding filter, run backward.)

Now consider the representation of the convolution as a sliding filter operation. The filter slides over the input across the height and width dimensions, so for a given input pixel $X_{i,j,c}$ there are $F*F$ different outputs it is part of, depending on which part of the filter has scanned over it ($F$ is the filter size). To determine the corresponding output when $X_{i,j,c}$ is multiplied by $W_{a,b,c,k}$, note that in the forward pass, for $Z_{i,j,k}$ the corresponding input is offset by $(+a, +b)$ relative to the output (see the equation above), so from the perspective of the input, the output is offset by $(-a, -b)$. So given an input $X_{i,j,c}$ and weight $W_{a,b,c,k}$, the corresponding output is $Z_{i-a,j-b,k}$.

So the equation is:

$$\frac{\partial J}{\partial X^{(m_i)}_{i,j,c}} = \sum_k \sum_a \sum_b \frac{\partial J}{\partial Z^{(m_i)}_{i-a,j-b,k}} * W_{a,b,c,k}$$

Note that this is actually itself a convolution! When implementing, we need to zero-pad the output, since around the edges, the indices (i-a,j-b) may be negative (i.e. Zi−a,j−b,kZ_{i-a, j-b, k}Zi−a,j−b,k​ does not exist) – so we set these values to zero, as non-existent values shouldn’t contribute to the sum of the gradients.
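To see explicitly why this is a convolution, here is the equation above transcribed literally as loops for a single training example. This is an illustrative sketch only (the function name is made up, and it ignores the padding bookkeeping that the full implementation below handles by padding and then cropping):

import numpy as np

def conv_input_grad_naive(dZ, w):
    """Literal transcription of the dX equation above, for one training example.

    dZ: (H, W, K) gradient w.r.t. the conv layer's output ('same' padding, stride 1).
    w:  (f, f, C, K) filters.
    Out-of-range indices of dZ are treated as zero, i.e. the zero-padding described above.
    """
    H, W, K = dZ.shape
    f, _, C, _ = w.shape
    dX = np.zeros((H, W, C))
    for i in range(H):
        for j in range(W):
            for c in range(C):
                for k in range(K):
                    for a in range(f):
                        for b in range(f):
                            if 0 <= i - a < H and 0 <= j - b < W:
                                dX[i, j, c] += dZ[i - a, j - b, k] * w[a, b, c, k]
    return dX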

When implementing, we broadcast and vectorise the operations when calculating the gradient with respect to WWW and bbb.

import numpy as np
from scipy import ndimage

def conv_backward(dZ, x, w, padding="same"):
    m = x.shape[0]

    # Gradient w.r.t. the bias: sum dZ over the batch and spatial dimensions, one value per filter.
    db = (1 / m) * np.sum(dZ, axis=(0, 1, 2), keepdims=True)

    if padding == "same":
        pad = (w.shape[0] - 1) // 2
    else:
        pad = 0
    x_padded = np.pad(x, ((0, 0), (pad, pad), (pad, pad), (0, 0)),
                      'constant', constant_values=0)

    # Broadcast so dZ (one value per filter) lines up with x (one value per channel):
    # x_padded_bcast: (m, H_pad, W_pad, C, 1), dZ_bcast: (m, H_out, W_out, 1, K)
    x_padded_bcast = np.expand_dims(x_padded, axis=-1)
    dZ_bcast = np.expand_dims(dZ, axis=-2)

    # Gradient w.r.t. the weights: for each filter position (a, b), multiply dZ by the
    # input values that weight touched, then sum over the batch and spatial dimensions.
    dW = np.zeros_like(w)
    f = w.shape[0]
    w_x = x_padded.shape[1]
    for a in range(f):
        for b in range(f):
            dW[a, b, :, :] = (1 / m) * np.sum(
                dZ_bcast * x_padded_bcast[:, a:w_x - (f - 1 - a), b:w_x - (f - 1 - b), :, :],
                axis=(0, 1, 2))

    # Gradient w.r.t. the input: itself a convolution of the zero-padded dZ with the filters.
    dx = np.zeros_like(x_padded, dtype=float)
    Z_pad = f - 1
    dZ_padded = np.pad(dZ, ((0, 0), (Z_pad, Z_pad), (Z_pad, Z_pad), (0, 0)),
                       'constant', constant_values=0)
    for m_i in range(x.shape[0]):
        for k in range(w.shape[3]):
            for d in range(x.shape[3]):
                dx[m_i, :, :, d] += ndimage.convolve(
                    dZ_padded[m_i, :, :, k],
                    w[:, :, d, k])[f // 2:-(f // 2), f // 2:-(f // 2)]
    # Crop away the input padding so dx has the same shape as the original x.
    dx = dx[:, pad:dx.shape[1] - pad, pad:dx.shape[2] - pad, :]
    return dx, dW, db
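As a quick sanity check that the shapes line up (the sizes here are arbitrary, chosen only for illustration):

np.random.seed(0)
x = np.random.randn(4, 8, 8, 3)       # batch of 4 RGB 8x8 inputs
w = np.random.randn(3, 3, 3, 16)      # 16 filters of size 3x3
dZ = np.random.randn(4, 8, 8, 16)     # upstream gradient, one value per output neuron
dx, dW, db = conv_backward(dZ, x, w, padding="same")
print(dx.shape, dW.shape, db.shape)   # (4, 8, 8, 3) (3, 3, 3, 16) (1, 1, 1, 16)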

ReLU Layer

Recall that $ReLU(x) = \max(x, 0)$. When $x > 0$ this returns $x$, so the function is linear in this region (gradient is 1); when $x < 0$ the output is always 0 (a constant value), so the gradient is 0. NB the gradient at exactly 0 is technically undefined, but in practice we just set it to zero.

def relu(z, deriv=False):
    if deriv:
        return z > 0                  # gradient: 1 where z > 0, 0 elsewhere (boolean mask)
    else:
        return np.multiply(z, z > 0)
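A quick usage example (the numbers are arbitrary):

z = np.array([[-2.0, 0.5, 3.0]])
print(relu(z))               # negative entries are zeroed, positive entries pass through
print(relu(z, deriv=True))   # boolean mask: gradient 0 where z <= 0, gradient 1 where z > 0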

Pooling Layer

For the pooling layer, we do not need to explicitly compute partial derivatives, since there are no parameters. Instead we are concerned with how to distribute the gradient to each of the values in the corresponding input 2×2 patch.

There are two options:

  • Max Pooling
  • Average Pooling

(Figure: max pooling, backward pass.)

Max Pooling:

Intuitively a nudge in the non-max values of each 2×2 patch will not affect the output, since the output is only concerned about the max value in the patch. Therefore the non-max values have a gradient of 0.

For the max value in each patch, since the output is just that value, a nudge in the max value will have a corresponding magnitude nudge in the output – so the gradient is 1.

So effectively, we are just routing the gradient through only the max value of that corresponding input patch.

Average Pooling:

Here, for a given 2×2 patch $X$, the output $Z$ is given by the mean of the values, i.e.:

$$Z = \frac{X_{11} + X_{12} + X_{21} + X_{22}}{4}$$

Thus a nudge in any of these values in the patch will produce an output nudge that is only a quarter of the magnitude – i.e. the gradient here is 0.25 for all values across the image.

Intuitively think of this as sharing the gradient equally between the values in the patch.

It is worth bringing up the code from the forward pass, to detail how the mask, which is used to allocate gradients, is created.

Max Pooling:

We use np.repeat() to copy one output value across the corresponding 2×2 patch (i.e. we double the width/height of the output) – so it has the same dimensionality as the input.

To get the position of the max element in the patch, rather than doing an argmax, a simple trick is to elementwise compare the input with the scaled up output, since the output is only equal to the max value of the patch.

One subtlety with floating point multiplication is that we can get floating point rounding errors, so we use np.isclose() rather than np.equal() to have a tolerance for this error.

We then convert the mask to int type explicitly, to be used in the backward pass – so it zeros the gradient for the non-max values.

Average Pooling:

Here the gradient is just 0.25 for all values, so we create a mask matrix of the same dimensionality of all 0.25s.

For the backward pass, we scale the gradient matrix up by copying the value of the gradient for each patch to all values in that 2×2 patch, then we allocate gradients by applying the pre-computed mask in the forward pass.

def pool_forward(x, mode="max"):
    # Split the image into non-overlapping 2x2 patches: (m, H, W, C) -> (m, H//2, 2, W//2, 2, C)
    x_patches = x.reshape(x.shape[0], x.shape[1] // 2, 2, x.shape[2] // 2, 2, x.shape[3])
    if mode == "max":
        out = x_patches.max(axis=2).max(axis=3)
        # Mask is 1 at the max of each patch, 0 elsewhere; np.isclose guards against
        # floating point rounding errors.
        mask = np.isclose(x, np.repeat(np.repeat(out, 2, axis=1), 2, axis=2)).astype(int)
    elif mode == "average":
        out = x_patches.mean(axis=2).mean(axis=3)   # average over each 2x2 patch
        mask = np.ones_like(x) * 0.25               # every value contributes a quarter of the output
    return out, mask

def pool_backward(dx, mask):
    # Scale the gradient back up to the input's size, then allocate it using the mask.
    return mask * (np.repeat(np.repeat(dx, 2, axis=1), 2, axis=2))
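Here is a tiny worked example (a single 4×4, single-channel image with a batch size of 1, values chosen arbitrarily) showing the mask and the gradient routing:

x = np.array([[1., 3., 0., 2.],
              [2., 0., 1., 4.],
              [5., 1., 2., 2.],
              [0., 6., 3., 1.]]).reshape(1, 4, 4, 1)
out, mask = pool_forward(x, mode="max")
print(out[0, :, :, 0])    # [[3. 4.]
                          #  [6. 3.]]
print(mask[0, :, :, 0])   # 1 only at the max of each 2x2 patch, 0 elsewhere
dout = np.ones_like(out)  # pretend the upstream gradient is all ones
print(pool_backward(dout, mask)[0, :, :, 0])   # gradient routed only to the max positions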


Fully-Connected Layer

The fully-connected layer is identical to that used in the feedforward neural network, so we will skip the derivation (see the original backprop post) and just list the equations below.

$$\frac{\partial J}{\partial W^{[l]}} = \frac{1}{m} \frac{\partial J}{\partial Z^{[l]}}.A^{[l-1]T}$$

$$\frac{\partial J}{\partial b^{[l]}} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial J}{\partial Z^{[l](i)}}$$

$$\frac{\partial J}{\partial A^{[l-1]}} = W^{[l]T}.\frac{\partial J}{\partial Z^{[l]}}$$

The code is the same as the feedforward network layer, so again we’ll just list it below:

def fc_backward(dA, a, x, w, b):
    m = dA.shape[1]                      # number of training examples (one per column)
    dZ = dA * relu(a, deriv=True)        # backprop through the ReLU activation
    dW = (1 / m) * dZ.dot(x.T)
    db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)
    dx = np.dot(w.T, dZ)                 # gradient passed back to the previous layer
    return dx, dW, db
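A quick shape check, with made-up layer sizes (columns are training examples):

x = np.random.randn(20, 5)            # 5 examples with 20 features each
w = np.random.randn(10, 20)           # 10 hidden units
b = np.zeros((10, 1))
a = relu(w.dot(x) + b)                # this layer's forward pass
dA = np.random.randn(10, 5)           # gradient arriving from the next layer
dx, dW, db = fc_backward(dA, a, x, w, b)
print(dx.shape, dW.shape, db.shape)   # (20, 5) (10, 20) (10, 1)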

Softmax Layer

The equation we are concerned with computing is:

$$\frac{\partial J}{\partial Z^{[L]}} = \frac{\partial J}{\partial A^{[L]}} * g'(Z^{[L]})$$

The rest of the equations for the output layer are identical to the fully-connected layer, since the softmax layer only differs in activation function.

Recall that the cost function is the cross-entropy cost function:

$$J(W,b) = \frac{-1}{m} \sum_{i=1}^{m} \sum_k y^{(i)}_k \log(\hat{y}^{(i)}_k)$$

Again, we can consider a single output layer node – so consider the node corresponding to the $i^{th}$ training example and the $j^{th}$ class, $a^{(i)[L]}_j = \hat{y}^{(i)}_j$.

$$\frac{\partial J}{\partial \hat{y}^{(i)}_j} = \frac{-y^{(i)}_j}{\hat{y}^{(i)}_j}$$

Next we consider the effect of nudging the weighted input of the output node corresponding to the $i^{th}$ training example and the $j^{th}$ class, $z^{(i)[L]}_j$.

Recall the equation for the softmax layer:

$$softmax(z^{(i)[L]}_j) = \frac{e^{z^{(i)[L]}_j}}{\sum_k e^{z^{(i)[L]}_k}}$$

Clearly a nudge in $z^{(i)[L]}_j$ will affect the value of the corresponding output node $a^{(i)[L]}_j$, since we exponentiate $z^{(i)[L]}_j$ in the numerator.

However, there is another subtlety. Since the nudge affects the value $e^{z^{(i)[L]}_j}$, it will affect the sum of the exponentiated values, so it will affect the outputs of all of the output nodes, since the denominator of every node is that same sum of exponentiated values.

Let’s consider the two cases separately.

First, let’s compute the derivative of the corresponding output $a^{(i)[L]}_j$ with respect to $z^{(i)[L]}_j$ – we use the quotient rule, since both the numerator and denominator depend on $z^{(i)[L]}_j$.

$$\frac{\partial a^{(i)[L]}_j}{\partial z^{(i)[L]}_j} = \frac{e^{z^{(i)[L]}_j}}{\sum_k e^{z^{(i)[L]}_k}} - \left(\frac{e^{z^{(i)[L]}_j}}{\sum_k e^{z^{(i)[L]}_k}}\right)^2 = \hat{y}^{(i)}_j(1 - \hat{y}^{(i)}_j)$$

Next, consider the derivative of a different node $a^{(i)[L]}_m$ with respect to $z^{(i)[L]}_j$. Here the numerator doesn’t depend on $z^{(i)[L]}_j$, since we have exponentiated a different node $z^{(i)[L]}_m$.

So the derivative is just:

$$\frac{\partial a^{(i)[L]}_m}{\partial z^{(i)[L]}_j} = -\frac{e^{z^{(i)[L]}_j} e^{z^{(i)[L]}_m}}{\left(\sum_k e^{z^{(i)[L]}_k}\right)^2} = -\hat{y}^{(i)}_j * \hat{y}^{(i)}_m$$

Since $a^{(i)[L]}_m = \hat{y}^{(i)}_m$, and a nudge in $z^{(i)[L]}_j$ affects all output nodes, we sum partial derivatives across nodes, so using the chain rule we have:

$$\frac{\partial J}{\partial z^{(i)[L]}_j} = \sum_k \frac{\partial J}{\partial \hat{y}^{(i)}_k} * \frac{\partial \hat{y}^{(i)}_k}{\partial z^{(i)[L]}_j}$$

Again, we split into the two cases:

$$\frac{\partial J}{\partial z^{(i)[L]}_j} = \frac{\partial J}{\partial \hat{y}^{(i)}_j} * \frac{\partial \hat{y}^{(i)}_j}{\partial z^{(i)[L]}_j} + \sum_{k \neq j} \frac{\partial J}{\partial \hat{y}^{(i)}_k} * \frac{\partial \hat{y}^{(i)}_k}{\partial z^{(i)[L]}_j}$$

$$\frac{\partial J}{\partial z^{(i)[L]}_j} = \frac{-y^{(i)}_j}{\hat{y}^{(i)}_j} * \hat{y}^{(i)}_j(1 - \hat{y}^{(i)}_j) + \sum_{k \neq j} \frac{-y^{(i)}_k}{\hat{y}^{(i)}_k} * \left(-\hat{y}^{(i)}_j \hat{y}^{(i)}_k\right)$$

Tidying up: the first term simplifies to $-y^{(i)}_j + y^{(i)}_j\hat{y}^{(i)}_j$ and the second to $\hat{y}^{(i)}_j \sum_{k \neq j} y^{(i)}_k$, so combining the $j$ term with the $\sum_{k \neq j}$ term to get $\sum_k$:

$$\frac{\partial J}{\partial z^{(i)[L]}_j} = -y^{(i)}_j + \hat{y}^{(i)}_j * \sum_k y^{(i)}_k$$

Since the probabilities of $y$ across the output nodes sum to 1, this reduces our equation to:

$$\frac{\partial J}{\partial z^{(i)[L]}_j} = \hat{y}^{(i)}_j - y^{(i)}_j$$
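If you want to convince yourself of this result numerically, here is a quick finite-difference check for a single example (a throwaway sketch, separate from the network code; the softmax helper is defined locally just for the check):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

z = np.random.randn(5)                    # logits for 5 classes
y = np.zeros(5)
y[2] = 1.0                                # one-hot label
cost = lambda z: -np.sum(y * np.log(softmax(z)))

eps = 1e-6
numeric = np.array([(cost(z + eps * np.eye(5)[j]) - cost(z - eps * np.eye(5)[j])) / (2 * eps)
                    for j in range(5)])
analytic = softmax(z) - y                 # the y_hat - y result derived above
print(np.allclose(numeric, analytic))     # True (up to floating point error)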

So wrapping up, having considered the equation for one neuron, we can generalise across the matrix $Z^{[L]}$ to get:

$$\frac{\partial J}{\partial Z^{[L]}} = \hat{Y} - Y$$

Code:

def softmax_backward(y_pred, y, w, b, x):
    m = y.shape[1]                       # number of training examples (one per column)
    dZ = y_pred - y                      # dJ/dZ = Y_hat - Y, as derived above
    dW = (1 / m) * dZ.dot(x.T)
    db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)
    dx = np.dot(w.T, dZ)                 # gradient passed back to the previous layer
    return dx, dW, db

This wraps up our discussion of convolutional neural networks. CNNs have revolutionised computer vision tasks, and are more interpretable than standard feedforward neural networks as we can visualise their activations as images (see start of post). We look at the activations in more detail in the notebook.

Next we will look at another specialised class of neural networks – recurrent neural networks, which are optimised for input sequences (e.g. sentences for NLP).