Backpropagation in Fully Convolutional Networks (FCNs)
Backpropagation is one of the most important phases in the training of neural networks. Its goal is to build up the network's knowledge, understood as the ability to respond properly to future inputs. The terms "knowledge" and "properly" involve two aspects: (i) the first is given by experience, that is, by the information (features) that the network learns to extract from the samples of the dataset; (ii) the second by the error measure, also called the loss. In supervised learning, each sample is labelled with a class (classification), with one or more values (regression), or both. Letting y_actual be the label assigned to the sample by the supervisor and y_pred the network prediction, the backpropagation phase ensures that the distance (error) between the two becomes acceptable. Both the features and the error depend on the network's weights, and it is these weights, modified during backpropagation, that constitute the network's learned knowledge.
Before leaving you to the article, I would like to add a few more notes. Backpropagation has two phases. The first is the calculation of the gradient, that is, of the derivative of the loss function with respect to each of the network's weights. The second is the update of those weights (delta rule), which can be performed with various optimization algorithms, to name a few: Momentum, Adam, etc. In this article, I will concentrate only on the first phase, leaving a deeper look at the second for a later moment. Also, I have not found much material on backpropagation applied to convolutional networks, so every calculation and interpretation is the result of personal study. In case of errors or misinterpretations, I would be happy to discuss them. Thank you and happy reading.
Notation
1. Kernel and convolution
One of the questions we might ask is: why use a Convolutional Neural Network instead of an MLP to analyze images? Let's take an example. An image is a sequence of values (pixel intensities) arranged in a matrix. To use an MLP we would have to transform the matrix into a vector and present it to the input layer of the network. In this regard, consider an image X of size (224, 224, 3), where 224 indicates the height and width while 3 is the depth in the Red, Green, Blue colour scale. By flattening it we get a vector of size (150528, 1), as shown in Fig.4.
Fig.4: Representation of a Multi-Layer Perceptron (MLP) with an image X as input. (Photo by source link, edited with permission by the author)
There are mainly two problems with this approach. The first is that images are spatially ordered data: the order in which the pixels are arranged in the matrix is a great help in feature extraction, so it is information we cannot afford to lose. The second is that with an MLP the number of parameters becomes very high. Taking our example, there would be 150,528 neurons in the input layer alone. Considering 100,000 neurons in the second layer, our network would have 150,528 × 100,000 = 15,052,800,000 parameters between the first and second layer alone; imagine the whole network. (Note: the more the number of parameters increases, the more time training requires, the more difficult it is to find the optimum, and the larger the model becomes in terms of memory.)
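As a quick sanity check of this arithmetic, a few lines of Python reproduce the numbers above (the 100,000-neuron second layer is just the hypothetical size used in the example, and biases are ignored):

```python
# Parameter count of the first fully-connected layer on a 224x224x3 image
height, width, channels = 224, 224, 3
input_size = height * width * channels        # 150,528 input neurons after flattening
hidden_neurons = 100_000                      # hypothetical second layer from the example

params_first_to_second = input_size * hidden_neurons
print(input_size)                # 150528
print(params_first_to_second)    # 15052800000 (~15 billion weights, biases excluded)
```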
Fig. 5: Convolution between the image X of dimensions (4,4,1) and the Kernel W of dimensions (2,2) with stride S = [1,1]. The output feature map is denoted by Z and has dimensions (2,2) (Source: Image by me)
The introduction of the Kernel solves these problems. A Kernel W is a square matrix, usually of dimension 2×2, 3×3 or 5×5, that slides with a certain stride S along the width M and the height N of the image X, performing a weighted sum between the values w (weights) of the Kernel and the overlapping region of the image. This operation, called convolution, preserves the spatial order of the pixels and allows the weights w to be shared across the pixels of the image, reducing the number of parameters. An example of convolution is shown in Fig.5. The output Z is a feature map, that is, a new image in which specific characteristics of X are emphasized. Assuming the Kernel W highlights the edges, the convolution operation produces a feature map like this:
Fig.6: Convolution operation to emphasize the edges of the image X. The factor 1/16 is introduced to reduce noise in the feature map. (Photo by source link, edited with permission by the author)
As can be seen in both Fig.5 and Fig.6, Z does not have the same dimensions as the starting image. This downsampling effect is caused by the Kernel size and the stride. Assuming dimensions (Mx, Nx) for X, (Mw, Nw) for W and stride S, the dimensions (Mz, Nz) of the feature map Z are computed as:
Formula (1): Mz = ⌊(Mx − Mw) / S⌋ + 1 ,  Nz = ⌊(Nx − Nw) / S⌋ + 1
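To make the convolution operation and Formula (1) concrete, here is a minimal NumPy sketch of a single-channel convolution without padding (the function name conv2d and the toy inputs are mine, not taken from any library):

```python
import numpy as np

def conv2d(X, W, stride=(1, 1)):
    """Convolution (sliding weighted sum, no padding) of image X with kernel W."""
    Mx, Nx = X.shape
    Mw, Nw = W.shape
    Sm, Sn = stride
    # Formula (1): dimensions of the output feature map
    Mz = (Mx - Mw) // Sm + 1
    Nz = (Nx - Nw) // Sn + 1
    Z = np.zeros((Mz, Nz))
    for i in range(Mz):
        for j in range(Nz):
            # weighted sum between the Kernel and the overlapping region of X
            region = X[i * Sm:i * Sm + Mw, j * Sn:j * Sn + Nw]
            Z[i, j] = np.sum(region * W)
    return Z

X = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 image
W = np.ones((2, 2)) / 4.0                      # 2x2 averaging Kernel
print(conv2d(X, W, stride=(1, 1)).shape)       # (3, 3), as predicted by Formula (1)
```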
2. Feed-forward
Fig.7: Representation of a Convolutional Neural Network with two convolutional layers (Source: Image by me)
Let's consider a network with two convolutional layers, both with a 2×2 Kernel W and stride 1 along both the height and the width, and two neurons in the output layer. As activation function we do not choose one in particular and generically denote it by g. The input of the network is a grayscale image (single channel) of size 4×4×1, Fig.7.
Let's examine the forward phase. One preliminary note: to build the computational graph, for each pixel of a feature map it is important to remember which pixels were used during the convolution operation. What we need are their positions, which we save in a set P. (A minimal NumPy sketch of this forward pass is given after the steps below.)
- Convolution Input X and Layer 1
(Source: Image by me)
- Convolution output Layer 1 and Layer 2
(Source: Image by me)
- Fully-Connected
(Source: Image by me)
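Putting the three steps above together, here is a minimal sketch of the forward pass of the network in Fig.7 (it reuses the conv2d function from the earlier sketch; the sigmoid activation and the random weights are placeholders of mine, since the article keeps g generic):

```python
import numpy as np

def g(x):
    # generic activation; a sigmoid is used here purely as an example
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

X  = rng.standard_normal((4, 4))    # grayscale input image, 4x4x1
W1 = rng.standard_normal((2, 2))    # Kernel of Layer 1
W2 = rng.standard_normal((2, 2))    # Kernel of Layer 2
W3 = rng.standard_normal((2, 4))    # fully-connected weights: 2 neurons x 4 activations

# Layer 1: convolution + activation -> 3x3 feature map
Z1 = conv2d(X, W1);  A1 = g(Z1)
# Layer 2: convolution + activation -> 2x2 feature map
Z2 = conv2d(A1, W2); A2 = g(Z2)
# Layer 3: fully connected on the flattened 2x2 activations -> 2 output neurons
Z3 = W3 @ A2.reshape(-1); A3 = g(Z3)

print(Z1.shape, Z2.shape, A3.shape)  # (3, 3) (2, 2) (2,)
```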
3. Computational graph
We group the information obtained in the forward phase into a computational graph (Fig.8). This is a representation of our Convolutional Neural Network as a graph in which each node corresponds to a single forward step. In particular, we can see that each convolution node z has as inputs the nodes whose indices are in P.
Fig.8: Computational-graph representation of the network in Fig.7 (Source: Image by me)
4. Backpropagation
Formula (2): ∇w L = [ ∂L/∂W^(3) , ∂L/∂W^(2) , ∂L/∂W^(1) ]
Let's take L as the Loss function. The computational graph allows us to easily define the gradient vector ∇wL, which will then be used by the delta rule to update the network weights. Formula (2) represents the gradient vector of the network in Fig.7. It consists of three components, themselves vectors, relating to the weights of Layer 3 and to the Kernels of Layer 2 and Layer 1 respectively. In the following, I analyze each of them, using the chain rule to compute the partial derivatives.
- ∂L /∂W^(3) calculation
Formula (3)
Formula (3) is the gradient vector of the weights W of Layer 3. To calculate its elements, a single dependence [Appendix 1.1] must be resolved. As an example, the calculation of the derivative of the Loss with respect to the weight w1 of Layer 3 is shown in Formula (4):
Formula (4)
Denoting by F (fully) the number of neurons in the last layer, for each element n of the vector in Formula (3) we obtain:
Formula (5): ∂L/∂wn^(3) = ∂L/∂af^(3) · ∂af^(3)/∂zf^(3) · ∂zf^(3)/∂wn^(3) , where f ∈ {1, …, F} is the output neuron to which the weight wn is connected
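To illustrate this single-dependence step, the sketch below continues the forward-pass code above, assuming a squared-error loss L = ½ Σf (af − yf)² and the example sigmoid as g (both are assumptions of mine, not choices made in the article):

```python
y = np.array([1.0, 0.0])        # hypothetical target for the two output neurons

def g_prime(z):
    s = g(z)
    return s * (1.0 - s)        # derivative of the example sigmoid

# Single dependence: dL/dw = dL/da_f * da_f/dz_f * dz_f/dw
dL_dA3 = A3 - y                            # dL/da_f for the squared-error loss
dL_dZ3 = dL_dA3 * g_prime(Z3)              # dL/dz_f
dL_dW3 = np.outer(dL_dZ3, A2.reshape(-1))  # dz_f/dw is the Layer-2 activation feeding that weight
print(dL_dW3.shape)                        # (2, 4): one gradient per weight of Layer 3
```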
- ∂L /∂W^(2) calculation
Formula (6)
Formula (6) is the gradient vector of the Kernel W of Layer 2. To calculate its elements it is necessary to resolve first a single dependence due, as analyzed above, to the gradient of Layer 3, and then a multiple dependence [Appendix 1.2] for each of the neurons f. Looking at Fig.8, we can see that (i) each weight w of Layer 2 appears in each of the sums zi^(2), due to the shared-weights characteristic of convolution, and (ii) each neuron f is connected to every activation output of Layer 2, since Layer 3 is fully connected.
Fig.9: Calculation of downstream gradients for each Layer 3 zf node (Source: Image by me)
As an example, we consider the gradient computation with respect to the weight w1 of Layer 2. For ease of discussion, for each activation node of Layer 2 we compute the upstream gradient, obtained by summing the downstream gradients [Appendix 2] of the respective zf nodes of Layer 3. The calculation of the downstream gradients is shown in Fig.9; each pair is then summed to form the upstream gradient of each activation, formalized in Formula (10).
Formula (10)
At this point, the gradient of the Loss with respect to w1 of Layer 2 is given by Formula (11), in which each of the upstream gradients ∂L /∂ai is multiplied by the derivatives of the respective nodes ai and zi. After that, the contributions are summed, as per point (i) described above.
Formula (11)
Generalizing from what we have seen, for each node ai of Layer 2 the upstream gradient is given by Formula (12). The latter is Formula (5) in which, instead of differentiating zf of Layer 3 with respect to the weight w, we differentiate with respect to the activation ai and then sum the contributions.
Formula (12): ∂L/∂ai^(2) = Σf=1..F ( ∂L/∂af^(3) · ∂af^(3)/∂zf^(3) · ∂zf^(3)/∂ai^(2) )
For each element n of the vector in Formula (6), instead, we get:
Formula (13): ∂L/∂wn^(2) = Σi ( ∂L/∂ai^(2) · ∂ai^(2)/∂zi^(2) · ∂zi^(2)/∂wn^(2) )
For completeness, combining Formula (12) and Formula (13) we get:
∂L/∂wn^(2) = Σi [ Σf=1..F ( ∂L/∂af^(3) · ∂af^(3)/∂zf^(3) · ∂zf^(3)/∂ai^(2) ) ] · ∂ai^(2)/∂zi^(2) · ∂zi^(2)/∂wn^(2)
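Continuing the same sketch, the upstream gradients of Formula (12) and the shared-weight sum of Formula (13) can be written out explicitly for Layer 2 (still under the sigmoid and squared-error assumptions above):

```python
# Formula (12): upstream gradient of each Layer-2 activation a_i,
# summing over the F = 2 output neurons (dz_f/da_i is the corresponding Layer-3 weight)
dL_dA2 = (W3.T @ dL_dZ3).reshape(2, 2)

# Through the activation node: dL/dz_i = dL/da_i * g'(z_i)
dL_dZ2 = dL_dA2 * g_prime(Z2)

# Formula (13): each weight of the 2x2 Kernel W2 appears in every z_i (shared weights),
# so its gradient is the sum of the contributions over all positions i
dL_dW2 = np.zeros_like(W2)
for i in range(2):
    for j in range(2):
        dL_dW2 += dL_dZ2[i, j] * A1[i:i + 2, j:j + 2]  # dz_i/dw is the overlapped patch of A1
print(dL_dW2)
```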
- ∂L /∂W^(1) calculation
Formula (14)
Formula (14) is the gradient vector of the Kernel W of Layer 1. Compared to the previous case, computing the partial derivatives requires resolving several steps of multiple dependencies, due to the fully connected and convolutional layers. Looking at Fig.8, it can be seen that (i) each weight of W^(1) appears in each of the weighted sums zj of Layer 1 (shared weights).
As an example, we consider the calculation of the gradient with respect to the weight w1 of Layer 1. Again, for each activation node aj of Layer 1 we calculate the upstream gradient, formalized in Formula (15). To compute it we use the information we already have: in particular, the upstream gradients in Formula (10), which, once we obtain the ∂L /∂zi^(2), allow us to calculate the individual downstream gradients of Fig.10.
Fig.10: Calculation of downstream gradients for each node zi of Layer 2 (Source: Image by me)
Formula (15)
At this point, the gradient of the Loss with respect to w1 of Layer 1 is given by Formula (16), in which each of the upstream gradients ∂L /∂aj, formalized in Formula (15), is multiplied by the derivatives of the respective nodes aj and zj. After that, the contributions are summed, as per point (i) described above.
Formula (16)
Generalizing from what we have seen, for each node aj of Layer 1 the upstream gradient is given by Formula (13) in which, instead of differentiating zi of Layer 2 with respect to the weight w, we differentiate it with respect to aj of Layer 1:
Formula (17): ∂L/∂aj^(1) = Σi ( ∂L/∂ai^(2) · ∂ai^(2)/∂zi^(2) · ∂zi^(2)/∂aj^(1) )
For each element n of the vector in Formula (14), instead, we get:
Formula (18): ∂L/∂wn^(1) = Σj ( ∂L/∂aj^(1) · ∂aj^(1)/∂zj^(1) · ∂zj^(1)/∂wn^(1) )
For completeness, combining Formula (12), (17) and (18) we get:
∂L/∂wn^(1) = Σj [ Σi ( ∂L/∂ai^(2) · ∂ai^(2)/∂zi^(2) · ∂zi^(2)/∂aj^(1) ) ] · ∂aj^(1)/∂zj^(1) · ∂zj^(1)/∂wn^(1) , with ∂L/∂ai^(2) given by Formula (12)
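Continuing once more, the same two steps applied to Layer 1 look like this (all variables come from the previous sketches):

```python
# Formula (17): upstream gradient of each Layer-1 activation a_j,
# scattering dL/dz_i back through the Kernel W2 to every a_j that entered z_i
dL_dA1 = np.zeros_like(A1)
for i in range(2):
    for j in range(2):
        dL_dA1[i:i + 2, j:j + 2] += dL_dZ2[i, j] * W2

# Through the activation node of Layer 1
dL_dZ1 = dL_dA1 * g_prime(Z1)

# Formula (18): shared weights of W1, summing the contributions over all 3x3 positions
dL_dW1 = np.zeros_like(W1)
for i in range(3):
    for j in range(3):
        dL_dW1 += dL_dZ1[i, j] * X[i:i + 2, j:j + 2]
print(dL_dW1)
```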
5. Conclusion
In this article, we have explored the backpropagation phase, focusing our analysis on the computational graph of the toy network in Fig.7. In general, to compute the gradient vector of the l-th convolutional layer it will be necessary (a generic sketch in code follows the list below):
- to calculate the upstream gradient for each activation output of layer l
- to sum the contributions of the derivative of z with respect to w in layer l (due to the shared weights)
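As a closing sketch, the two steps above can be packed into one generic (and intentionally simplified) function for a single-channel, stride-1 convolutional layer without bias; the name and signature are mine:

```python
import numpy as np

def conv_layer_backward(dL_dA, Z, A_prev, W, g_prime):
    """Backward pass of one single-channel, stride-1 convolutional layer.

    dL_dA   : upstream gradient w.r.t. this layer's activations, shape (Mz, Nz)
    Z       : this layer's pre-activation feature map, shape (Mz, Nz)
    A_prev  : the layer's input (previous activations or the image), shape (Mx, Nx)
    W       : the Kernel, shape (Mw, Nw)
    g_prime : derivative of the activation function g
    """
    Mz, Nz = Z.shape
    Mw, Nw = W.shape
    dL_dZ = dL_dA * g_prime(Z)                  # step through the activation node
    dL_dW = np.zeros_like(W)
    dL_dA_prev = np.zeros_like(A_prev)
    for i in range(Mz):
        for j in range(Nz):
            patch = A_prev[i:i + Mw, j:j + Nw]
            dL_dW += dL_dZ[i, j] * patch                        # shared weights: sum contributions
            dL_dA_prev[i:i + Mw, j:j + Nw] += dL_dZ[i, j] * W   # upstream gradient for layer l-1
    return dL_dW, dL_dA_prev
```

In the toy network of Fig.7 this function would be called once for Layer 2, with the upstream gradient coming from the fully connected layer, and once for Layer 1, feeding it the dL_dA_prev returned by the first call.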
Modern networks are much more complex. Some differences could be in the number of channels of each feature-map block (imagine several feature maps stacked one behind the other like a cube, each of them a channel), with a consequent increase in the depth of the Kernel W; or we could have more than one Kernel per layer; or Batch Normalization layers, to which backpropagation must also be applied; or a bias term for each layer, with its own backpropagated gradient. So the one presented in this article is a simple, but hopefully explanatory, example of backpropagation.