Convolutional Layers vs Fully Connected Layers

Deep Learning Fundamentals

What is really going on when you use a convolutional layer vs a fully connected layer?

Image by Author

The design of a neural network is quite a difficult thing to get your head around at first. It involves many decisions: the input and output sizes of each layer, where and when to apply batch normalization and dropout, which activation functions to use, and so on. In this article, I want to discuss what is really going on behind fully connected layers and convolutions, and how the output size of a convolutional layer can be calculated.

Introduction

Deep learning is a field of research that has skyrocketed in the past few years with the increase in computational power and advances in model architectures. Two kinds of networks you’ll often hear about when reading on deep learning are fully connected neural networks (FCNNs) and convolutional neural networks (CNNs). These two are the basis of deep learning architectures, and almost all other deep learning networks stem from them. In this article I’ll first explain how fully connected layers work, then convolutional layers, and finally I’ll go through an example of a CNN.

Fully Connected Layers (FC Layers)

Neural networks are compositions of non-linear functions, each computed by a neuron (or perceptron). In a fully connected layer, each neuron applies a linear transformation to the input vector through a weights matrix. A non-linear activation function f is then applied to the result.

Image by Author

Here we take the dot product between the weights matrix W and the input vector x. The bias term (w0) can be added inside the non-linear function: f(Wx + w0). I will ignore it in the rest of the article, as it doesn’t affect the output sizes and is just another weight.

If we take as an example a layer in an FC neural network with an input size of 9 and an output size of 4, the operation can be visualized as follows:

Image by Author

The activation function f wraps the dot product between the input of the layer and the weights matrix of that layer. Note that the columns in the weights matrix would all have different numbers and would be optimized as the model is trained.

The input is a 1×9 vector and the weights matrix is 9×4. By taking the dot product and applying the non-linear transformation with the activation function, we get the output vector (1×4).
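As a quick sanity check, the same layer can be sketched in PyTorch (the framework used later in this article); the tanh here is just a stand-in for the activation f:

```python
import torch
import torch.nn as nn

# A fully connected layer mapping a 9-dimensional input to 4 outputs.
# PyTorch stores the weights as a 4x9 matrix, the transpose of the 9x4
# matrix drawn above; the operation is the same dot product.
fc = nn.Linear(9, 4, bias=False)

x = torch.randn(1, 9)     # 1x9 input vector
out = torch.tanh(fc(x))   # activation f wrapping the dot product
print(out.shape)          # torch.Size([1, 4])
```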

One can also visualize this layer the following way:

Image by Author

The image above shows why we call these kinds of layers “fully connected”, or sometimes “densely connected”. All possible connections layer to layer are present, meaning every input of the input vector influences every output of the output vector. However, each individual weight affects only one output. Look at the lines between the nodes above: the orange lines represent the weights of the first neuron (or perceptron) of the layer. These weights contribute only to output A, and have no effect on outputs B, C, or D.

Convolutional Layers (Conv Layers)

Image by Author

A convolution is effectively a sliding dot product, where the kernel shifts along the input matrix, and we take the dot product between the two as if they were vectors. Below is the vector form of the convolution shown above. You can see why taking the dot product between the fields in orange outputs a scalar (1×4 • 4×1 = 1×1).

Image by Author
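The sliding dot product is easy to write out by hand. Here is a minimal sketch in plain Python, using a made-up 1-D signal and kernel (no padding, stride 1):

```python
# A 1-D convolution written explicitly as a sliding dot product:
# the kernel shifts along the input, and at each position we take
# the dot product between the kernel and the window it covers.
def conv1d(signal, kernel):
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

print(conv1d([1, 2, 3, 4, 5, 6], [1, 0, -1]))  # [-2, -2, -2, -2]
```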

Once again, we can visualize this convolutional layer as follows:

Image by Author

Convolutions are not densely connected; not all input nodes affect all output nodes. This sparse connectivity, together with weight sharing, means the number of weights per layer is a lot smaller, which helps enormously with high-dimensional inputs such as image data. These properties are what give CNNs their well-known ability to learn features in the data, such as shapes and textures in images.
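The difference in weight counts is easy to verify. Here is a sketch comparing a dense layer over a flattened 3×64×64 image with a small convolutional layer (the choice of 16 filters is arbitrary, just for illustration):

```python
import torch.nn as nn

# Dense layer: every one of the 3*64*64 inputs connects to every one
# of the 64*64 outputs, so the weights matrix alone is enormous.
fc = nn.Linear(3 * 64 * 64, 64 * 64, bias=False)

# Conv layer: 16 filters, each of size 3x4x4, shared across all positions.
conv = nn.Conv2d(3, 16, kernel_size=4, bias=False)

print(sum(p.numel() for p in fc.parameters()))    # 50331648
print(sum(p.numel() for p in conv.parameters()))  # 768
```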

Working with CNNs

In FC layers, the output size of the layer can be specified very simply by choosing the number of columns in the weights matrix. The same cannot be said for conv layers: a convolution has several hyperparameters (kernel size, stride, padding) that together determine the output size of the operation.

I strongly recommend you check out this link to Francesco’s explanation of convolutions. In it, he explains all variations of convolutions, such as convolutions with and without padding, strides, transposed convolutions, and more. It is by far the best, most visual interpretation I’ve ever seen, and I still refer back to it often.

Conv Output Size

To determine the output size of the convolution, the following equation can be applied:

O = ⌊(I + 2P − K) / S⌋ + 1

where O is the output size, I the input size, K the kernel size, P the padding, and S the stride.

The output size is equal to the input size, plus two times the padding, minus the kernel size, all divided by the stride, plus one. Most of the time we are dealing with square matrices, so this number will be the same for rows and columns. If the fraction does not result in an integer, we round down. I recommend trying to make sense of the equation: dividing by the stride is natural, since skipping over positions divides the number of outputs by that factor, and the padding is counted twice because it is added on both sides of the matrix.
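The formula translates directly into a couple of lines of Python (integer division performs the rounding down):

```python
def conv_output_size(in_size, kernel, stride=1, padding=0):
    # floor((I + 2P - K) / S) + 1
    return (in_size + 2 * padding - kernel) // stride + 1

print(conv_output_size(64, kernel=4, stride=2, padding=1))  # 32
print(conv_output_size(4, kernel=4, stride=1, padding=0))   # 1
```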

Transposed Conv Size

From the equation above, the output will always be equal to or smaller than the input unless we add a lot of padding. However, adding too much padding to increase the dimensionality would make learning very difficult, as the inputs to each layer would be very sparse. To combat this, transposed convolutions are used to increase the size of the input. Example applications are convolutional VAEs and GANs.

O = (I − 1) × S − 2P + K

where O is the output size, I the input size, K the kernel size, P the padding, and S the stride.

The above equation can be used to calculate the output size of a transposed convolutional layer.
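In code (ignoring the output-padding argument PyTorch also supports, which this article does not use):

```python
def conv_transpose_output_size(in_size, kernel, stride=1, padding=0):
    # (I - 1) * S - 2P + K
    return (in_size - 1) * stride - 2 * padding + kernel

print(conv_transpose_output_size(1, kernel=4, stride=1, padding=0))  # 4
print(conv_transpose_output_size(4, kernel=4, stride=2, padding=1))  # 8
```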

With these two equations you are now ready to design a convolutional neural network. Let’s take a look at the design of a GAN and understand it using the equations above.

GAN Example

Here I’ll go through the architecture of a Generative Adversarial Network that uses convolutional and transposed convolutional layers. You’ll see why the equations above are so important and why you cannot design a CNN without them.

Let’s first take a look at the discriminator:

Image by Author

The input to the discriminator is a 3×64×64 image, and the output is a single 1×1 scalar (the real/fake decision). We are heavily reducing the dimensionality, so standard convolutional layers are ideal for this application.

Note that between each convolutional layer (denoted as Conv2d in PyTorch) the activation function is specified (in this case LeakyReLU), and batch normalization is applied.

Conv Layer in Discriminator

nn.Conv2d(nc, ndf, kernel_size=4, stride=2, padding=1, bias=False)

The first convolutional layer applies “ndf” convolutions across the 3 channels of the input. Image data often has 3 channels, one each for red, green, and blue (RGB images). By applying a number of convolutions per layer, we can increase the channel dimensionality.

The first convolution applied has a kernel size of 4, stride of 2, and a padding of 1. Plugging this into the equation gives:

O = ⌊(64 + 2 × 1 − 4) / 2⌋ + 1 = ⌊62 / 2⌋ + 1 = 32

So the output is a 32×32 image, as mentioned in the code: we have halved the size of the input. The next 3 layers are identical, so the output sizes are 16×16, then 8×8, then 4×4. The final layer uses a kernel size of 4, a stride of 1, and a padding of 0; plugging into the formula gives an output size of 1×1.
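These sizes can be verified by stacking just the convolutions (a minimal sketch with the activation and batch normalization layers omitted; nc = 3 and ndf = 64 are assumed values, as in the standard DCGAN setup this discriminator follows):

```python
import torch
import torch.nn as nn

nc, ndf = 3, 64  # assumed values: 3 input channels, 64 base filters

disc = nn.Sequential(
    nn.Conv2d(nc,      ndf,     4, 2, 1, bias=False),  # 64x64 -> 32x32
    nn.Conv2d(ndf,     ndf * 2, 4, 2, 1, bias=False),  # 32x32 -> 16x16
    nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),  # 16x16 -> 8x8
    nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),  # 8x8   -> 4x4
    nn.Conv2d(ndf * 8, 1,       4, 1, 0, bias=False),  # 4x4   -> 1x1
)

x = torch.randn(1, nc, 64, 64)   # one 3x64x64 image
print(disc(x).shape)             # torch.Size([1, 1, 1, 1])
```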

Transposed Conv layer in Generator

nn.ConvTranspose2d(nz, ngf * 8, kernel_size=4, stride=1, padding=0, bias=False)

Image by Author

Let’s look at the first layer in the generator. The generator takes as input an nz×1×1 vector (with nz = 100), and the desired output is a 3×64×64 image. We are increasing the dimensionality, so we want to use transposed convolutions.

The first convolution uses a kernel size of 4, a stride of 1 and a padding of 0. Let’s plug it in the transposed convolution equation:

O = (1 − 1) × 1 − 2 × 0 + 4 = 4

The output size of the transposed convolution is 4×4, as indicated in the code. The next 4 transposed convolutional layers are identical, with a kernel size of 4, a stride of 2, and a padding of 1. Each doubles the size of its input: 4×4 becomes 8×8, then 16×16, 32×32, and finally 64×64.
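The generator’s shapes can be checked the same way (again a sketch of only the transposed convolutions, with nz = 100, ngf = 64, and nc = 3 as assumed values):

```python
import torch
import torch.nn as nn

nz, ngf, nc = 100, 64, 3  # assumed: latent size, base filters, output channels

gen = nn.Sequential(
    nn.ConvTranspose2d(nz,      ngf * 8, 4, 1, 0, bias=False),  # 1x1   -> 4x4
    nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),  # 4x4   -> 8x8
    nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),  # 8x8   -> 16x16
    nn.ConvTranspose2d(ngf * 2, ngf,     4, 2, 1, bias=False),  # 16x16 -> 32x32
    nn.ConvTranspose2d(ngf,     nc,      4, 2, 1, bias=False),  # 32x32 -> 64x64
)

z = torch.randn(1, nz, 1, 1)  # latent vector
print(gen(z).shape)           # torch.Size([1, 3, 64, 64])
```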

Conclusion

In this article, I explained how fully connected layers and convolutional layers compute their outputs, and how to calculate the output sizes of convolutional and transposed convolutional layers. Without understanding these equations, you cannot design your own CNN.

Support me 👏

Hopefully this helped you; if you enjoyed it, you can follow me!

You can also become a medium member using my referral link, get access to all my articles and more: https://diegounzuetaruedas.medium.com/membership

Other articles you might enjoy

Differentiable Generator Networks: an Introduction

Fourier Transforms: An Intuitive Visualisation