
This tutorial’s code is available on GitHub, and the full implementation is also available on Google Colab.


Introduction

Yann LeCun and Yoshua Bengio introduced convolutional neural networks, also known as convolutional networks or CNNs, in 1995 [1]. A CNN is a particular kind of multi-layer neural network [2] designed to process data with an apparent, grid-like topology. It is built on a mathematical operation called convolution. Most neural network layers use general matrix multiplication; in contrast, CNNs use convolution in place of matrix multiplication in at least one layer. A convolution is a specialized kind of linear operation.

Convolutional neural networks (CNNs) are among the most popular deep learning architectures. Their applications include image and video recognition, image analysis, recommendation systems, natural language processing, computing interfaces, financial time-series, and several others [3].

Biological findings inspired the development of neural networks, which follow this standard flow:

Input → Weights → Logic function → Output

Essential facts about CNNs:

CNNs are neurobiologically-driven by the findings of locally sensitive and orientation-selective nerve cells in the visual cortex.
They are a multi-layer neural network.
They implicitly extract relevant features.
They are a feed-forward network that can extract topological features from images.
They recognize visual patterns directly from pixel images with minimal preprocessing.
They are astonishingly powerful because they can recognize patterns that have extreme variability, e.g., handwriting.
CNNs are trained with a version of the backpropagation algorithm.
CNNs are modeled on the neuronal cells in the visual cortex, each of which watches for particular features.

Why are CNNs Required?

CNNs have several advantages for image recognition and other applications:

Detection using CNN is robust to distortions like change in shape due to camera lens, different lighting conditions, different poses, the presence of partial occlusions, horizontal and vertical shifts, and others.
It requires less memory for processing and execution.
It is straightforward and suitable for training. By using CNNs, we can dramatically reduce the number of parameters. Therefore, the training time is also proportionately reduced.

Types of Convolutional Neural Networks (CNNs)

These are some of the different types of CNNs [4]:

1D CNN → In this case, the kernel moves in one direction. The input and output data of a 1D CNN are two-dimensional. 1D CNNs are mostly used on time-series data.
2D CNN → In a 2D CNN, the kernel moves in two directions. The input and output data of a 2D CNN are three-dimensional. We usually use this on image data problems.
3D CNN → Here, the kernel moves in three directions. The input and output data of a 3D CNN are four-dimensional. Engineers use 3D CNNs on 3D images like DICOM images of MRIs, CT scans, and other complex applications.
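To make the dimensionality claims above concrete, here is a minimal sketch of what a single input sample might look like for each type. The array sizes are arbitrary assumptions chosen for illustration, and the batch dimension is left out.

```python
import numpy as np

# Hypothetical per-sample inputs (batch dimension omitted); sizes are arbitrary.
series = np.zeros((100, 1))         # 1D CNN: (time steps, channels)
image = np.zeros((32, 32, 3))       # 2D CNN: (height, width, channels)
volume = np.zeros((16, 64, 64, 1))  # 3D CNN: (depth, height, width, channels)

for name, sample in [("1D", series), ("2D", image), ("3D", volume)]:
    print(name, "CNN sample has", sample.ndim, "dimensions:", sample.shape)
```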

Network Architecture

A CNN architecture is built from a stack of different layers that convert the input volume into an output volume through a differentiable function. A few different types of layers are commonly used.

Below is the stack of different layers in CNNs:

Convolutional layers
Pooling layer
Fully connected layer

In summary, an example of the complete set of layers in a CNN:

Figure 1: An example of a full convolutional neural network (CNN) architecture.

The complete architecture of CNNs:

Figure 2: Complete overview of a CNN architecture.

Figure 3: How does a convolutional neural network behave | Source: Breaking it down: A Q&A on Machine Learning [5]

Image processing is a process to perform operations on an image to get an enhanced image or extract some critical information from it. There are three different ways to perform image processing:

Histogram processing.
Transformation function.
Convolution.

Convolution

A convolution is a mathematical calculation on two functions named f and g that gives a third function (f * g). This third function reveals how the shape of one is modified by the other. Its mathematical equation is as follows:

Figure 4: Equation of a convolution.
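Since the figure is not reproduced here, the following is a sketch of the standard textbook definitions, which the equation most likely corresponds to. The symbols t, τ, I, K, i, j, m, and n are assumptions introduced for illustration: the first line is the continuous convolution of f and g, and the second is the discrete two-dimensional form used on an image I with a mask K.

```latex
(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau

(I * K)(i, j) = \sum_{m} \sum_{n} I(i - m,\, j - n)\, K(m, n)
```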

It is essential to understand the concept of a mask or filter before the concept of convolution.

Figure 5: Convolution equation.

Mask or Filter

A mask is a small two-dimensional matrix whose values are called weights. It is also known as a filter or kernel. Its dimensions should be odd numbers (for example, 3×3 or 5×5); otherwise, it is difficult to find the center of the mask.

Figure 6: Mask of an array.

Below is a code example that masks one element of an array using NumPy's masked-array module:

import numpy as np
import numpy.ma as ma
original_array = np.array([1, 2, 3, -1, 5])
original_array

Figure 7: Original array.

Create a mask of the original array:

masked = ma.masked_array(original_array, mask=[0, 0, 0, 1, 0])
masked

Figure 8: Mask of the original array.

Why are Convolutions Important in CNNs?

The convolution operation in CNNs is crucial because it can transform images for tasks such as:

Blurring
Sharpening
Edge detection
Noise reduction

How is a Convolution Performed?

These are the steps to perform a convolution:

Flip the mask horizontally and vertically only once.
Slide the mask onto the image.
Multiply the corresponding elements, then add the products.
Repeat all the above steps until all values of an image have been calculated [8].
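The steps above can be sketched in NumPy as follows. This is a minimal illustration rather than an optimized implementation; the function name `convolve2d` and the choice of zero padding at the borders are assumptions made for this example.

```python
import numpy as np

def convolve2d(image, mask):
    """Zero-padded 2D convolution returning an output the same size as `image`."""
    flipped = mask[::-1, ::-1]  # step 1: flip the mask vertically and horizontally
    pad_h, pad_w = mask.shape[0] // 2, mask.shape[1] // 2
    padded = np.pad(image, ((pad_h, pad_h), (pad_w, pad_w)))  # zero border
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):       # step 2: slide the mask over the image
        for j in range(image.shape[1]):
            window = padded[i:i + mask.shape[0], j:j + mask.shape[1]]
            out[i, j] = np.sum(window * flipped)  # step 3: multiply and add
    return out

# Hypothetical 3x3 image and a mask that simply copies the center pixel.
demo_image = np.arange(9.0).reshape(3, 3)
identity_mask = np.array([[0, 0, 0],
                          [0, 1, 0],
                          [0, 0, 0]])
print(convolve2d(demo_image, identity_mask))
```

With the identity mask, the output equals the input, which is a quick sanity check that the sliding and centering logic is correct.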

Figure 9: Convolved feature or activation map or feature map.

Following the steps above:

Below mask:

Figure 10: Mask array.

Flip → Horizontally

Figure 11: Flip mask horizontally.

Flip → Vertically

Figure 12: Flip mask vertically.

Let’s take the dimension of an image like below:

Figure 13: The dimension of an image.

Now, to calculate the convolution, follow the steps below:

Place the center of the mask on each element of the image.
Multiply the corresponding elements and add the products.
Finally, paste the result onto the image’s element on which the mask’s center is placed.

Figure 14: Mask on the image.

From figure 14:

The green box is the mask, and the green values inside it are the mask's values.
The blue box and its values belong to the image.

Now, calculate the first pixel of the image ↓

px1 = (5 * 2) + (4 * 4) + (2 * 8) + (1 * 10)

px1 = 10 + 16 + 16 + 10

px1 = 52

The result of the 1st pixel of the image is 52. Therefore, based on the result, we follow the following steps:

Place the value 52 in the original image at the first index.
Repeat this step for each pixel of the image.
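As a quick check of the first-pixel calculation, the sketch below reproduces the value 52. The mask and image values are assumptions chosen to be consistent with the arithmetic above, since the original figures are not reproduced here.

```python
import numpy as np

# Assumed values, consistent with the worked arithmetic (figures not available).
mask = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])
image = np.array([[2, 4, 6],
                  [8, 10, 12],
                  [14, 16, 18]])

flipped = mask[::-1, ::-1]  # flip vertically and horizontally
padded = np.pad(image, 1)   # zero border so the mask center can sit on image[0, 0]

# Overlay the flipped mask on the top-left 3x3 window, multiply, and add.
px1 = int(np.sum(flipped * padded[0:3, 0:3]))
print(px1)  # 52
```

Only four products are non-zero because the rest of the mask overlaps the zero border.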

Convolutional Layers

A CNN is a neural network with some convolutional layers and some other layers. A convolutional layer has several filters that perform the convolution operation. Convolutional layers are applied to two-dimensional inputs and are well known for their strong performance on image classification. They are based on the discrete convolution of a small kernel k with a two-dimensional input, and this input can be the output of another convolutional layer. The convolutional layer is the core building block of a CNN [9].

Figure 15: Convolutional layer with filter.

Figure 16: Convolutional layer.

Convolution shares the same parameters across all spatial locations; however, traditional matrix multiplication does not share any parameters.
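To illustrate what parameter sharing buys, the sketch below compares parameter counts for a convolutional layer and a fully connected layer that map a 32×32×3 input to a 32×32×32 output. The helper names are hypothetical, and each count includes one bias per output unit or filter.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights shared across all positions, plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(n_inputs, n_outputs):
    """One weight per input-output pair, plus one bias per output."""
    return (n_inputs + 1) * n_outputs

conv_count = conv2d_params(3, 3, 3, 32)
dense_count = dense_params(32 * 32 * 3, 32 * 32 * 32)

print("Conv2D parameters:", conv_count)   # 896
print("Dense parameters: ", dense_count)  # 100696064
```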

Figure 17: Same parameter sharing by convolution.

Building a convolution layer in Keras:

from keras.models import Sequential
from keras.layers import Conv2D
model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=(32, 32, 3), padding='same', activation='relu'))

Explanation from the code implementation above:

The output will have 32 feature maps.
The kernel size is going to be 3×3.
The input shape is 32×32 with three channels.
padding = same. The input is padded so that, with a stride of 1, the output has the same height and width as the input.
Activation specifies the activation function (here, ReLU).

Next, build a convolutional layer with different parameter values as below:

model.add(Conv2D(32, (3, 3), activation='relu', padding='valid'))

So, from the above code of convolutional layer:

Kernel = 3×3
padding=valid: No padding is added, so the output is smaller than the input; with a 3×3 kernel and a stride of 1, it shrinks by two pixels in each dimension [10].
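The effect of the two padding modes on output size can be sketched with a small helper. The function name is hypothetical; the formulas follow the usual convention for 'same' and 'valid' padding and apply to one spatial dimension at a time.

```python
import math

def conv_output_size(input_size, kernel_size, stride=1, padding="valid"):
    """Output length along one spatial dimension for a convolution."""
    if padding == "same":
        # Input is padded so that only the stride shrinks the output.
        return math.ceil(input_size / stride)
    if padding == "valid":
        # No padding: the kernel must fit entirely inside the input.
        return (input_size - kernel_size) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")

print(conv_output_size(32, 3, padding="same"))   # 32
print(conv_output_size(32, 3, padding="valid"))  # 30
```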

Pooling Layers/Sub Sampling Layer

Fundamentally, the pooling layer is used to reduce the spatial dimensions of the feature maps. It summarizes the responses of the multiple filters that detect edges, corners, eyes, noses, and other features in the image. Its function is to reduce the spatial size of the representation and, with it, the number of parameters and computations in the network. There are two common ways to perform pooling:

Max Pooling: It returns the maximum value within a rectangular neighborhood.
Average Pooling: It returns the average value of a rectangular neighborhood.

The most commonly used pooling types are max pooling and average pooling. Reducing the spatial size leaves fewer pixels, and therefore fewer features or parameters, for further computations.

Hence, pooling layers serve two significant purposes:

Continuous reduction of the feature map’s spatial size as the network moves from one convolution layer to the next, thus reducing the number of parameters.
Progressively identifying essential features while discarding less relevant detail (this is more true of max pooling than of average pooling).

Figure 18: Input and output matrix in pooling layer.

The above picture shows max pooling with a 2×2 filter and a stride of 2.

Below is a depiction of max pooling and average pooling:

Figure 19: Max pooling and average pooling.
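Both pooling variants can be sketched in a few lines of NumPy. The function name `pool2d` and the 4×4 feature map are assumptions made up for illustration; the window is 2×2 with a stride of 2.

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Apply max or average pooling to a 2D array."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size,
                       j * stride:j * stride + size]
            out[i, j] = reduce_fn(window)
    return out

# Hypothetical 4x4 feature map.
feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 0],
                        [1, 8, 3, 4]])

max_pooled = pool2d(feature_map, mode="max")  # [[6, 4], [8, 9]]
avg_pooled = pool2d(feature_map, mode="avg")  # [[3.75, 2.25], [4.5, 4.0]]
print(max_pooled)
print(avg_pooled)
```

In both cases, the 4×4 input is reduced to a 2×2 output, one value per window.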

Implement a max pooling layer in Keras as below:

model.add(MaxPooling2D(pool_size=(2, 2)))

Here, the pool size is 2×2.

Subsampling pixels will not change the object, so pooling can subsample the pixels to make the image smaller.

Figure 20: Subsampling by pooling.

Stride

Stride is a parameter of a convolutional or pooling layer that controls how far the filter moves across an image or video frame at each step. It works in conjunction with padding. For example, if the stride is set to 1, the filter moves one pixel (or unit) at a time. Similarly, if the stride is set to 2, it moves two pixels at a time.

Essentially, the stride is the number of pixels a convolutional filter shifts, like a sliding window, after computing the weighted sum of all the pixels it just covered. That weighted sum becomes one pixel in the feature map passed to the next layer. The next weighted sum is computed from a new collection of pixels and forms the next pixel in that feature map.

Below, please find an animated presentation of a stride:

The stride of 1:

Figure 21: The stride of 1.

The stride of 2:

Figure 22: The stride of 2.

The animation of stride in Figure 22 simply explains that:

Stride in a convolutional neural network determines how many steps are skipped while scanning features horizontally and vertically on the image.

In CNNs, striding happens as data passes from one network layer to the next. There are two choices: either decrease the data size or keep it the same. Both padding and stride affect the data size. Padding matters when striding because, without padding, the next layer receives smaller data.

When a stride is used, the filter starts in the top-left corner and calculates the value of the first node. It then moves two units at a time, and it can eventually extend beyond the edge of the image, leaving a gap. Padding is used to fill the void created by striding.

Let’s take an input layer of 5×5 with a 3×3 kernel as below:

Figure 23: 5×5 input layer.

Apply a stride of 1:

Figure 24: Stride of 1.

Apply a stride of 2:

Figure 25: Stride of 2.

Suppose we apply a stride of 3 while still looking at the 5×5 input — what would happen?

Figure 26: Applying a stride of 3.

Consequently, padding is required here. Padding is added around the entire input: padding with a width equal to the kernel width minus one is added at the sides, and padding with a height equal to the kernel height minus one is added above and beneath, so that the kernel can reach the extreme edges, as shown in Figure 27:

Figure 27: Stride with padding.

Hence, from the above pictorial representation:

Having no padding means that the data size decreases for the next layer, while sufficient padding keeps the size intact. Furthermore, a larger stride limits the overlap between two consecutive dot products in the convolution operation. It means that every output value in the activation will be more independent of the neighboring values.
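The 5×5 input and 3×3 kernel example can be traced with a short sketch that lists where the kernel is allowed to start along one dimension. The helper name `window_starts` is hypothetical, and positions are given in the coordinates of the (optionally padded) input.

```python
def window_starts(input_size, kernel_size, stride, padding=0):
    """Start indices where the kernel fits entirely inside the padded input."""
    padded_size = input_size + 2 * padding
    return list(range(0, padded_size - kernel_size + 1, stride))

# 5x5 input, 3x3 kernel, no padding.
for stride in (1, 2, 3):
    starts = window_starts(5, 3, stride)
    print("stride", stride, "-> starts", starts, "-> output size", len(starts))

# Stride 3 with one pixel of zero padding on each side.
print(window_starts(5, 3, 3, padding=1))  # [0, 3]
```

Without padding, a stride of 3 leaves a single valid position, so the last two columns of the input are never visited. With one pixel of padding, the two windows together cover all five original columns.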

Fully Connected Layer

This layer computes a weighted sum of all its inputs to determine the final prediction; its input is the flattened output of the last pooling layer. Fully connected, as the name states, means that every node in one layer is connected to every node in the next layer. It performs classification based on the features extracted by the previous layers [11].

Figure 28: A fully connected layer.

CNNs can be broken down into two categories:

Feature extraction
Classification

The fully connected layer’s main responsibility is to do classification. It is used with a softmax or sigmoid activation unit for the result.
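A fully connected layer can be sketched as a matrix multiplication plus a bias. The weights, bias, and input below are made-up values for illustration only; in a trained network they would be learned.

```python
import numpy as np

# Hypothetical flattened features from the previous layer (3 inputs).
features = np.array([1.0, 2.0, 3.0])

# One row of weights per output neuron: every output sees every input.
weights = np.array([[1.0, 0.0, -1.0],
                    [0.5, 0.5, 0.5]])
bias = np.array([0.0, 1.0])

scores = weights @ features + bias
print(scores)  # [-2.  4.]
```

These raw scores are what a softmax or sigmoid activation then turns into class probabilities.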

Non-Linear Layers

The activation function applied to the last layer is very different from the others. The activation used for multiclass classification is the softmax function, which normalizes the output of the fully connected layer into probabilities between 0 and 1 that sum to 1.

Typically, softmax is used only for the output layer of neural networks that need to classify inputs into multiple categories. Neural networks in general, and CNNs in particular, rely on a non-linear “trigger” function to signal definite identification of possible features on each hidden layer.
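The normalization that softmax performs can be sketched in plain Python. The input scores are arbitrary example values; subtracting the maximum before exponentiating is a common numerical-stability step and does not change the result.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities between 0 and 1 that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(sum(probs))
```

The largest score receives the largest probability, and the probabilities add up to 1 (up to floating-point rounding).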

To efficiently implement this non-linear layer, CNNs use the below functions:

ReLUs (Rectified Linear Units)
Continuous Trigger function

Keras code with the non-linear function “ReLU”:

model.add(Dense(512, activation='relu'))

Here, the layer has 512 hidden units.

Keras code with the non-linear function “Softmax”:

model.add(Dense(10, activation='softmax'))

Python Implementation of Convolutional Neural Networks (CNNs)

Keras implementation of a CNN:

Import all required libraries

import numpy as np
import pandas as pd
from keras.optimizers import SGD
from keras.datasets import cifar10
from keras.models import Sequential
from keras import utils
from keras.layers import Dropout, Dense, Flatten, Conv2D, MaxPooling2D

Load the CIFAR-10 data:

(X, y), (X_test, y_test) = cifar10.load_data()

Display test dataset

X_test

Figure 29: CIFAR-10.

Normalize the data:

X, X_test = X.astype('float32')/255.0, X_test.astype('float32')/255.0

Convert to categorical:

y, y_test = utils.to_categorical(y, 10), utils.to_categorical(y_test, 10)

Initialize the model:

model = Sequential()

Add a convolutional layer with the parameters below:

Feature maps = 32
Kernel size = 3×3
Input shape = 32×32
Channels = 3
Padding = same → The output has the same height and width as the input.

model.add(Conv2D(32, (3, 3), input_shape=(32, 32, 3), padding='same', activation='relu'))

Add the dropout rate:

model.add(Dropout(0.2))

Add another convolutional layer with padding = valid.

padding = valid → No padding is added, so the output is slightly smaller than the input (30×30 here, for a 3×3 kernel on a 32×32 input).

model.add(Conv2D(32, (3, 3), activation='relu', padding='valid'))

Add a Max Pooling layer.

model.add(MaxPooling2D(pool_size=(2, 2)))

Flatten the data:

In CNNs, it is important to flatten the data before passing it into the output or dense layer.

model.add(Flatten())

Add dense layer:

model.add(Dense(512, activation='relu'))

Here, the number of hidden units is 512.

Add dropout:

model.add(Dropout(0.3))

Add the output dense layer:

model.add(Dense(10, activation='softmax'))

Compile the model:

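# Note: the `decay` argument is accepted only by older Keras releases; newer ones may require a learning-rate schedule instead.
model.compile(loss='categorical_crossentropy', optimizer=SGD(momentum=0.5, decay=0.0004), metrics=['accuracy'])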

Fit the model for 25 epochs:

model.fit(X, y, validation_data=(X_test, y_test), epochs=25, batch_size=512)

Figure 30: Training of CNNs.

Check accuracy:

print("Accuracy: %.2f%%" % (model.evaluate(X_test, y_test)[1] * 100))

Hyperparameters for CNNs

Hyperparameters are very important for controlling the learning process. They are set before training and govern aspects of the network structure, such as the number of hidden units. The following should be kept in mind when optimizing:

Max Pooling Shape

In max pooling, the maximum value is selected within a matrix. The size of the matrix could be 2×2 or 3×3; 2×2 is typical. Very large input volumes may warrant 4×4 pooling in the lower layers. However, choosing larger shapes will dramatically reduce the signal’s dimension and may result in excess information loss.

Code example:

model.add(MaxPooling1D(pool_size=2))

Filter Shape

It is crucial to find the right level of granularity in a given dataset without overfitting.

Code example:

model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))

Number of Filters

The number of filters should be selected carefully because the number of feature maps directly controls the capacity and depends on the number of available examples and task complexities [9].

Code example:

model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))

Regularization Methods in CNNs

Regularization is a method of including extra information to solve an ill-posed problem or to prevent overfitting. CNNs also use regularization to handle these problems. Below are the types of regularization techniques used by CNNs:

Empirical
Explicit

Different categories of empirical regularization:

Dropout
DropConnect
Stochastic pooling

Code implementation of dropout in the layer:

model.add(Dropout(0.2))

Different categories of explicit regularization:

Early stopping
Weight decay
Number of parameters
Max norm constraints

Early Stopping

Overfitting is a common problem in machine learning and deep learning. There are several ways to avoid it, and early stopping is one of them. It stops training once the monitored metric stops improving.

Code snippet implementation:

from keras.callbacks import EarlyStopping
earlystop = EarlyStopping(monitor = 'val_loss', min_delta = 0, 
patience = 3, verbose = 1, restore_best_weights = True)

Explanation from the above code:

monitor: The quantity to monitor, i.e., val_loss.
min_delta: The minimum change in the monitored value that counts as an improvement. For example, if min_delta = 1, an absolute change of less than 1 is treated as no improvement [12].
patience: The number of epochs with no improvement after which training will be stopped.
restore_best_weights: If set to True, the model keeps the weights from the epoch with the best monitored value once training stops.
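How patience and min_delta interact can be sketched in plain Python. This is a simplified illustration of the idea, not the Keras implementation; the function name and the validation-loss values are made up for the example.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:  # improved by more than min_delta
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses: the best value is reached at epoch 2.
losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.66, 0.60]
stop_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch)  # 5
```

In Keras, the callback only takes effect when it is passed to training, for example `model.fit(..., callbacks=[earlystop])`.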

Conclusion

Convolutional neural networks are a special kind of multi-layer neural network, mainly designed to extract features. They recognize visual patterns directly from pixel images with minimal preprocessing.

CNNs use two operations called convolution and pooling to reduce an image to its essential features and use those features to understand and classify the image appropriately [6].

Another benefit of CNNs is that they are easier to train and have fewer parameters than fully connected networks with the same number of hidden units [13].

Convolutional neural networks (CNNs) are used in various fields, such as healthcare (to diagnose diseases like pneumonia, diabetes, and breast cancer), self-driving cars, surveillance monitoring, and others [7].