Neural Networks

Convolutional Neural Networks Explained — How To Successfully Classify Images in Python

A visual explanation of Deep Convolutional Nets with a complete Python example that teaches you how to build your own DCNs

Image recognition with Deep Convolutional Neural Networks (DCN). Image by author.

Intro

A particular category of Neural Networks called Convolutional Neural Networks (CNN) is designed for image recognition. While it may sound super fancy, I assure you that anyone can grasp the main ideas behind it.

In this article, I will go through the essential components of CNNs and provide you with illustrated examples of how each part works. I will also talk you through the Python code that you can use to build Deep Convolutional Neural Networks with the help of Keras/Tensorflow libraries.

Convolutional Neural Networks within the universe of Machine Learning algorithms
What is the structure of Convolutional Neural Networks, and how do they work?
A complete Python example showing you how to build and train your own Deep CNN models

Deep Convolutional Neural Networks (DCN) within the Machine Learning universe

The below chart is my attempt to categorise the most common Machine Learning algorithms.

While we often use Neural Networks in a supervised manner with labelled training data, I felt that their unique approach to Machine Learning deserved a separate category.

Hence, my graph shows Neural Networks (NNs) branching out from the core of the Machine Learning universe. Convolutional Neural Networks occupy a sub-branch of NNs and contain algorithms such as Deep Convolutional Networks (DCN), Deconvolutional Networks (DN) and Deep Convolutional Inverse Graphics Networks (DCIGN).

Machine Learning algorithm classification. Interactive chart created by the author.

What is the structure of Convolutional Neural Networks, and how do they work?

Let’s start by comparing the structure of a typical Feed-Forward Neural Network and a Convolutional Neural Network.

In a traditional Feed-Forward Neural Network, we have Input, Hidden and Output layers, where each of them may contain multiple nodes. We commonly refer to networks with more than one Hidden layer as “Deep” networks.

Deep Feed-Forward Neural Network. Image by author.

Meanwhile, Convolutional Neural Networks (CNN) tend to be multi-dimensional and contain some special layers, unsurprisingly called Convolutional layers. Moreover, Convolutional layers are often accompanied by Pooling layers (Max or Average), which help reduce the size of convolved features.

Convolutional Neural Network. Image by author.

Convolutional layer

It is worth highlighting that we can have Convolutional layers of different dimensions:

One-dimensional (Conv1D) — suitable for text embeddings, time-series or other sequences.
Two-dimensional (Conv2D) — typical choice for images.
Three-dimensional (Conv3D) — can be used for videos, which are essentially just sequences of images, or for 3D images such as MRI scans.

Since I focus on image recognition in this article, let’s take a closer look at how 2D convolution works. 1D and 3D convolutions work in the same way, except they have one fewer or one extra dimension.

Input image being passed through a convolutional layer. Image by author.

Note that for a greyscale picture, we would only have one channel. Meanwhile, we would have three separate channels for a colour picture, each containing values for its respective colour (Red, Green, Blue).

We can also specify how many filters we want to have in the Convolutional layer. Having multiple filters lets us extract a broader range of features from the image.

How does convolution work?

There are three parts to a convolution: Input (e.g., 2D image), a filter (a.k.a. kernel) and an output (a.k.a. convolved feature).

Convolution. The first calculation in the iterative process of applying a filter over the input data. Image by author.

The convolution process is iterative. First, a filter is applied over a section of an input image, and the output value is recorded. The filter is then shifted by one position when stride=1 or by multiple positions when the stride is set to a higher number, and the same process is repeated until the convolved feature is complete.
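To make the sliding process concrete, here is a minimal NumPy sketch of it. The function name `convolve2d` and the example input and filter values are my own illustrative choices, not taken from the figures. Like deep learning libraries, it does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the convolved feature."""
    k_rows, k_cols = kernel.shape
    out_rows = (image.shape[0] - k_rows) // stride + 1
    out_cols = (image.shape[1] - k_cols) // stride + 1
    output = np.zeros((out_rows, out_cols))
    for i in range(out_rows):
        for j in range(out_cols):
            r, c = i * stride, j * stride
            window = image[r:r + k_rows, c:c + k_cols]
            # Multiply element-wise and sum: one output value per filter position
            output[i, j] = np.sum(window * kernel)
    return output

image = np.arange(25).reshape(5, 5)  # hypothetical 5x5 input
kernel = np.ones((3, 3))             # hypothetical 3x3 filter

print(convolve2d(image, kernel, stride=1).shape)  # (3, 3)
print(convolve2d(image, kernel, stride=2).shape)  # (2, 2)
```

With stride=1 the filter visits nine positions on a 5×5 input, while stride=2 leaves only four.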

The below gif image illustrates the process of applying a 3×3 filter on a 5×5 input.

Convolution in action. Gif image by author.

Let me elaborate on the above to give you a better understanding of the filter’s purpose. First, you will note that my custom filter has all 1’s down the middle column. This type of filter is designed to identify vertical lines in the input image as it gives a strong signal (high values) whenever vertical lines are present.

For comparison, here is what the Convolved Feature (output) would look like if we applied a filter designed to find horizontal lines:

Filter designed to find horizontal lines. Image by author.

As you can see, the entire output is populated with the same value, meaning that there is no firm indication of a horizontal line being present in the input image.
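The same comparison can be reproduced in a few lines of NumPy. This is only a sketch: the 5×5 input with a single vertical line and the zero entries in both filters are assumptions of mine, since the exact values in the images are not listed in the text.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Hypothetical input: a 5x5 image with a vertical line down the middle column
image = np.zeros((5, 5), dtype=int)
image[:, 2] = 1

# Filters with 1's down the middle column / across the middle row (zeros elsewhere are assumed)
vertical_filter = np.array([[0, 1, 0],
                            [0, 1, 0],
                            [0, 1, 0]])
horizontal_filter = np.array([[0, 0, 0],
                              [1, 1, 1],
                              [0, 0, 0]])

def apply_filter(image, kernel):
    """Apply `kernel` at every position (stride=1, no padding)."""
    windows = sliding_window_view(image, kernel.shape)  # shape (3, 3, 3, 3) here
    return (windows * kernel).sum(axis=(2, 3))

print(apply_filter(image, vertical_filter))
# [[0 3 0]
#  [0 3 0]
#  [0 3 0]]
print(apply_filter(image, horizontal_filter))
# [[1 1 1]
#  [1 1 1]
#  [1 1 1]]
```

The vertical filter produces a clear peak where the line sits, while the horizontal filter returns the same value everywhere.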

It is important to note that we do not need to specify values for different filters manually. The creation of filters is handled automatically during the training of the Convolutional Neural Network. We do, however, tell the algorithm how many filters we want to have.

Additional options

There are a couple more options for us to tweak when setting up a Convolutional layer:

Padding — in some scenarios, we may wish for the output to be the same size as the input. We can achieve that by adding some padding. At the same time, it may make it easier for the model to capture essential features residing at the edges of an image.

Convolution with padding around the input. Image by author.

Stride — if we have large images, then we may want to use larger strides, i.e., shifting a filter by multiple pixels at a time. While it does help to reduce the size of the output, larger strides may result in some features being missed, like in the example below:

Convolution in action using stride=(2,2). Gif image by author.
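The effect of padding and stride on the output size follows a simple formula, sketched below for one dimension. The helper name `conv_output_size` is hypothetical, and `padding` here means the number of pixels added on each side.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one dimension: floor((n + 2p - k) / s) + 1."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(5, 3))                # 3 -> 5x5 input, 3x3 filter, no padding
print(conv_output_size(5, 3, padding=1))     # 5 -> padding keeps the output the same size
print(conv_output_size(5, 3, stride=2))      # 2 -> a larger stride shrinks the output
print(conv_output_size(128, 3, padding=1))   # 128
```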

Multiple convolutional layers

It is often beneficial to set up multiple Convolutional layers to improve the network. The benefits arise from subsequent Convolutional layers identifying extra complexity within the image.

The first layer in a Deep Convolutional Network (DCN) tends to find low-level features (e.g., vertical, horizontal, diagonal lines…). Meanwhile, the deeper layers can identify higher-level characteristics, such as more complex shapes, representing real-world elements like eyes, nose, ears etc.

Pooling layer

It is common to add a Pooling layer following a Convolutional layer. Its purpose is to reduce the size of Convolved Features, which improves computational efficiency. Also, it can help to de-noise the data by keeping the strongest activations.

Pooling is performed to reduce the size of convolved features. Image by author.

There are two commonly used types of pooling:

Max pooling — takes the highest value from the area covered by the kernel (suitable for de-noising).
Average pooling — calculates the average value from the area covered by the kernel.

Illustration of Max Pooling and Average Pooling. Gif image by author.
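Both pooling types can be sketched in NumPy as follows. The function `pool2d` and the 4×4 input are illustrative assumptions. I also assume non-overlapping windows, i.e. a stride equal to the pool size, which is the Keras default.

```python
import numpy as np

def pool2d(feature, size=2, mode="max"):
    """Pool a 2D feature map over non-overlapping size x size windows."""
    rows, cols = feature.shape[0] // size, feature.shape[1] // size
    # Trim edges that do not fill a whole window, then group values into blocks
    blocks = feature[:rows * size, :cols * size].reshape(rows, size, cols, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

feature = np.array([[1, 3, 2, 4],
                    [5, 7, 6, 8],
                    [9, 2, 1, 0],
                    [3, 4, 5, 6]])  # hypothetical convolved feature

print(pool2d(feature, mode="max"))
# [[7 8]
#  [9 6]]
print(pool2d(feature, mode="average"))
# [[4.  5. ]
#  [4.5 3. ]]
```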

Flatten and Dense Layers

Once we have finished deriving Convolved Features, we need to flatten them. This enables us to have a one-dimensional input vector and utilise a traditional Feed-Forward Network architecture. In the end, we train the network to find the optimum weights and biases, which enables us to classify images correctly.

Feed-Forward section of the Convolutional Neural Network. Image by author.

Depending on the size and complexity of your data, you may want to have multiple pairs of Convolutional and Pooling layers followed by multiple Dense Layers, making your network “Deep.”

A complete Python example showing you how to build and train your own Deep CNN models

Setup

We will need to get the following data and libraries:

Caltech 101 image data set (source)

Data license: Attribution 4.0 International (CC BY 4.0)

Reference: L. Fei-Fei, R. Fergus and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. IEEE CVPR 2004, Workshop on Generative-Model Based Vision. 2004.

Pandas and Numpy for data manipulation
Open-CV and Matplotlib for ingesting and displaying images
Tensorflow/Keras for building Neural Networks
Scikit-learn library for splitting the data (train_test_split), label encoding (OrdinalEncoder), and model evaluation (classification_report)

Let’s import libraries:
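The libraries are imported in the usual way (for example, `import numpy as np`). As a minimal sketch of the version check, assuming the packages are installed under their common PyPI names (an assumption on my part, so adjust them to your environment), something like the following could be used:

```python
from importlib import metadata

# Display label -> PyPI distribution name (the distribution names are assumptions)
PACKAGES = {
    "Tensorflow/Keras": "tensorflow",
    "pandas": "pandas",
    "numpy": "numpy",
    "sklearn": "scikit-learn",
    "OpenCV": "opencv-python",
    "matplotlib": "matplotlib",
}

def installed_versions(packages=PACKAGES):
    """Return the installed version of each package, or a note if it is missing."""
    versions = {}
    for label, dist in packages.items():
        try:
            versions[label] = metadata.version(dist)
        except metadata.PackageNotFoundError:
            versions[label] = "not installed"
    return versions

for label, version in installed_versions().items():
    print(f"{label}: {version}")
```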

The above code prints package versions I used in this example:

Tensorflow/Keras: 2.7.0
pandas: 1.3.4
numpy: 1.21.4
sklearn: 1.0.1
OpenCV: 4.5.5
matplotlib: 3.5.1

Next, we download and ingest the Caltech 101 image data set. Note that we will only use four categories (“dalmatian”, “hedgehog”, “llama”, “panda”) in this example as opposed to all 101.

At the same time, we prep the data by resizing and standardising it, encoding labels and splitting it into train and test samples.
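The sketch below shows these preparation steps under a few loud assumptions: random arrays stand in for the real images (which would first be read and resized to 128×128, e.g. with `cv2.resize`), standardising is taken to mean scaling pixel values to the 0-1 range, and a test size of 20% with an arbitrary random seed is my guess at the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

rng = np.random.default_rng(0)
categories = ["dalmatian", "hedgehog", "llama", "panda"]

# Synthetic stand-in for 237 images that were already resized to 128x128 pixels
images = rng.integers(0, 256, size=(237, 128, 128, 3), dtype=np.uint8)
labels = rng.choice(categories, size=237).reshape(-1, 1)

# Scale pixel values from 0-255 to 0-1 (assumed meaning of "standardising")
X = images.astype("float32") / 255.0

# Encode text labels as numbers; the result keeps the [samples, 1] shape
y = OrdinalEncoder().fit_transform(labels)

# Hold out 20% of the samples for testing (assumed split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print("Shape of whole data: ", X.shape)
print("Shape of X_train: ", X_train.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_test: ", y_test.shape)
```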

The above code prints the shape of our data, which is [samples, rows, columns, channels] for input data and [samples, labels] for target data:

Shape of whole data:  (237, 128, 128, 3)
Shape of X_train:  (189, 128, 128, 3)
Shape of y_train:  (189, 1)
Shape of X_test:  (48, 128, 128, 3)
Shape of y_test:  (48, 1)

To better understand what data we are working with, let’s display a few input images.

Displaying 10 images from the training data. Image by author.

Training and evaluating Deep Convolutional Neural Network (DCN)

You can follow comments in the code to understand what each section does. In addition to that, here is a high-level description.

I have structured the model to have multiple Convolutional, Pooling and Dropout layers to create a “deep” architecture. As mentioned earlier in the article, the initial Convolutional layers help extract low-level features, while later ones identify more high-level features.

So the structure of my DCN model is:

Input layer
The first set of Convolutional, Max Pooling and Dropout layers
The second set of Convolutional, Max Pooling and Dropout layers
The third set of Convolutional, Max Pooling and Dropout layers
Flatten layer
Dense Hidden layer
Output layer

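Note that the Dropout layer randomly sets input units to 0 during training, based on the rate we provide (in this case, 0.2). Each input (feature/node) has a 20% chance of being zeroed on any given training step, so it contributes nothing to that update, while the remaining inputs are scaled up by 1/(1 - 0.2) so that their sum stays unchanged in expectation. Dropout is switched off at prediction time. The purpose of the Dropout layer is to help prevent overfitting.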
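To illustrate the idea, here is a small NumPy sketch of dropout as applied during training. It is not the Keras implementation, just an approximation of the same behaviour, and the function name and array sizes are my own.

```python
import numpy as np

def dropout(inputs, rate=0.2, training=True, rng=None):
    """Zero each input with probability `rate` and rescale the rest (training only)."""
    if not training or rate == 0.0:
        return inputs  # dropout does nothing at prediction time
    if rng is None:
        rng = np.random.default_rng()
    keep_mask = rng.random(inputs.shape) >= rate
    # Scale the kept inputs by 1 / (1 - rate) so the expected sum is unchanged
    return inputs * keep_mask / (1.0 - rate)

activations = np.ones((1000, 100))  # hypothetical layer inputs
dropped = dropout(activations, rate=0.2, rng=np.random.default_rng(42))

print((dropped == 0).mean())  # close to 0.2
print(dropped.mean())         # close to 1.0
```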

Finally, note that I have listed all possible parameters in the first set of Convolutional and Max Pooling layers as I wanted to give you an easy reference to what you can change. However, we keep most of them at default values, so we do not need to explicitly list them every time (see the second and third set of Convolutional and Max Pooling layers).
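A minimal Keras sketch of the structure listed above could look like this. The filter counts, kernel sizes, activations and the size of the Dense hidden layer are assumptions of mine for illustration and are not necessarily the values used in the original notebook. Only the layer order, the 128×128×3 input, the 0.2 dropout rate and the four output categories come from the article.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential(
    [
        keras.Input(shape=(128, 128, 3)),  # [rows, columns, channels]
        # First set: a few common parameters written out explicitly
        layers.Conv2D(filters=16, kernel_size=(3, 3), strides=(1, 1),
                      padding="valid", activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Dropout(0.2),
        # Second set: same layers, relying on default values
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.2),
        # Third set
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.2),
        # Feed-Forward section
        layers.Flatten(),
        layers.Dense(16, activation="relu"),
        layers.Dense(4, activation="softmax"),  # one probability per category
    ],
    name="dcn_sketch",
)

print(model.output_shape)  # (None, 4)
```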

With the model structure specified, let’s compile it, train it and print the results.

The above code prints the summary of a model structure:

Deep Convolutional Neural Network (DCN) model summary. Image by author.

It also prints the performance summary in the form of a classification report:

Deep Convolutional Neural Network (DCN) model results. Image by author.

We can see that the model has identified almost all training images correctly (f1-score of 0.99). However, the performance on the test data was not as good, with an f1-score of 0.81.

There may be some overfitting happening, so it is worth experimenting with various parameters and network structures to find the best setup. At the same time, the number of images we have is relatively small, making training and evaluation of the model much harder.

Additional evaluation

Finally, I wanted to see what category the model would put my dog in. While my dog is not a dalmatian, he is black and white. I wondered if the model would recognise him to be a dog and not a panda 😂

Image of my dog in Jupyter Notebook. Image by author.

Prep the image and use the previously trained DCN model to predict the label.

And here are the results:

Shape of the input:  (1, 128, 128, 3)

DCN model prediction:  [['dalmatian']]

Probabilities for each category:
dalmatian  :  0.92895913
hedgehog  :  0.004558794
llama  :  0.010929748
panda  :  0.055552367

So, the model has identified my dog as a dalmatian, although with a 5.5% probability of being a panda 😆
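For reference, here is how the predicted label follows from the probabilities printed above. This is a NumPy-only sketch that assumes the categories are ordered alphabetically, as in the printed output. With a fitted encoder, `OrdinalEncoder.inverse_transform` would be another way to map the index back to a name.

```python
import numpy as np

categories = np.array(["dalmatian", "hedgehog", "llama", "panda"])

# Probabilities copied from the output above, shaped [samples, categories]
probabilities = np.array([[0.92895913, 0.004558794, 0.010929748, 0.055552367]])

# The predicted category is the one with the highest probability
predicted_index = probabilities.argmax(axis=1)
predicted_label = categories[predicted_index]

print(predicted_label)                       # ['dalmatian']
print(round(float(probabilities.sum()), 4))  # 1.0
```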

Final remarks

I sincerely hope you enjoyed reading this article and obtained some new knowledge.

You can find the complete Jupyter Notebook code in my GitHub repository. Feel free to use it to build your own Deep Convolutional Neural Networks, and do not hesitate to get in touch if you have any questions or suggestions.

Also, you can find my other Neural Network articles here: Feed-Forward, Deep Feed-Forward, RNN, LSTM, GRU.

Cheers!
Saul Dobilas