Deep Learning with Python: Neural Networks (complete tutorial)

I will present some useful Python code that can be easily applied in other similar cases (just copy, paste, run) and walk through every line of code with comments so that you can replicate the examples.

Today, Deep Learning is so popular that many companies want to use it even though they don’t fully understand it. Often, data scientists first have to simplify these complex algorithms for the Business and then explain and justify the results of the models, which is not always simple with Neural Networks. I think the best way to do it is through visualization.

Neural Networks are based on a collection of connected units (neurons), which, just like the synapses in a brain, can transmit a signal to other neurons, so that, acting like interconnected brain cells, they can learn and make decisions in a more human-like manner.

Deep Learning is a type of machine learning that imitates the way humans gain certain types of knowledge, and it has grown more popular over the years compared to standard models. While traditional algorithms are linear, Deep Learning models, generally Neural Networks, are stacked in a hierarchy of increasing complexity and abstraction (hence the “deep” in Deep Learning).

In this article, I will show how to build Neural Networks with Python and how to explain Deep Learning to the Business using visualization and creating an explainer for model predictions.

Setup

There are two main libraries for building Neural Networks: TensorFlow (developed by Google) and PyTorch (developed by Facebook). They can perform similar tasks, but the former is more production-ready while the latter is good for building rapid prototypes because it is easier to learn.

Those two libraries are favored by the community and businesses because they can leverage the power of NVIDIA GPUs. That is very useful, and sometimes necessary, for processing big datasets like a corpus of text or a gallery of images.

For this tutorial, I’m going to use TensorFlow and Keras, a higher-level module that is far more user-friendly than pure TensorFlow and PyTorch, although a bit slower.

The first step is to install TensorFlow through the terminal:

pip install tensorflow

If you want to enable GPU support, you can read the official documentation or follow this guide. After setting it up, your Python instructions will be translated into CUDA by your machine and processed by the GPUs, so your models will run dramatically faster.

Now we can import the main modules from TensorFlow Keras into our notebook and start coding:

from tensorflow.keras import models, layers, utils, backend as K  # model building blocks
import matplotlib.pyplot as plt  # plotting
import shap  # model explainability

Artificial Neural Networks

ANNs are made of layers with an input and an output dimension. The latter is determined by the number of neurons (also called “nodes”), the computational units that combine the weighted inputs through an activation function (which helps the neuron switch on/off). The weights, like in most machine learning algorithms, are randomly initialized and optimized during training to minimize a loss function.

The layers can be grouped as:

  • Input layer has the job of passing the input vector to the Neural Network. If we have a matrix of 3 features (shape N x 3), this layer takes 3 numbers as the input and passes the same 3 numbers to the next layer.
  • Hidden layers represent the intermediary nodes; they apply several transformations to the numbers in order to improve the accuracy of the final result, and their output size is defined by the number of neurons.
  • Output layer returns the final output of the Neural Network. If we are doing a simple binary classification or regression, the output layer shall have only 1 neuron (so that it returns only 1 number). In the case of a multiclass classification with 5 different classes, the output layer shall have 5 neurons.

The simplest form of ANN is the Perceptron, a model with one layer only, very similar to the linear regression model. Asking what happens inside a Perceptron is equivalent to asking what happens inside a single node of a multi-layer Neural Network… let’s break it down.

Let’s say we have a dataset of N rows, 3 features and 1 target variable (e.g. binary 1/0):

Image by author. I put some random numbers between 0 and 1 (data should always be scaled before being fed into a Neural Network).

Just like in every other machine learning use case, we are going to train a model to predict the target using the features row by row. Let’s start with the first row:

Image by author

What does “training a model” mean? Searching for the best parameters in a mathematical formula that minimize the error of your predictions. In regression models (e.g. linear regression) you have to find the best weights, while in tree-based models (e.g. random forest) it’s about finding the best splitting points…

Image by author

Usually, the weights are randomly initialized then adjusted as the learning proceeds. Here I’ll just set them all as 1:

Image by author

So far we haven’t done anything different from a linear regression (which is pretty straightforward for the business to understand). Now, here’s the upgrade from a linear model Σ(xi*wi)=Y to a non-linear one f(Σ(xi*wi))=Y … enter the activation function.

Image by author

The activation function defines the output of that node. There are many, and one can even create custom functions; you can find the details in the official documentation and have a look at this cheat sheet. If we set a simple linear function in our example, then the model would be no different from a linear regression.

I shall use a binary step activation function that returns 1 or 0 only:

Image by author

We have the output of our Perceptron, a single-layer Neural Network that takes some inputs and returns 1 output. Now the training of the model would continue by comparing the output with the target, calculating the error and optimizing the weights, reiterating the whole process again and again.

Image by author

And here’s the common representation of a neuron:

Image by author
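To make the single-neuron math concrete, here is a minimal sketch in plain Python that reproduces the steps above: a weighted sum of 3 inputs, with all weights set to 1 and a binary step activation (the input values are made up for illustration):

# hypothetical input row with 3 features, already scaled between 0 and 1
x = [0.7, 0.2, 0.5]
w = [1, 1, 1]  # weights all initialized to 1, as in the example above

# weighted sum: Σ(xi * wi)
weighted_sum = sum(xi * wi for xi, wi in zip(x, w))

# binary step activation: returns 1 if the sum is positive, 0 otherwise
output = 1 if weighted_sum > 0 else 0

print(weighted_sum, output)  # 1.4 1

From here, training would compare this output with the target, compute the error and adjust the weights, exactly as described above.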

Deep Neural Networks

One could say that all the Deep Learning models are Neural Networks but not all the Neural Networks are Deep Learning models. Generally speaking, “Deep” Learning applies when the algorithm has at least 2 hidden layers (so 4 layers in total including input and output).

Imagine replicating the neuron process 3 times simultaneously: since each node (weighted sum & activation function) returns a value, we would have the first hidden layer with 3 outputs.

Image by author

Now let’s do it again using those 3 outputs as the inputs for the second hidden layer, which returns 3 new numbers. Finally, we shall add an output layer (1 node only) to get the final prediction of our model.

Image by author

Remember that the layers can have a different number of neurons and a different activation function, and in each node, weights are trained to optimize the final result. That’s why the more layers you add, the bigger the number of trainable parameters gets.

Now you can review the full picture of a Neural Network:

Image by author

Please note that, in order to keep it as simple as possible, I haven’t mentioned certain details that might not be of interest to the Business, but that a data scientist should definitely be aware of. In particular:

  • Bias: inside each neuron, the linear combination of inputs and weights also includes a bias, similar to the intercept in a linear equation, therefore the full formula of a neuron is

f( Σ(xi * wi) + bias )

  • Backpropagation: during training, the model learns by propagating the error back into the nodes and updating the parameters (weights and biases) to minimize the loss.
  • Gradient Descent: the optimization algorithm used to train Neural Networks; it finds a local minimum of the loss function by taking repeated steps in the direction of steepest descent.
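For the data scientists in the audience, here is a minimal numpy sketch of Gradient Descent applied to a single linear neuron; the toy data and learning rate are made up purely for illustration:

import numpy as np

# hypothetical toy data: 3 scaled features, binary target (illustration only)
X = np.array([[0.7, 0.2, 0.5],
              [0.1, 0.9, 0.4],
              [0.6, 0.3, 0.8]])
y = np.array([1.0, 0.0, 1.0])

w = np.ones(3)  # weights initialized to 1
b = 0.0         # bias
lr = 0.1        # learning rate (size of each step)

for epoch in range(200):
    y_hat = X @ w + b                  # forward pass of a linear neuron
    error = y_hat - y
    grad_w = 2 * X.T @ error / len(y)  # gradient of the MSE loss w.r.t. the weights
    grad_b = 2 * error.mean()          # gradient w.r.t. the bias
    w -= lr * grad_w                   # step against the gradient (steepest descent)
    b -= lr * grad_b

print(w.round(2), round(b, 2))

Each pass over the data nudges the weights and bias in the direction that reduces the loss, which is what happens (with backpropagation computing the gradients) inside every layer of a Neural Network.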

Model Design

The easiest way to build a Neural Network with TensorFlow is with the Sequential class of Keras. Let’s use it to make the Perceptron from our previous example, so a model with only one Dense layer. It is the most basic layer, as it connects all of its inputs to all of its neurons, with each neuron producing one output.

model = models.Sequential(name="Perceptron", layers=[

layers.Dense( #a fully connected layer
name="dense",
input_dim=3, #with 3 features as the input
units=1, #and 1 node because we want 1 output
activation='linear' #f(x)=x
)
])
model.summary()

Image by author

The summary function provides a snapshot of the structure and the size (in terms of parameters to train). In this case, we have just 4 (3 weights and 1 bias), so it’s pretty lightweight.

If you want to use an activation function that is not already included in Keras, like the binary step function that I showed in the visual example, you have to get your hands dirty with raw TensorFlow:

# define the function
import tensorflow as tf

def binary_step_activation(x):
    # return 1 if x > 0 else 0
    return K.switch(x > 0, tf.math.divide(x, x), tf.math.multiply(x, 0))


# build the model
model = models.Sequential(name="Perceptron", layers=[
    layers.Dense(
        name="dense",
        input_dim=3,
        units=1,
        activation=binary_step_activation
    )
])

Now let’s try to move from the Perceptron to a Deep Neural Network. You will probably ask yourself a few questions:

  1. How many layers? The right answer is “try different variants and see what works”. I usually work with 2 Dense hidden layers with Dropout, a technique that reduces overfitting by randomly setting a fraction of the inputs to 0. Hidden layers are useful to capture the non-linearity of the data, so if you don’t need non-linearity then you can avoid hidden layers. Too many hidden layers will lead to overfitting.

Image by author

  2. How many neurons? The number of hidden neurons should be between the size of the input layer and the size of the output layer. My rule of thumb is (number of inputs + 1 output)/2.
  3. What activation function? There are many and we can’t say that one is absolutely better. Anyway, the most used one is ReLU, a piecewise linear function that returns the input unchanged if it’s positive and 0 otherwise, and it is mainly used for hidden layers. Besides, the output layer must have an activation compatible with the expected output. For example, the linear function is suited for regression problems while the Sigmoid is frequently used for classification.
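To give a feel for the two activations just mentioned, here is a tiny numpy sketch (the sample values are made up):

import numpy as np

def relu(x):
    # piecewise linear: passes positive values through, zeroes out the rest
    return np.maximum(0, x)

def sigmoid(x):
    # squashes any real number into (0, 1), handy for binary classification outputs
    return 1 / (1 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))     # [0.  0.  0.  1.5 3. ]
print(sigmoid(x))  # values strictly between 0 and 1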

I’m going to assume an input dataset of N features and 1 binary target variable (most likely a classification use case).

n_features = 10

model = models.Sequential(name="DeepNN", layers=[
    ### hidden layer 1
    layers.Dense(name="h1", input_dim=n_features,
                 units=int(round((n_features+1)/2)),
                 activation='relu'),
    layers.Dropout(name="drop1", rate=0.2),

    ### hidden layer 2
    layers.Dense(name="h2", units=int(round((n_features+1)/4)),
                 activation='relu'),
    layers.Dropout(name="drop2", rate=0.2),

    ### layer output
    layers.Dense(name="output", units=1, activation='sigmoid')
])
model.summary()

Image by author

Please note that the Sequential class isn’t the only way to build a Neural Network with Keras. The Model class gives more flexibility and control over the layers, and it can be used to build more complex models with multiple inputs/outputs. There are two major differences:

  • The Input layer needs to be specified while in the Sequential class it’s implied in the input dimension of the first Dense layer.
  • The layers are stored as objects and can be applied to the outputs of other layers, like: output = layer(…)(input)

This is how you can use the Model class to build our Perceptron and DeepNN:

# Perceptron
inputs = layers.Input(name="input", shape=(3,))
outputs = layers.Dense(name="output", units=1,
activation='linear')(inputs)
model = models.Model(inputs=inputs, outputs=outputs,
name="Perceptron")


# DeepNN

### layer input
inputs = layers.Input(name="input", shape=(n_features,))

### hidden layer 1
h1 = layers.Dense(name="h1", units=int(round((n_features+1)/2)), activation='relu')(inputs)
h1 = layers.Dropout(name="drop1", rate=0.2)(h1)

### hidden layer 2
h2 = layers.Dense(name="h2", units=int(round((n_features+1)/4)), activation='relu')(h1)
h2 = layers.Dropout(name="drop2", rate=0.2)(h2)

### layer output
outputs = layers.Dense(name="output", units=1, activation='sigmoid')(h2)

model = models.Model(inputs=inputs, outputs=outputs, name="DeepNN")

One can always check if the number of parameters in the model summary is the same as the one from Sequential.
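As a quick sanity check (assuming n_features = 10 as above): the hidden layers have int(round(11/2)) = 6 and int(round(11/4)) = 3 neurons, so both versions should report (10*6 + 6) + (6*3 + 3) + (3*1 + 1) = 66 + 21 + 4 = 91 trainable parameters.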

Visualization

Remember, we are telling a story to the Business and visualization is our best ally. I prepared a function to plot the structure of an Artificial Neural Network from its TensorFlow model; here’s the full code:
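A minimal sketch of such a visualize_nn helper, assuming a single-input model built from Dense (and optional Dropout) layers, could look like this; it simply draws the neurons of each layer and the connections between them:

import numpy as np
import matplotlib.pyplot as plt

def visualize_nn(model, description=False, figsize=(10, 8)):
    ## layer sizes: number of model inputs, then the units of every Dense layer
    layer_sizes = [model.input_shape[-1]]
    layer_sizes += [l.units for l in model.layers if l.__class__.__name__ == "Dense"]

    fig, ax = plt.subplots(figsize=figsize)
    ax.axis('off')
    x_step = 1.0 / max(len(layer_sizes) - 1, 1)
    for i, size in enumerate(layer_sizes):
        x = i * x_step
        ys = np.linspace(0, 1, size + 2)[1:-1]  # vertical positions of the neurons
        ## connect each neuron to every neuron of the previous layer
        if i > 0:
            prev_ys = np.linspace(0, 1, layer_sizes[i-1] + 2)[1:-1]
            for y1 in prev_ys:
                for y2 in ys:
                    ax.plot([x - x_step, x], [y1, y2], color='lightgray', zorder=1)
        ## draw the neurons of this layer
        ax.scatter([x]*size, ys, s=300, color='steelblue', zorder=2)
        if description:
            label = "Input" if i == 0 else ("Output" if i == len(layer_sizes)-1 else "Hidden "+str(i))
            ax.text(x, 1.02, label+" ("+str(size)+")", ha='center', va='bottom')
    plt.show()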

Let’s try it out on our 2 models, first the Perceptron:

visualize_nn(model, description=True, figsize=(10,8))

Image by author

then the Deep Neural Network:

Image by author

TensorFlow provides a tool for plotting the model structure as well; you might want to use it for more complex Neural Networks with more complicated layers (CNN, RNN, …). Sometimes it’s a bit tricky to set up; if you have issues, this post might help.

utils.plot_model(model, to_file='model.png', show_shapes=True, show_layer_names=True)

Image by author

That command also saves the image on your machine, so if you just wanted to plot it in your notebook, you can run the following to delete the file afterwards:

import os
os.remove('model.png')

Train & Test

Finally, it’s time to train our Deep Learning model. In order for it to run, we must “compile” it, or, to put it another way, define the Optimizer, the Loss function, and the Metrics. I usually use the Adam optimizer, a replacement optimization algorithm for gradient descent and one of the most popular adaptive optimizers. The other arguments depend on the use case.

In (binary) classification problems, you should use a (binary) Cross-Entropy loss, which compares each of the predicted probabilities to the actual class output. As for the metrics, I like to monitor both the Accuracy and the F1-score, a metric that combines Precision and Recall (the latter has to be implemented manually, as it is not already included in TensorFlow).

# define metrics
def Recall(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

def Precision(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

def F1(y_true, y_pred):
    precision = Precision(y_true, y_pred)
    recall = Recall(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))


# compile the neural network
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy', F1])

On the other hand, in regression problems, I usually set the MAE as the loss and the R-squared as the metric.

# define metrics
def R2(y, y_hat):
    ss_res = K.sum(K.square(y - y_hat))
    ss_tot = K.sum(K.square(y - K.mean(y)))
    return 1 - ss_res/(ss_tot + K.epsilon())


# compile the neural network
model.compile(optimizer='adam', loss='mean_absolute_error',
              metrics=[R2])

Before starting the training, we also need to decide the Epochs and Batches: since the dataset might be too large to be processed all at once, it is split into batches (the higher the batch size, the more memory space you need). Backpropagation and the consequent parameter update happen once per batch. An epoch is one pass over the full training set. So, if you have 100 observations and the batch size is 20, it will take 5 batches to complete 1 epoch. The batch size is usually a power of 2 (common values: 32, 64, 128, 256) because computers typically organize memory in powers of 2. I tend to start with 100 epochs and a batch size of 32.

During the training, we would expect to see the metrics improving and the loss decreasing epoch by epoch. Moreover, it’s good practice to keep a portion of the data (20%-30%) for validation. In other words, the model will set apart this fraction of data to evaluate the loss and metrics at the end of each epoch, outside the training.

Assuming you have your data ready in some X and y arrays (if not, you can simply generate random data like

import numpy as np

X = np.random.rand(1000,10)
y = np.random.choice([1,0], size=1000)

), you can launch and visualize the training as follows:

# train/validation
training = model.fit(x=X, y=y, batch_size=32, epochs=100, shuffle=True, verbose=0, validation_split=0.3)


# plot
metrics = [k for k in training.history.keys() if ("loss" not in k) and ("val" not in k)]
fig, ax = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(15,3))

## training
ax[0].set(title="Training")
ax11 = ax[0].twinx()
ax[0].plot(training.history['loss'], color='black')
ax[0].set_xlabel('Epochs')
ax[0].set_ylabel('Loss', color='black')
for metric in metrics:
    ax11.plot(training.history[metric], label=metric)
ax11.set_ylabel("Score", color='steelblue')
ax11.legend()

## validation
ax[1].set(title="Validation")
ax22 = ax[1].twinx()
ax[1].plot(training.history['val_loss'], color='black')
ax[1].set_xlabel('Epochs')
ax[1].set_ylabel('Loss', color='black')
for metric in metrics:
    ax22.plot(training.history['val_'+metric], label=metric)
ax22.set_ylabel("Score", color="steelblue")
plt.show()

Image by author. Classification example, Notebook here.
Image by author. Regression example, Notebook here.

Those plots are taken from two actual use cases which compare standard machine learning algorithms with Neural Networks (links under each image).

Explainability

We trained and tested our model, but we still haven’t convinced the Business about the results… what can we do? Easy, we build an explainer to show that our Deep Learning model is not a black box.

I find that SHAP works very well with Neural Networks: for every prediction, it is able to estimate the contribution of each feature to the value predicted by the model. Basically, it answers the question “why does the model say this is a 1 and not a 0?”.

You can use the following code:
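A minimal sketch of such an explainer_shap helper, covering only the Neural Network case through shap’s DeepExplainer, could look like the following; the argument names mirror the call further down, while the body (background sampling, bar chart) is my own assumption:

import numpy as np
import matplotlib.pyplot as plt
import shap

def explainer_shap(model, X_names, X_instance, X_train=None, task="classification", top=10):
    ## background sample that DeepExplainer uses to estimate expected values
    idx = np.random.choice(len(X_train), size=min(100, len(X_train)), replace=False)
    explainer = shap.DeepExplainer(model, X_train[idx])
    shap_values = explainer.shap_values(X_instance.reshape(1, -1))

    ## DeepExplainer may return a list (one array per model output): take the first
    values = np.array(shap_values[0] if isinstance(shap_values, list) else shap_values).flatten()

    ## keep the features with the largest absolute contribution
    order = np.argsort(np.abs(values))[-top:]
    names = [X_names[i] for i in order]
    contribs = values[order]

    colors = ["crimson" if v > 0 else "steelblue" for v in contribs]
    plt.barh(names, contribs, color=colors)
    plt.title("Contribution to the predicted value ("+task+")")
    plt.xlabel("SHAP value")
    plt.show()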

Please note that you can use this function on other Machine Learning models as well (e.g. Linear Regression, Random Forest), not just Neural Networks: if the X_train argument is kept as None, my function assumes it is not Deep Learning.

Let’s test it out on the classification and regression examples:

i = 1

explainer_shap(model,
               X_names=list_feature_names,
               X_instance=X[i],
               X_train=X,
               task="classification",  # task="regression"
               top=10)

Image by author. Classification example, Notebook here. Titanic dataset, the prediction is “Survived” mainly because the dummy variable Sex_male = 0, so the passenger was a woman.
Image by author. Regression example, Notebook here. House Price dataset, the major driver of this house price is a large basement.

Conclusion

This article has been a tutorial demonstrating how to design and build Artificial Neural Networks, deep and shallow. I broke down step by step what happens inside a single neuron and, more generally, inside the layers. I kept the story as simple as if we were explaining Deep Learning to the Business, using plenty of visualizations.

In the second part of the tutorial, we used TensorFlow to create some Neural Networks, from the Perceptron to a more complex one. Then, we trained the Deep Learning model and assessed its explainability for both classification and regression use cases.

I hope you enjoyed it! Feel free to contact me for questions and feedback or just to share your interesting projects.