Simple Guide to Hyperparameter Tuning in Neural Networks
This repository contains Jupyter notebook content associated with my series on fully connected neural networks.
All related code can now be found in my GitHub repository.
You can access the previous articles below. The first provides a simple introduction to the topic of neural networks, to those who are unfamiliar. The second article covers more intermediary topics such as activation functions, neural architecture, and loss functions.
This is the fourth article in my series on fully connected (vanilla) neural networks. In this article, we will be optimizing a neural network and performing hyperparameter tuning in order to obtain a high-performing model on the Beale function, one of many test functions commonly used for studying the effectiveness of various optimization techniques. This analysis can be reused for any function, but I recommend trying it out yourself on another common test function to test your skills. Personally, I find that optimizing a neural network can be incredibly frustrating (although not as bad as a GAN, if you're familiar with those) unless you have a clear and well-defined procedure to follow. I hope you enjoy this article and find it insightful.
For those reading who are not familiar with Jupyter notebooks, feel free to read more about them here.
By learning how to approach a difficult optimization function, the reader should be more prepared to deal with real-life scenarios for implementing neural networks.
Neural networks are fairly commonplace now in industry and research, but an embarrassingly large proportion of practitioners are unable to work with them well enough to produce high-performing networks capable of outperforming most other algorithms.

When applied mathematicians develop a new optimization algorithm, one thing they like to do is test it on a test function, which is sometimes called an artificial landscape. These artificial landscapes help us find a way of comparing the performance of various algorithms in terms of their:

- Convergence (how fast they reach the answer)
- Precision (how closely they approximate the exact answer)
- Robustness (whether they perform well for many functions or only a small subset)
- General performance (e.g., computational complexity)

From just scrolling down the Wikipedia article on optimization test functions, you can see that some of the functions are pretty nasty. Many of them have been chosen because they highlight specific issues that can plague optimization algorithms. For this article, we will be looking at a relatively innocuous-looking function called the Beale function.

This function does not look particularly terrifying, right? The reason it is used as a test function is that it assesses how well optimization algorithms perform in flat regions with very shallow gradients. In these cases, it is particularly difficult for gradient-based optimization procedures to reach any minimum, as they are unable to learn effectively.

The remainder of this article will follow the Jupyter notebook tutorial on my GitHub repository. We will discuss the way in which one would tackle this kind of artificial landscape. This landscape is analogous to the loss surface of a neural network. When training a neural network, the goal is to find the global minimum on the loss surface by performing some form of optimization, typically stochastic gradient descent.
Before we touch any neural networks, we first have to define the function and find its minimum (otherwise, how will we know we got the right answer?). The first step (after importing any relevant packages) is to define the Beale function in our notebook:
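A straightforward way to write it (a minimal sketch; the exact code in the notebook may differ slightly) is:

```python
import numpy as np

# Beale function:
# f(x, y) = (1.5 - x + xy)^2 + (2.25 - x + xy^2)^2 + (2.625 - x + xy^3)^2
def beale(x, y):
    return ((1.5 - x + x * y) ** 2
            + (2.25 - x + x * y ** 2) ** 2
            + (2.625 - x + x * y ** 3) ** 2)
```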
We then set some function boundaries since we have ballpark estimates for where the minimum is in this case (from our plot), as well as a step size for our grid mesh.
We then make a mesh grid of points based on this information and are ready to find the minimum.
Now we make a (terrible) initial guess.
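Put together, these three steps might look as follows (the boundary values, step size, and starting point are illustrative assumptions; the Beale function is usually studied on [−4.5, 4.5]):

```python
# Function boundaries (ballpark, from the plot) and the grid step size
xmin, xmax, xstep = -4.5, 4.5, 0.2
ymin, ymax, ystep = -4.5, 4.5, 0.2

# Mesh grid of points over the region
x_grid, y_grid = np.meshgrid(np.arange(xmin, xmax + xstep, xstep),
                             np.arange(ymin, ymax + ystep, ystep))

# A deliberately terrible initial guess, far from the true minimum
x0 = np.array([4.0, 4.0])
```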
We then use SciPy's `minimize` function and see what answer pops out. This is the result:
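A sketch of the call (the choice of the L-BFGS-B solver here is an assumption; any bounded gradient-based method would illustrate the point):

```python
from scipy.optimize import minimize

# minimize() expects a function of a single parameter vector
f = lambda p: beale(p[0], p[1])

result = minimize(f, x0, method='L-BFGS-B',
                  bounds=[(xmin, xmax), (ymin, ymax)])
print(result.x)
```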
It looks like the answer is (3, 0.5), and if you plug these values into the equation, you find that this is the minimum (it also says this on the Wikipedia page).
In the next section, we will start on our neural network.
Optimization in Neural Networks
A neural network can be defined as a framework that combines inputs and tries to guess the output. If we are lucky enough to have some results, called “the ground truth”, to compare the outputs produced by the network, we can calculate the error. So the network makes a guess, calculates some error function, guesses again while trying to minimize this error, and guesses again until the error does not go down anymore. This is optimization.
In neural networks, the most commonly used optimization algorithms are flavors of GD (gradient descent). The objective function used in gradient descent is the loss function we want to minimize.
This tutorial will focus on Keras now, so I will give a brief Keras refresher.
A Keras Refresher

Keras is a Python library for deep learning that can run on top of both Theano and TensorFlow, two powerful Python libraries for fast numerical computing, created and released by the MILA lab at the Université de Montréal and by Google, respectively.
Keras was developed to make developing deep learning models as fast and easy as possible for research and practical applications. It runs on Python 2.7 or 3.5 and can seamlessly execute on GPUs and CPUs.
Keras is built on the idea of a model. At its core, we have a sequence of layers called the `Sequential` model, which is a linear stack of layers. Keras also provides the functional API, a way to define complex models, such as multi-output models, directed acyclic graphs, or models with shared layers.
We can summarize the construction of deep learning models in Keras using the Sequential model as follows:
- Define your model: create a `Sequential` model and add layers.
- Compile your model: specify the loss function and optimizer, and call the `.compile()` function.
- Fit your model: train the model on data by calling the `.fit()` function.
- Make predictions: use the model to generate predictions on new data by calling functions such as `.evaluate()` or `.predict()`.
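Putting these four steps together, a minimal end-to-end sketch might look like this (the layer sizes and the `X_train`/`Y_train` arrays are placeholders, not the model we build later):

```python
from keras.models import Sequential
from keras.layers import Dense

# 1. Define: a Sequential model is a linear stack of layers
model = Sequential()
model.add(Dense(32, activation='relu', input_shape=(784,)))
model.add(Dense(10, activation='softmax'))

# 2. Compile: attach a loss function and an optimizer
model.compile(loss='categorical_crossentropy', optimizer='sgd',
              metrics=['accuracy'])

# 3. Fit: train on data (X_train, Y_train are placeholder arrays)
# model.fit(X_train, Y_train, epochs=10, batch_size=32)

# 4. Predict/evaluate on new data
# score = model.evaluate(X_test, Y_test)
# predictions = model.predict(X_new)
```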
You may be asking yourself — how can you examine the model’s performance as it is running? This is a good question, and the answer is by using callbacks.
Callbacks: taking a peek into our model while it’s training
You can look at what is happening at various stages of your model's training by using `callbacks`. A callback is a set of functions to be applied at given stages of the training procedure. You can use callbacks to get a view of the internal states and statistics of the model during training. You can pass a list of callbacks (via the keyword argument `callbacks`) to the `.fit()` method of the `Sequential` or `Model` classes. The relevant methods of the callbacks will then be called at each stage of the training.
- A callback function you are already familiar with is `keras.callbacks.History()`. This is automatically included in `.fit()`.
- Another very useful one is `keras.callbacks.ModelCheckpoint`, which saves the model with its weights at a certain point in the training. This can prove useful if your model runs for a long time and a system failure happens; not all is lost then. It is good practice to save the model weights only when an improvement is observed, as measured by the `acc` metric, for example.
- `keras.callbacks.EarlyStopping` stops the training when a monitored quantity has stopped improving.
- `keras.callbacks.LearningRateScheduler` changes the learning rate during training.
We will apply some callbacks later. For full documentation on callbacks, see https://keras.io/callbacks/.
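As a quick sketch of how callbacks are wired in (the file name and patience value are arbitrary choices for illustration):

```python
from keras.callbacks import ModelCheckpoint, EarlyStopping

# Save the weights only when the monitored quantity improves
checkpoint = ModelCheckpoint('best_model.h5', monitor='val_loss',
                             save_best_only=True)

# Stop training once validation loss has not improved for 5 epochs
early_stop = EarlyStopping(monitor='val_loss', patience=5)

# The callbacks are passed to .fit() as a list:
# model.fit(X_train, Y_train, validation_split=0.1,
#           epochs=60, callbacks=[checkpoint, early_stop])
```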
The first thing we must do is import a lot of different functions to make our lives easier.
If you want your network to use random numbers but still produce repeatable results, you can set a random seed. This produces the same sequence of numbers on each run, although they are still pseudorandom, and is a great way to compare models and test for reproducibility.
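For example (the seed value is arbitrary; with a TensorFlow backend you would seed TensorFlow as well):

```python
import random
import numpy as np

# Fix the seeds so the pseudorandom sequences repeat across runs
np.random.seed(42)
random.seed(42)
# With TensorFlow as the backend, also seed it:
# tf.set_random_seed(42)   # TF 1.x
# tf.random.set_seed(42)   # TF 2.x
```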
Step 1 — Deciding on the network topology (not really considered optimization but is very important)
We will use the MNIST dataset, which consists of grayscale images of handwritten digits (0–9) whose dimension is 28×28 pixels. Each pixel is 8 bits, so its value ranges from 0 to 255.
Obtaining the dataset is very easy since there is a function for it built into Keras.
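One call does it:

```python
from keras.datasets import mnist

# Downloads the data on first use, then loads it from a local cache
(X_train, y_train), (X_test, y_test) = mnist.load_data()
```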
The shapes of our X and Y training data are (60000, 28, 28) and (60000,), respectively. It is always a good idea to print some of the data to check the values (and the data type, if necessary).
We can check the training data by looking at one image of each of the digits to make sure that none of them are missing from our data.
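One way to do this (a sketch using matplotlib):

```python
import numpy as np
import matplotlib.pyplot as plt

# Plot the first occurrence of each digit 0-9
fig, axes = plt.subplots(1, 10, figsize=(12, 2))
for digit, ax in enumerate(axes):
    idx = np.argwhere(y_train == digit)[0][0]  # first index of this digit
    ax.imshow(X_train[idx], cmap='gray')
    ax.set_title(digit)
    ax.axis('off')
plt.show()
```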
The last check is for the dimensions of the training and test sets, which can be done relatively easily:
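For instance:

```python
# Dimensions of the training and test sets
print(X_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)
print(X_test.shape, y_test.shape)    # (10000, 28, 28) (10000,)
```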
We find that we have 60,000 training images and 10,000 test images. The next thing to do is preprocess the data.
Preprocessing the data
To run our NN, we need to preprocess the data (these steps can be performed in any order):
- First, we must flatten the 2D image arrays into 1D. We can do this either by array reshaping with `numpy.reshape()` or by using Keras' built-in layer for this, `keras.layers.Flatten`, which transforms the format of the images from a 2D array (of 28 × 28 pixels) to a 1D array of 28 × 28 = 784 pixels.
- Then we need to normalize the pixel values (give them values between 0 and 1) using the min-max transformation x := (x − x_min)/(x_max − x_min). In our case, the minimum is 0 and the maximum is 255, so the formula simplifies to x := x/255.
We now want to one-hot encode our data.
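Together, the preprocessing steps might look like this (`to_categorical` is Keras' built-in one-hot encoder):

```python
from keras.utils import to_categorical

# Flatten each 28x28 image into a 784-long vector and scale to [0, 1]
X_train = X_train.reshape(60000, 784).astype('float32') / 255
X_test = X_test.reshape(10000, 784).astype('float32') / 255

# One-hot encode the labels, e.g. 3 -> [0,0,0,1,0,0,0,0,0,0]
Y_train = to_categorical(y_train, num_classes=10)
Y_test = to_categorical(y_test, num_classes=10)
```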
Now we are finally ready to build our model!
Step 2 — Adjusting the learning rate
One of the most common optimization algorithms is Stochastic Gradient Descent (SGD). The hyperparameters that can be optimized in SGD are the `learning rate`, `momentum`, `decay`, and `nesterov`.

The learning rate controls how large the weight update is at the end of each batch, and the momentum controls how much the previous update is allowed to influence the current weight update. Decay indicates the learning rate decay over each update, and `nesterov` takes the value "True" or "False" depending on whether we want to apply Nesterov momentum.
Typical values for those hyperparameters are lr=0.01, decay=1e-6, momentum=0.9, and nesterov=True.
The learning rate hyperparameter goes into the `optimizer` function, which we will see below. Keras has a default learning rate scheduler in the `SGD` optimizer that decreases the learning rate during the stochastic gradient descent optimization. The learning rate is decreased according to this formula:

lr = lr₀ × 1/(1 + decay × epoch)
Let's implement a learning rate adaptation schedule in Keras. We'll start with SGD and a learning rate value of 0.1. We will then train the model for 60 epochs and set the decay argument to 0.0016 (≈ 0.1/60). We also include a momentum value of 0.8 since that seems to work well when using an adaptive learning rate.
Next, we build the architecture of the neural network:
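A sketch of the optimizer and a plausible architecture (the layer sizes here are my assumption, not necessarily what the notebook uses; newer Keras versions spell `lr` as `learning_rate`):

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

# SGD with lr = 0.1, decay = 0.1/60 ≈ 0.0016, momentum = 0.8
sgd = SGD(lr=0.1, decay=0.0016, momentum=0.8, nesterov=False)

# A small fully connected network for the flattened 784-pixel inputs
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(784,)))
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer=sgd,
              metrics=['accuracy'])
```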
We can now run the model and see how well it performs. This took around 20 minutes on my machine and may be faster or slower, depending on your machine.
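Running it just means calling `.fit()` and keeping the returned `History` object (the batch size is an arbitrary choice here):

```python
# Train for 60 epochs; History records loss/accuracy per epoch
history = model.fit(X_train, Y_train,
                    validation_data=(X_test, Y_test),
                    epochs=60, batch_size=64)
```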
After it has finished running, we can plot the accuracy and loss function as a function of epochs for the training and test sets to see how the network performed.
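A plotting sketch (older Keras versions record accuracy under the keys 'acc'/'val_acc', newer ones under 'accuracy'/'val_accuracy'):

```python
import matplotlib.pyplot as plt

# Training vs. test loss per epoch, straight from the History object
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()
```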
The loss function plot looks as follows:
Loss as a function of epochs.
And this is the accuracy:
We will now look at applying a customized learning rate.
Apply a custom learning rate change using LearningRateScheduler
Write a function that performs the exponential learning rate decay as indicated by the following formula:
𝑙𝑟=𝑙𝑟₀ × 𝑒^(−𝑘𝑡)
This is similar to before, so I will do this in one code block and describe the differences.
We see here that the only thing that has changed is the presence of the `exp_decay` function that we defined and its use in the `LearningRateScheduler` callback. Notice that we also chose to add a few callbacks to our model this time.
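A sketch of that change (the initial rate and the decay constant k are illustrative values):

```python
import numpy as np
from keras.callbacks import LearningRateScheduler

initial_lr = 0.1  # lr_0 in the formula above
k = 0.1           # decay constant (illustrative)

def exp_decay(epoch):
    # lr = lr_0 * exp(-k * t), with t the epoch index
    return initial_lr * np.exp(-k * epoch)

lr_schedule = LearningRateScheduler(exp_decay)
# Passed alongside the other callbacks:
# model.fit(..., callbacks=[lr_schedule, ...])
```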
We can now plot the learning rate and loss functions as functions of the number of epochs. The learning rate plot is incredibly smooth as it follows our predefined exponentially decaying function.
The loss function also looks smoother now as compared to before.
This shows you that developing a learning rate scheduler can be a helpful way to improve neural network performance.
Step 3 — Choosing an optimizer and a loss function
When constructing a model and using it to make our predictions, for example, to assign label scores to images ("cat," "plane," etc.), we want to measure our success or failure by defining a "loss" function (or objective function). The goal of optimization is to efficiently calculate the parameters/weights that minimize this loss function. Keras provides various types of loss functions.
Sometimes the "loss" function measures a "distance." We can define this distance between two data points in various ways suitable to the problem or dataset; the choice depends on the data type and the problem being tackled. For example, in natural language processing (which analyzes textual data), the Hamming distance is much more common.
Distance
- Euclidean
- Manhattan
- others, such as the Hamming distance, which measures distances between strings. For example, the Hamming distance between "carolin" and "cathrin" is 3.
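For instance, the Hamming distance is trivial to compute:

```python
def hamming(a, b):
    # Count positions at which two equal-length strings differ
    return sum(c1 != c2 for c1, c2 in zip(a, b))

print(hamming("carolin", "cathrin"))  # 3
```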
Loss functions
- MSE (for regression)
- categorical cross-entropy (for classification)
- binary cross-entropy (for classification)
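In Keras, the loss is chosen at compile time; a minimal sketch (the toy model shape is arbitrary):

```python
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([Dense(1, input_shape=(10,))])

# Regression: mean squared error
model.compile(loss='mse', optimizer='sgd')

# Multi-class classification would use loss='categorical_crossentropy',
# and binary classification loss='binary_crossentropy'.
```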