How to Grid Search Hyperparameters for Deep Learning Models in Python with Keras - MachineLearningMastery.com

Last Updated on August 4, 2022

Hyperparameter optimization is a big part of deep learning.

The reason is that neural networks are notoriously difficult to configure, and a lot of parameters need to be set. On top of that, individual models can be very slow to train.

In this post, you will discover how to use the grid search capability from the scikit-learn Python machine learning library to tune the hyperparameters of Keras’s deep learning models.

After reading this post, you will know:

How to wrap Keras models for use in scikit-learn and how to use grid search
How to grid search common neural network parameters, such as learning rate, dropout rate, epochs, and number of neurons
How to define your own hyperparameter tuning experiments on your own projects

Kick-start your project with my new book Deep Learning With Python, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Aug/2016: First published
Update Nov/2016: Fixed minor issue in displaying grid search results in code examples
Update Oct/2016: Updated examples for Keras 1.1.0, TensorFlow 0.10.0 and scikit-learn v0.18
Update Mar/2017: Updated example for Keras 2.0.2, TensorFlow 1.0.1 and Theano 0.9.0
Update Sept/2017: Updated example to use Keras 2 “epochs” instead of Keras 1 “nb_epochs”
Update March/2018: Added alternate link to download the dataset
Update Oct/2019: Updated for Keras 2.3.0 API
Update Jul/2022: Updated for TensorFlow/Keras and SciKeras 0.8

Overview

In this post, you will discover how you can use the scikit-learn grid search capability. You will be given a suite of examples that you can copy and paste into your own project as a starting point.

Below is a list of the topics this post will cover:

How to use Keras models in scikit-learn
How to use grid search in scikit-learn
How to tune batch size and training epochs
How to tune optimization algorithms
How to tune learning rate and momentum
How to tune network weight initialization
How to tune activation functions
How to tune dropout regularization
How to tune the number of neurons in the hidden layer

How to Use Keras Models in scikit-learn

Keras models can be used in scikit-learn by wrapping them with the KerasClassifier or KerasRegressor class from the module SciKeras. You may need to run the command pip install scikeras first to install the module.

To use these wrappers, you must define a function that creates and returns your Keras sequential model, then pass this function to the model argument when constructing the KerasClassifier class.

For example:

def create_model():
    ...
    return model

model = KerasClassifier(model=create_model)

The constructor for the KerasClassifier class can take default arguments that are passed on to the calls to model.fit(), such as the number of epochs and the batch size.

For example:

def create_model():
    ...
    return model

model = KerasClassifier(model=create_model, epochs=10)

The constructor for the KerasClassifier class can also take new arguments that can be passed to your custom create_model() function. These new arguments must also be defined in the signature of your create_model() function with default parameters.

For example:

def create_model(dropout_rate=0.0):
    ...
    return model

model = KerasClassifier(model=create_model, dropout_rate=0.2)

You can learn more about these from the SciKeras documentation.

How to Use Grid Search in scikit-learn

Grid search is a model hyperparameter optimization technique.

In scikit-learn, this technique is provided in the GridSearchCV class.

When constructing this class, you must provide a dictionary of hyperparameters to evaluate in the param_grid argument. This is a map of the model parameter name and an array of values to try.

By default, GridSearchCV uses the estimator's own score() method, which is accuracy for classifiers, but other scores can be specified in the scoring argument of the GridSearchCV constructor.

By default, the grid search will only use one thread. By setting the n_jobs argument in the GridSearchCV constructor to -1, the process will use all cores on your machine. However, sometimes this may interfere with the main neural network training process.

The GridSearchCV process will then construct and evaluate one model for each combination of parameters. Cross validation is used to evaluate each individual model. The examples in this post use 3-fold cross validation, set through the cv argument of the GridSearchCV constructor; if cv is left unset, scikit-learn 0.22 and later default to 5-fold.

Below is an example of defining a simple grid search:

param_grid = dict(epochs=[10, 20, 30])
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X, Y)

Once completed, you can access the outcome of the grid search in the result object returned from grid.fit(). The best_score_ member provides access to the best score observed during the optimization procedure, and the best_params_ describes the combination of parameters that achieved the best results.

You can learn more about the GridSearchCV class in the scikit-learn API documentation.
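To get a feel for how quickly the cost of a grid grows, the sketch below expands a parameter grid into its individual combinations using only the standard library. It is an illustration of the idea, not scikit-learn's implementation; the helper name expand_grid is made up for this example, and the grid values are the batch sizes and epochs that appear in the results later in this post.

```python
from itertools import product

def expand_grid(param_grid):
    """Return one dict per combination of values in param_grid."""
    names = sorted(param_grid)
    combos = product(*(param_grid[name] for name in names))
    return [dict(zip(names, values)) for values in combos]

# Grid values taken from the batch size / epochs example in this post.
param_grid = {"batch_size": [10, 20, 40, 60, 80, 100], "epochs": [10, 50, 100]}
candidates = expand_grid(param_grid)
cv_folds = 3  # the examples in this post use 3-fold cross validation

# Each candidate is trained once per fold.
total_fits = len(candidates) * cv_folds
print(len(candidates), "candidates,", total_fits, "model fits")
```

This prints 18 candidates and 54 model fits. With the default refit=True, GridSearchCV also trains the best configuration once more on the full dataset, so every extra value added to a list multiplies the training time.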

Problem Description

Now that you know how to use Keras models with scikit-learn and how to use grid search in scikit-learn, let’s look at a bunch of examples.

All examples will be demonstrated on a small standard machine learning dataset called the Pima Indians onset of diabetes classification dataset. This is a small dataset with all numerical attributes that is easy to work with.

Download the dataset and place it in your current working directory with the name pima-indians-diabetes.csv.

As you proceed through the examples in this post, you will aggregate the best parameters. This is not the best way to grid search because parameters can interact, but it is good for demonstration purposes.
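The toy example below shows why tuning one parameter at a time can miss the best combination when parameters interact. The scores are made up purely for illustration (they are not measured results), and the names toy_scores, learning_rates, and momenta are hypothetical.

```python
from itertools import product

# Made-up validation scores, for illustration only (not measured results).
toy_scores = {
    (0.01, 0.0): 0.66, (0.01, 0.9): 0.74,
    (0.1, 0.0): 0.70, (0.1, 0.9): 0.60,
}
learning_rates = [0.01, 0.1]
momenta = [0.0, 0.9]

# One parameter at a time: tune the learning rate with momentum fixed at 0.0,
# then tune momentum while keeping the chosen learning rate.
best_lr = max(learning_rates, key=lambda lr: toy_scores[(lr, 0.0)])
best_m = max(momenta, key=lambda m: toy_scores[(best_lr, m)])
sequential_choice = (best_lr, best_m)

# Joint grid: evaluate every combination of both parameters.
joint_choice = max(product(learning_rates, momenta), key=toy_scores.get)

print("sequential:", sequential_choice, toy_scores[sequential_choice])
print("joint:", joint_choice, toy_scores[joint_choice])
```

In this toy table, the one-at-a-time search settles on (0.1, 0.0) with a score of 0.70, while the joint grid finds (0.01, 0.9) with 0.74. The joint search costs more fits, which is the trade-off this post accepts for the sake of short demonstrations.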

Note on Parallelizing Grid Search

All examples are configured to use parallelism (n_jobs=-1).

If you get an error like the one below:

INFO (theano.gof.compilelock): Waiting for existing lock by process ‘55614’ (I am process ‘55613’)

INFO (theano.gof.compilelock): To manually release the lock, delete …

Kill the process and change the code to not perform the grid search in parallel; set n_jobs=1.

How to Tune Batch Size and Number of Epochs

In this first simple example, you will look at tuning the batch size and number of epochs used when fitting the network.

The batch size in iterative gradient descent is the number of patterns shown to the network before the weights are updated. It is also an optimization in the training of the network, defining how many patterns to read at a time and keep in memory.

The number of epochs is the number of times the entire training dataset is shown to the network during training. Some networks are sensitive to the batch size, such as LSTM recurrent neural networks and Convolutional Neural Networks.

Here you will evaluate a suite of different mini-batch sizes (10, 20, 40, 60, 80, and 100) combined with 10, 50, and 100 training epochs.
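Batch size and epochs together determine how many weight updates the network receives. The short calculation below makes that relationship concrete. It assumes the 768-row dataset and 3-fold cross validation, so each fold trains on 512 rows; the function name weight_updates is made up for this sketch.

```python
import math

def weight_updates(n_train, batch_size, epochs):
    """Total number of weight updates: batches per epoch times epochs."""
    batches_per_epoch = math.ceil(n_train / batch_size)
    return batches_per_epoch * epochs

# Assumption: 768 rows with 3-fold cross validation leaves 512 training rows per fold.
n_train = 768 * 2 // 3

for batch_size in [10, 20, 40, 60, 80, 100]:
    print(batch_size, weight_updates(n_train, batch_size, epochs=100))
```

Under these assumptions, a batch size of 10 gives 5200 updates over 100 epochs, while a batch size of 100 gives only 600. Larger batches finish an epoch faster but may need more epochs to reach the same number of updates.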

The full code listing is provided below:

# Use scikit-learn to grid search the batch size and epochs
import numpy as np
import tensorflow as tf
from sklearn.model_selection import GridSearchCV
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from scikeras.wrappers import KerasClassifier
# Function to create model, required for KerasClassifier
def create_model():
    # create model
    model = Sequential()
    model.add(Dense(12, input_shape=(8,), activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
# fix random seed for reproducibility
seed = 7
tf.random.set_seed(seed)
# load dataset
dataset = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(model=create_model, verbose=0)
# define the grid search parameters
batch_size = [10, 20, 40, 60, 80, 100]
epochs = [10, 50, 100]
param_grid = dict(batch_size=batch_size, epochs=epochs)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running this example produces the following output:

Best: 0.705729 using {‘batch_size’: 10, ‘epochs’: 100}

0.597656 (0.030425) with: {‘batch_size’: 10, ‘epochs’: 10}

0.686198 (0.017566) with: {‘batch_size’: 10, ‘epochs’: 50}

0.705729 (0.017566) with: {‘batch_size’: 10, ‘epochs’: 100}

0.494792 (0.009207) with: {‘batch_size’: 20, ‘epochs’: 10}

0.675781 (0.017758) with: {‘batch_size’: 20, ‘epochs’: 50}

0.683594 (0.011049) with: {‘batch_size’: 20, ‘epochs’: 100}

0.535156 (0.053274) with: {‘batch_size’: 40, ‘epochs’: 10}

0.622396 (0.009744) with: {‘batch_size’: 40, ‘epochs’: 50}

0.671875 (0.019918) with: {‘batch_size’: 40, ‘epochs’: 100}

0.592448 (0.042473) with: {‘batch_size’: 60, ‘epochs’: 10}

0.660156 (0.041707) with: {‘batch_size’: 60, ‘epochs’: 50}

0.674479 (0.006639) with: {‘batch_size’: 60, ‘epochs’: 100}

0.476562 (0.099896) with: {‘batch_size’: 80, ‘epochs’: 10}

0.608073 (0.033197) with: {‘batch_size’: 80, ‘epochs’: 50}

0.660156 (0.011500) with: {‘batch_size’: 80, ‘epochs’: 100}

0.615885 (0.015073) with: {‘batch_size’: 100, ‘epochs’: 10}

0.617188 (0.039192) with: {‘batch_size’: 100, ‘epochs’: 50}

0.632812 (0.019918) with: {‘batch_size’: 100, ‘epochs’: 100}

You can see that the batch size of 10 and 100 epochs achieved the best result of about 70% accuracy.

How to Tune the Training Optimization Algorithm

Keras offers a suite of different state-of-the-art optimization algorithms.

In this example, you will tune the optimization algorithm used to train the network, each with default parameters.

This is an odd example because often, you will choose one approach a priori and instead focus on tuning its parameters on your problem (see the next example).

Here, you will evaluate the suite of optimization algorithms supported by the Keras API.

The full code listing is provided below:

# Use scikit-learn to grid search the optimization algorithm
import numpy as np
import tensorflow as tf
from sklearn.model_selection import GridSearchCV
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from scikeras.wrappers import KerasClassifier
# Function to create model, required for KerasClassifier
def create_model():
    # create model
    model = Sequential()
    model.add(Dense(12, input_shape=(8,), activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # return model without compile
    return model
# fix random seed for reproducibility
seed = 7
tf.random.set_seed(seed)
# load dataset
dataset = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(model=create_model, loss="binary_crossentropy", epochs=100, batch_size=10, verbose=0)
# define the grid search parameters
optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
param_grid = dict(optimizer=optimizer)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Note that the create_model() function defined above does not return a compiled model like the one in the previous example. This is because the optimizer for a Keras model is set in the compile() call; hence it is better to leave compilation to the KerasClassifier wrapper so that GridSearchCV can vary the optimizer. Also, note that you specified loss="binary_crossentropy" in the wrapper, as the loss must also be set in the compile() call.

Running this example produces the following output:

Best: 0.697917 using {‘optimizer’: ‘Adam’}

0.674479 (0.033804) with: {‘optimizer’: ‘SGD’}

0.649740 (0.040386) with: {‘optimizer’: ‘RMSprop’}

0.595052 (0.032734) with: {‘optimizer’: ‘Adagrad’}

0.348958 (0.001841) with: {‘optimizer’: ‘Adadelta’}

0.697917 (0.038051) with: {‘optimizer’: ‘Adam’}

0.652344 (0.019918) with: {‘optimizer’: ‘Adamax’}

0.684896 (0.011201) with: {‘optimizer’: ‘Nadam’}

The KerasClassifier wrapper will not compile your model again if the model is already compiled. Hence the other way to run GridSearchCV is to set the optimizer as an argument to the create_model() function, which returns an appropriately compiled model like the following:

# Use scikit-learn to grid search the optimization algorithm
import numpy as np
import tensorflow as tf
from sklearn.model_selection import GridSearchCV
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from scikeras.wrappers import KerasClassifier
# Function to create model, required for KerasClassifier
def create_model(optimizer='adam'):
    # create model
    model = Sequential()
    model.add(Dense(12, input_shape=(8,), activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model
# fix random seed for reproducibility
seed = 7
tf.random.set_seed(seed)
# load dataset
dataset = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(model=create_model, epochs=100, batch_size=10, verbose=0)
# define the grid search parameters
optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
param_grid = dict(model__optimizer=optimizer)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Note that in the above, you have the prefix model__ in the parameter dictionary param_grid. This is required for the KerasClassifier in the SciKeras module to make clear that the parameter needs to route into the create_model() function as arguments, rather than some parameter to set up in compile() or fit(). See also the routed parameter section of SciKeras documentation.
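The idea behind the prefixes can be pictured with a small standard-library sketch that groups a flat parameter dictionary by its double-underscore prefix. This is only an illustration of the naming convention, not SciKeras code; the helper split_routed_params and the sample parameter names are assumptions chosen to mirror the examples in this post.

```python
def split_routed_params(params):
    """Group parameters by their double-underscore prefix.

    Illustrative helper only; this is not how SciKeras is implemented.
    """
    routed = {"model": {}, "optimizer": {}, "other": {}}
    for name, value in params.items():
        prefix, sep, rest = name.partition("__")
        if sep and prefix in ("model", "optimizer"):
            routed[prefix][rest] = value
        else:
            routed["other"][name] = value
    return routed

# Sample parameters in the style used by the grid searches in this post.
params = {
    "model__init_mode": "uniform",
    "optimizer__learning_rate": 0.01,
    "epochs": 100,
    "batch_size": 10,
}
routed = split_routed_params(params)
print(routed["model"])      # arguments meant for create_model()
print(routed["optimizer"])  # arguments meant for the optimizer
print(routed["other"])      # everything else, such as fit() settings
```

Thinking of the prefix as an address makes it easier to decide what to write in param_grid: use model__ for anything your create_model() function accepts and no prefix for wrapper settings such as epochs and batch_size.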

Running this example produces the following output:

Best: 0.697917 using {‘model__optimizer’: ‘Adam’}

0.636719 (0.019401) with: {‘model__optimizer’: ‘SGD’}

0.683594 (0.020915) with: {‘model__optimizer’: ‘RMSprop’}

0.585938 (0.038670) with: {‘model__optimizer’: ‘Adagrad’}

0.518229 (0.120624) with: {‘model__optimizer’: ‘Adadelta’}

0.697917 (0.049445) with: {‘model__optimizer’: ‘Adam’}

0.652344 (0.027805) with: {‘model__optimizer’: ‘Adamax’}

0.686198 (0.012890) with: {‘model__optimizer’: ‘Nadam’}

The results suggest that the Adam optimization algorithm is the best with a score of about 70% accuracy.

How to Tune Learning Rate and Momentum

It is common to pre-select an optimization algorithm to train your network and tune its parameters.

By far, the most common optimization algorithm is plain old Stochastic Gradient Descent (SGD) because it is so well understood. In this example, you will look at optimizing the SGD learning rate and momentum parameters.

The learning rate controls how much to update the weight at the end of each batch, and the momentum controls how much to let the previous update influence the current weight update.
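As a rough picture of how the two parameters act, here is a minimal sketch of the classical momentum update applied to a one-dimensional toy problem. It follows the update rule described in the Keras SGD documentation (velocity = momentum * velocity - learning_rate * gradient), but the functions and the toy objective f(w) = w**2 are assumptions made for illustration.

```python
def sgd_momentum_step(weight, velocity, gradient, learning_rate, momentum):
    """One SGD update with classical momentum."""
    velocity = momentum * velocity - learning_rate * gradient
    return weight + velocity, velocity

def minimize(learning_rate, momentum, steps=50):
    """Minimize the toy objective f(w) = w**2 from w = 1.0; return the final weight."""
    weight, velocity = 1.0, 0.0
    for _ in range(steps):
        gradient = 2.0 * weight  # derivative of w**2
        weight, velocity = sgd_momentum_step(
            weight, velocity, gradient, learning_rate, momentum)
    return weight

for momentum in [0.0, 0.5, 0.9]:
    final = minimize(learning_rate=0.01, momentum=momentum)
    print(momentum, abs(final))
```

On this toy problem, a momentum of 0.5 ends closer to the minimum than a momentum of 0.0 after the same number of steps, because part of each previous update is carried forward. Real networks are far less well behaved, which is why the values are searched rather than assumed.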

You will try a suite of small standard learning rates (0.001, 0.01, 0.1, 0.2, and 0.3) and momentum values from 0.0 to 0.8 in steps of 0.2, as well as 0.9 (because it can be a popular value in practice). In Keras, the way to set the learning rate and momentum is the following:

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.2)

In the SciKeras wrapper, you will route the parameters to the optimizer with the prefix optimizer__.

Generally, it is a good idea to also include the number of epochs in an optimization like this as there is a dependency between the amount of learning per batch (learning rate), the number of updates per epoch (batch size), and the number of epochs.

The full code listing is provided below:

# Use scikit-learn to grid search the learning rate and momentum
import numpy as np
import tensorflow as tf
from sklearn.model_selection import GridSearchCV
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
from scikeras.wrappers import KerasClassifier
# Function to create model, required for KerasClassifier
def create_model():
    # create model
    model = Sequential()
    model.add(Dense(12, input_shape=(8,), activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    return model
# fix random seed for reproducibility
seed = 7
tf.random.set_seed(seed)
# load dataset
dataset = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(model=create_model, loss="binary_crossentropy", optimizer="SGD", epochs=100, batch_size=10, verbose=0)
# define the grid search parameters
learn_rate = [0.001, 0.01, 0.1, 0.2, 0.3]
momentum = [0.0, 0.2, 0.4, 0.6, 0.8, 0.9]
param_grid = dict(optimizer__learning_rate=learn_rate, optimizer__momentum=momentum)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Running this example produces the following output:

Best: 0.686198 using {‘optimizer__learning_rate’: 0.001, ‘optimizer__momentum’: 0.0}

0.686198 (0.036966) with: {‘optimizer__learning_rate’: 0.001, ‘optimizer__momentum’: 0.0}

0.651042 (0.009744) with: {‘optimizer__learning_rate’: 0.001, ‘optimizer__momentum’: 0.2}

0.652344 (0.038670) with: {‘optimizer__learning_rate’: 0.001, ‘optimizer__momentum’: 0.4}

0.656250 (0.065907) with: {‘optimizer__learning_rate’: 0.001, ‘optimizer__momentum’: 0.6}

0.671875 (0.022326) with: {‘optimizer__learning_rate’: 0.001, ‘optimizer__momentum’: 0.8}

0.661458 (0.015733) with: {‘optimizer__learning_rate’: 0.001, ‘optimizer__momentum’: 0.9}

0.665365 (0.021236) with: {‘optimizer__learning_rate’: 0.01, ‘optimizer__momentum’: 0.0}

0.671875 (0.003189) with: {‘optimizer__learning_rate’: 0.01, ‘optimizer__momentum’: 0.2}

0.640625 (0.008438) with: {‘optimizer__learning_rate’: 0.01, ‘optimizer__momentum’: 0.4}

0.648438 (0.003189) with: {‘optimizer__learning_rate’: 0.01, ‘optimizer__momentum’: 0.6}

0.649740 (0.003683) with: {‘optimizer__learning_rate’: 0.01, ‘optimizer__momentum’: 0.8}

0.651042 (0.001841) with: {‘optimizer__learning_rate’: 0.01, ‘optimizer__momentum’: 0.9}

0.651042 (0.001841) with: {‘optimizer__learning_rate’: 0.1, ‘optimizer__momentum’: 0.0}

0.651042 (0.001841) with: {‘optimizer__learning_rate’: 0.1, ‘optimizer__momentum’: 0.2}

0.651042 (0.001841) with: {‘optimizer__learning_rate’: 0.1, ‘optimizer__momentum’: 0.4}

0.651042 (0.001841) with: {‘optimizer__learning_rate’: 0.1, ‘optimizer__momentum’: 0.6}

0.651042 (0.001841) with: {‘optimizer__learning_rate’: 0.1, ‘optimizer__momentum’: 0.8}

0.651042 (0.001841) with: {‘optimizer__learning_rate’: 0.1, ‘optimizer__momentum’: 0.9}

0.651042 (0.001841) with: {‘optimizer__learning_rate’: 0.2, ‘optimizer__momentum’: 0.0}

0.651042 (0.001841) with: {‘optimizer__learning_rate’: 0.2, ‘optimizer__momentum’: 0.2}

0.651042 (0.001841) with: {‘optimizer__learning_rate’: 0.2, ‘optimizer__momentum’: 0.4}

0.651042 (0.001841) with: {‘optimizer__learning_rate’: 0.2, ‘optimizer__momentum’: 0.6}

0.651042 (0.001841) with: {‘optimizer__learning_rate’: 0.2, ‘optimizer__momentum’: 0.8}

0.651042 (0.001841) with: {‘optimizer__learning_rate’: 0.2, ‘optimizer__momentum’: 0.9}

0.652344 (0.003189) with: {‘optimizer__learning_rate’: 0.3, ‘optimizer__momentum’: 0.0}

0.651042 (0.001841) with: {‘optimizer__learning_rate’: 0.3, ‘optimizer__momentum’: 0.2}

0.651042 (0.001841) with: {‘optimizer__learning_rate’: 0.3, ‘optimizer__momentum’: 0.4}

0.651042 (0.001841) with: {‘optimizer__learning_rate’: 0.3, ‘optimizer__momentum’: 0.6}

0.651042 (0.001841) with: {‘optimizer__learning_rate’: 0.3, ‘optimizer__momentum’: 0.8}

0.651042 (0.001841) with: {‘optimizer__learning_rate’: 0.3, ‘optimizer__momentum’: 0.9}

You can see that SGD is not very good on this problem; nevertheless, the best results were achieved using a learning rate of 0.001 and a momentum of 0.0 with an accuracy of about 68%.

How to Tune Network Weight Initialization

Neural network weight initialization used to be simple: use small random values.

Now there is a suite of different techniques to choose from. Keras provides a laundry list.

In this example, you will look at tuning the selection of network weight initialization by evaluating all the available techniques.

You will use the same weight initialization method on each layer. Ideally, it may be better to use different weight initialization schemes according to the activation function used on each layer. In the example below, you will use a rectifier for the hidden layer and a sigmoid for the output layer because the predictions are binary. The weight initialization is now an argument to the create_model() function, so you need to use the model__ prefix to ask the KerasClassifier to route the parameter to the model creation function.

The full code listing is provided below:

# Use scikit-learn to grid search the weight initialization
import numpy as np
import tensorflow as tf
from sklearn.model_selection import GridSearchCV
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from scikeras.wrappers import KerasClassifier
# Function to create model, required for KerasClassifier
def create_model(init_mode='uniform'):
    # create model
    model = Sequential()
    model.add(Dense(12, input_shape=(8,), kernel_initializer=init_mode, activation='relu'))
    model.add(Dense(1, kernel_initializer=init_mode, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
# fix random seed for reproducibility
seed = 7
tf.random.set_seed(seed)
# load dataset
dataset = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(model=create_model, epochs=100, batch_size=10, verbose=0)
# define the grid search parameters
init_mode = ['uniform', 'lecun_uniform', 'normal', 'zero', 'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform']
param_grid = dict(model__init_mode=init_mode)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Running this example produces the following output:

Best: 0.716146 using {‘model__init_mode’: ‘uniform’}

0.716146 (0.034987) with: {‘model__init_mode’: ‘uniform’}

0.678385 (0.029635) with: {‘model__init_mode’: ‘lecun_uniform’}

0.716146 (0.030647) with: {‘model__init_mode’: ‘normal’}

0.651042 (0.001841) with: {‘model__init_mode’: ‘zero’}

0.695312 (0.027805) with: {‘model__init_mode’: ‘glorot_normal’}

0.690104 (0.023939) with: {‘model__init_mode’: ‘glorot_uniform’}

0.647135 (0.057880) with: {‘model__init_mode’: ‘he_normal’}

0.665365 (0.026557) with: {‘model__init_mode’: ‘he_uniform’}

We can see that the best results were achieved with a uniform weight initialization scheme achieving a performance of about 72%.
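To see how some of these schemes differ in scale, the sketch below computes the sampling limits of three uniform initializers from the formulas given in the Keras initializer documentation. The layer sizes (8 inputs, 12 hidden units) are assumed for illustration, and the helper names are made up for this example.

```python
import math

def glorot_uniform_limit(fan_in, fan_out):
    """Weights are drawn from [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out))."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def he_uniform_limit(fan_in):
    """Weights are drawn from [-limit, limit] with limit = sqrt(6 / fan_in)."""
    return math.sqrt(6.0 / fan_in)

def lecun_uniform_limit(fan_in):
    """Weights are drawn from [-limit, limit] with limit = sqrt(3 / fan_in)."""
    return math.sqrt(3.0 / fan_in)

# Assumed layer sizes for illustration: 8 inputs feeding 12 hidden units.
fan_in, fan_out = 8, 12
print("glorot_uniform: %.3f" % glorot_uniform_limit(fan_in, fan_out))
print("he_uniform:     %.3f" % he_uniform_limit(fan_in))
print("lecun_uniform:  %.3f" % lecun_uniform_limit(fan_in))
```

For these sizes, the limits come out at roughly 0.548, 0.866, and 0.612. By comparison, the plain 'uniform' initializer in Keras draws from the fixed range [-0.05, 0.05] by default, so it starts with much smaller weights on a layer this small.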

How to Tune the Neuron Activation Function

The activation function controls the non-linearity of individual neurons and when to fire.

Generally, the rectifier activation function is the most popular. However, it used to be the sigmoid and the tanh functions, and these functions may still be more suitable for different problems.

In this example, you will evaluate the suite of different activation functions available in Keras. You will only use these functions in the hidden layer, as a sigmoid activation function is required in the output for the binary classification problem. Similar to the previous example, this is an argument to the create_model() function, and you will use the model__ prefix for the GridSearchCV parameter grid.

Generally, it is a good idea to prepare data to the range of the different transfer functions, which you will not do in this case.

The full code listing is provided below:

# Use scikit-learn to grid search the activation function
import numpy as np
import tensorflow as tf
from sklearn.model_selection import GridSearchCV
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from scikeras.wrappers import KerasClassifier
# Function to create model, required for KerasClassifier
def create_model(activation='relu'):
    # create model
    model = Sequential()
    model.add(Dense(12, input_shape=(8,), kernel_initializer='uniform', activation=activation))
    model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
# fix random seed for reproducibility
seed = 7
tf.random.set_seed(seed)
# load dataset
dataset = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(model=create_model, epochs=100, batch_size=10, verbose=0)
# define the grid search parameters
activation = ['softmax', 'softplus', 'softsign', 'relu', 'tanh', 'sigmoid', 'hard_sigmoid', 'linear']
param_grid = dict(model__activation=activation)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Running this example produces the following output:

Best: 0.710938 using {‘model__activation’: ‘linear’}

0.651042 (0.001841) with: {‘model__activation’: ‘softmax’}

0.703125 (0.012758) with: {‘model__activation’: ‘softplus’}

0.671875 (0.009568) with: {‘model__activation’: ‘softsign’}

0.710938 (0.024080) with: {‘model__activation’: ‘relu’}

0.669271 (0.019225) with: {‘model__activation’: ‘tanh’}

0.675781 (0.011049) with: {‘model__activation’: ‘sigmoid’}

0.677083 (0.004872) with: {‘model__activation’: ‘hard_sigmoid’}

0.710938 (0.034499) with: {‘model__activation’: ‘linear’}

Surprisingly (to me at least), the "linear" activation function achieved the best results with an accuracy of about 71%, level with "relu" at the printed precision.
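For reference, several of the activation functions compared above can be written in a few lines of plain Python. These are the standard textbook definitions, shown as a sketch for single values rather than as the Keras implementations.

```python
import math

def relu(x):
    """Rectifier: pass positive values through, clamp negatives to zero."""
    return max(0.0, x)

def sigmoid(x):
    """Logistic function, squashing x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):
    """Smooth approximation of the rectifier: log(1 + e^x)."""
    return math.log1p(math.exp(x))

def softsign(x):
    """Squashes x into (-1, 1) using x / (1 + |x|)."""
    return x / (1.0 + abs(x))

def linear(x):
    """Identity: the neuron output is just the weighted sum."""
    return x

for name, fn in [("relu", relu), ("sigmoid", sigmoid), ("tanh", math.tanh),
                 ("softplus", softplus), ("softsign", softsign), ("linear", linear)]:
    print(name, [round(fn(x), 3) for x in (-2.0, 0.0, 2.0)])
```

Note that with a "linear" hidden layer followed by a sigmoid output, the network can only represent a linear decision boundary, much like logistic regression, so a good score for "linear" suggests this dataset is fairly well separated by a linear boundary.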

How to Tune Dropout Regularization

In this example, you will look at tuning the dropout rate for regularization in an effort to limit overfitting and improve the model’s ability to generalize.

For the best results, dropout is best combined with a weight constraint such as the max norm constraint. This involves fitting both the dropout percentage and the weight constraint. You will try dropout percentages between 0.0 and 0.9 (1.0 does not make sense) and MaxNorm weight constraint values between 1.0 and 5.0.
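The two ideas can be sketched in plain Python to show what each parameter controls. This is a simplified illustration, not the Keras implementation: dropout() applies inverted dropout to a list of activations, and max_norm() rescales one unit's incoming weight vector; both helper names are made up for this example.

```python
import math
import random

def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, scale the rest by 1 / (1 - rate)."""
    if rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

def max_norm(weights, max_value):
    """Rescale a weight vector so that its L2 norm does not exceed max_value."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_value:
        return list(weights)
    return [w * max_value / norm for w in weights]

rng = random.Random(7)  # fixed seed so the sketch is repeatable
activations = [1.0] * 10
dropped = dropout(activations, rate=0.2, rng=rng)
print(dropped)  # each entry is either 0.0 or 1.25

constrained = max_norm([3.0, 4.0], max_value=3.0)
print(constrained)  # the original norm of 5.0 is scaled down to 3.0
```

The scaling by 1 / (1 - rate) is also why a rate of 1.0 does not make sense: every activation would be dropped and the scale factor would be undefined.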

The full code listing is provided below.

# Use scikit-learn to grid search the dropout rate
import numpy as np
import tensorflow as tf
from sklearn.model_selection import GridSearchCV
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.constraints import MaxNorm
from scikeras.wrappers import KerasClassifier
# Function to create model, required for KerasClassifier
def create_model(dropout_rate, weight_constraint):
    # create model
    model = Sequential()
    model.add(Dense(12, input_shape=(8,), kernel_initializer='uniform', activation='linear', kernel_constraint=MaxNorm(weight_constraint)))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
# fix random seed for reproducibility
seed = 7
tf.random.set_seed(seed)
# load dataset
dataset = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
print(dataset.dtype, dataset.shape)
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(model=create_model, epochs=100, batch_size=10, verbose=0)
# define the grid search parameters
weight_constraint = [1.0, 2.0, 3.0, 4.0, 5.0]
dropout_rate = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
param_grid = dict(model__dropout_rate=dropout_rate, model__weight_constraint=weight_constraint)
#param_grid = dict(model__dropout_rate=dropout_rate)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Running this example produces the following output.

Best: 0.766927 using {'model__dropout_rate': 0.2, 'model__weight_constraint': 3.0}
0.729167 (0.021710) with: {'model__dropout_rate': 0.0, 'model__weight_constraint': 1.0}
0.746094 (0.022326) with: {'model__dropout_rate': 0.0, 'model__weight_constraint': 2.0}
0.753906 (0.022097) with: {'model__dropout_rate': 0.0, 'model__weight_constraint': 3.0}
0.750000 (0.012758) with: {'model__dropout_rate': 0.0, 'model__weight_constraint': 4.0}
0.751302 (0.012890) with: {'model__dropout_rate': 0.0, 'model__weight_constraint': 5.0}
0.739583 (0.026748) with: {'model__dropout_rate': 0.1, 'model__weight_constraint': 1.0}
0.733073 (0.001841) with: {'model__dropout_rate': 0.1, 'model__weight_constraint': 2.0}
0.753906 (0.030425) with: {'model__dropout_rate': 0.1, 'model__weight_constraint': 3.0}
0.748698 (0.031466) with: {'model__dropout_rate': 0.1, 'model__weight_constraint': 4.0}
0.753906 (0.030425) with: {'model__dropout_rate': 0.1, 'model__weight_constraint': 5.0}
0.760417 (0.024360) with: {'model__dropout_rate': 0.2, 'model__weight_constraint': 1.0}
nan (nan) with: {'model__dropout_rate': 0.2, 'model__weight_constraint': 2.0}
0.766927 (0.021710) with: {'model__dropout_rate': 0.2, 'model__weight_constraint': 3.0}
0.755208 (0.010253) with: {'model__dropout_rate': 0.2, 'model__weight_constraint': 4.0}
0.750000 (0.008438) with: {'model__dropout_rate': 0.2, 'model__weight_constraint': 5.0}
0.725260 (0.015073) with: {'model__dropout_rate': 0.3, 'model__weight_constraint': 1.0}
0.738281 (0.008438) with: {'model__dropout_rate': 0.3, 'model__weight_constraint': 2.0}
0.748698 (0.003683) with: {'model__dropout_rate': 0.3, 'model__weight_constraint': 3.0}
0.740885 (0.023073) with: {'model__dropout_rate': 0.3, 'model__weight_constraint': 4.0}
0.735677 (0.008027) with: {'model__dropout_rate': 0.3, 'model__weight_constraint': 5.0}
0.743490 (0.009207) with: {'model__dropout_rate': 0.4, 'model__weight_constraint': 1.0}
0.751302 (0.006639) with: {'model__dropout_rate': 0.4, 'model__weight_constraint': 2.0}
0.750000 (0.024910) with: {'model__dropout_rate': 0.4, 'model__weight_constraint': 3.0}
0.744792 (0.030314) with: {'model__dropout_rate': 0.4, 'model__weight_constraint': 4.0}
0.751302 (0.010253) with: {'model__dropout_rate': 0.4, 'model__weight_constraint': 5.0}
0.757812 (0.006379) with: {'model__dropout_rate': 0.5, 'model__weight_constraint': 1.0}
0.740885 (0.030978) with: {'model__dropout_rate': 0.5, 'model__weight_constraint': 2.0}
0.742188 (0.003189) with: {'model__dropout_rate': 0.5, 'model__weight_constraint': 3.0}
0.718750 (0.016877) with: {'model__dropout_rate': 0.5, 'model__weight_constraint': 4.0}
0.726562 (0.019137) with: {'model__dropout_rate': 0.5, 'model__weight_constraint': 5.0}
0.725260 (0.013279) with: {'model__dropout_rate': 0.6, 'model__weight_constraint': 1.0}
0.738281 (0.013902) with: {'model__dropout_rate': 0.6, 'model__weight_constraint': 2.0}
0.743490 (0.001841) with: {'model__dropout_rate': 0.6, 'model__weight_constraint': 3.0}
0.722656 (0.009568) with: {'model__dropout_rate': 0.6, 'model__weight_constraint': 4.0}
0.747396 (0.024774) with: {'model__dropout_rate': 0.6, 'model__weight_constraint': 5.0}
0.729167 (0.006639) with: {'model__dropout_rate': 0.7, 'model__weight_constraint': 1.0}
0.717448 (0.012890) with: {'model__dropout_rate': 0.7, 'model__weight_constraint': 2.0}
0.710938 (0.027621) with: {'model__dropout_rate': 0.7, 'model__weight_constraint': 3.0}
0.718750 (0.014616) with: {'model__dropout_rate': 0.7, 'model__weight_constraint': 4.0}
0.743490 (0.021236) with: {'model__dropout_rate': 0.7, 'model__weight_constraint': 5.0}
0.713542 (0.009207) with: {'model__dropout_rate': 0.8, 'model__weight_constraint': 1.0}
nan (nan) with: {'model__dropout_rate': 0.8, 'model__weight_constraint': 2.0}
0.721354 (0.009207) with: {'model__dropout_rate': 0.8, 'model__weight_constraint': 3.0}
0.716146 (0.009207) with: {'model__dropout_rate': 0.8, 'model__weight_constraint': 4.0}
0.716146 (0.015073) with: {'model__dropout_rate': 0.8, 'model__weight_constraint': 5.0}
0.682292 (0.018688) with: {'model__dropout_rate': 0.9, 'model__weight_constraint': 1.0}
0.696615 (0.011201) with: {'model__dropout_rate': 0.9, 'model__weight_constraint': 2.0}
0.696615 (0.026557) with: {'model__dropout_rate': 0.9, 'model__weight_constraint': 3.0}
0.694010 (0.001841) with: {'model__dropout_rate': 0.9, 'model__weight_constraint': 4.0}
0.696615 (0.022628) with: {'model__dropout_rate': 0.9, 'model__weight_constraint': 5.0}

We can see that a dropout rate of 20% and a MaxNorm weight constraint of 3 resulted in the best accuracy of about 77%. You may notice that some of the results are nan. This is probably because the inputs are not normalized, so some configurations can, by chance, end up with a degenerate model.
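If unnormalized inputs are indeed the cause, one possible remedy is to standardize each input column before training. The following is a minimal, dependency-free sketch of that rescaling. The names `standardize_columns` and `toy_rows`, and the toy values themselves, are hypothetical and introduced here only for illustration; they are not part of the example above.

```python
from statistics import fmean, pstdev


def standardize_columns(rows):
    """Rescale each column to zero mean and unit variance (z-scores)."""
    columns = list(zip(*rows))  # transpose rows into columns
    means = [fmean(col) for col in columns]
    # A constant column has zero spread; divide by 1.0 instead of 0.0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical toy rows with very different column scales
# (these are NOT taken from the Pima Indians dataset).
toy_rows = [
    [1.0, 100.0, 0.10],
    [2.0, 300.0, 0.20],
    [3.0, 500.0, 0.60],
]

scaled = standardize_columns(toy_rows)
for row in scaled:
    print(["%.3f" % value for value in row])
```

In a real grid search, scikit-learn's `StandardScaler` placed in a `Pipeline` ahead of the classifier would likely be the more practical choice, because the scaler is then fit on the training folds only.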

How to Tune the Number of Neurons in the Hidden Layer

The number of neurons in a layer is an important parameter to tune. Generally, the number of neurons in a layer controls the representational capacity of the network, at least at that point in the topology.

Also, at least in theory, a single hidden layer that is large enough can approximate any other neural network, a result known as the universal approximation theorem.
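To make the idea of capacity a little more concrete, the sketch below counts the trainable parameters (weights plus biases) of a network with one hidden layer for several hidden-layer sizes. The helper names are hypothetical, and the choice of eight inputs and one output is an assumption based on the Pima Indians dataset used in this post.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected (Dense) layer."""
    return n_inputs * n_units + n_units


def network_params(n_inputs, hidden_neurons, n_outputs=1):
    """Trainable parameters of an input -> hidden -> output network."""
    hidden = dense_params(n_inputs, hidden_neurons)
    output = dense_params(hidden_neurons, n_outputs)
    return hidden + output


N_INPUTS = 8  # assumption: eight input columns, as in the Pima Indians dataset

for neurons in [1, 5, 10, 15, 20, 25, 30]:
    print("%2d neurons -> %3d parameters" % (neurons, network_params(N_INPUTS, neurons)))
```

Dropout layers add no trainable parameters, so they are left out of the count. With these assumptions, the parameter count grows linearly with the number of hidden neurons.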

In this example, we will look at tuning the number of neurons in a single hidden layer. We will try a single neuron and then values from 5 to 30 in steps of 5.

A larger network requires more training, so ideally at least the batch size and the number of epochs should be optimized together with the number of neurons.
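A joint search over neurons, batch size, and epochs grows quickly, which is worth checking before starting a run. The sketch below expands such a grid with the standard library only and counts the resulting cross-validation fits. The batch size and epoch candidates are hypothetical values chosen for illustration, and `expand_grid` is a helper name introduced here; only the neuron counts come from this post.

```python
from itertools import product

# Hypothetical joint grid; only the neuron counts are taken from this post.
param_grid = {
    "model__neurons": [1, 5, 10, 15, 20, 25, 30],
    "batch_size": [10, 20, 40],
    "epochs": [50, 100],
}


def expand_grid(grid):
    """Return every combination in the grid as a list of dicts."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


combinations = expand_grid(param_grid)
cv_folds = 3  # 3-fold cross-validation, as used in this post

print(len(combinations), "combinations")
print(len(combinations) * cv_folds, "cross-validation fits")
print("first combination:", combinations[0])
```

A dictionary of this shape could be passed as `param_grid` to `GridSearchCV`; the point of the sketch is only that adding two modest dimensions multiplies the work several times over.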

The full code listing is provided below.

# Use scikit-learn to grid search the number of neurons
import numpy as np
import tensorflow as tf
from sklearn.model_selection import GridSearchCV
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from scikeras.wrappers import KerasClassifier
from tensorflow.keras.constraints import MaxNorm

# Function to create model, required for KerasClassifier
def create_model(neurons):
    # create model
    model = Sequential()
    model.add(Dense(neurons, input_shape=(8,), kernel_initializer='uniform', activation='linear', kernel_constraint=MaxNorm(4)))
    model.add(Dropout(0.2))
    model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# fix random seed for reproducibility
seed = 7
tf.random.set_seed(seed)
# load dataset
dataset = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(model=create_model, epochs=100, batch_size=10, verbose=0)
# define the grid search parameters
neurons = [1, 5, 10, 15, 20, 25, 30]
param_grid = dict(model__neurons=neurons)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Running this example produces the following output.

Best: 0.729167 using {'model__neurons': 30}
0.701823 (0.010253) with: {'model__neurons': 1}
0.717448 (0.011201) with: {'model__neurons': 5}
0.717448 (0.008027) with: {'model__neurons': 10}
0.720052 (0.019488) with: {'model__neurons': 15}
0.709635 (0.004872) with: {'model__neurons': 20}
0.708333 (0.003683) with: {'model__neurons': 25}
0.729167 (0.009744) with: {'model__neurons': 30}

We can see that the best results were achieved with a network with 30 neurons in the hidden layer with an accuracy of about 73%.
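The differences between these scores are small compared with their standard deviations, so it can help to rank the whole grid rather than read off the single best row. The sketch below does this for the scores printed above: it drops any nan entries (none occur in this run, but they did in the dropout search), sorts by mean accuracy, and lists the configurations whose mean lies within one standard deviation of the best. The one-standard-deviation threshold and the name `rank_results` are choices made here for illustration, not something prescribed by scikit-learn.

```python
import math

# (mean accuracy, standard deviation, neurons) copied from the output above.
results = [
    (0.701823, 0.010253, 1),
    (0.717448, 0.011201, 5),
    (0.717448, 0.008027, 10),
    (0.720052, 0.019488, 15),
    (0.709635, 0.004872, 20),
    (0.708333, 0.003683, 25),
    (0.729167, 0.009744, 30),
]


def rank_results(rows):
    """Drop rows with a nan score and sort the rest from best to worst mean."""
    valid = [row for row in rows if not math.isnan(row[0])]
    return sorted(valid, key=lambda row: row[0], reverse=True)


ranked = rank_results(results)
best_mean, best_std, best_neurons = ranked[0]

# Configurations whose mean is within one standard deviation of the best.
comparable = [n for mean, _, n in ranked if mean >= best_mean - best_std]

print("best:", best_neurons)
print("comparable:", comparable)
```

On these numbers, the 15-neuron network falls inside that band alongside the 30-neuron one, which suggests the preference for 30 neurons is not strongly supported by this run alone.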

Tips for Hyperparameter Optimization

This section lists some handy tips to consider when tuning hyperparameters of your neural network.

k-fold Cross Validation. You can see that the results from the examples in this post show some variance. 3-fold cross-validation was used, but perhaps k=5 or k=10 would be more stable. Carefully choose your cross-validation configuration to ensure your results are stable.
Review the Whole Grid. Do not just focus on the best result. Review the whole grid of results and look for trends to support configuration decisions.
Parallelize. Use all your cores if you can. Neural networks are slow to train, and we often want to try a lot of different parameters. Consider spinning up a lot of AWS instances.
Use a Sample of Your Dataset. Because networks are slow to train, try training them on a smaller sample of your training dataset, just to get an idea of general directions of parameters rather than optimal configurations.
Start with Coarse Grids. Start with coarse-grained grids and zoom into finer-grained grids once you can narrow the scope.
Do not Transfer Results. Results are generally problem specific. Try to avoid reusing favorite configurations on each new problem that you see. It is unlikely that optimal results you discover on one problem will transfer to your next project. Instead, look for broader trends like the number of layers or relationships between parameters.
Reproducibility is a Problem. Although we set the seed for the random number generator in TensorFlow, the results are not 100% reproducible. There is more to reproducibility when grid searching wrapped Keras models than is presented in this post.
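As one way to act on the coarse-grid tip, the sketch below builds a finer grid centered on the best value found by a coarse search. The helper `refine_grid`, its parameters, and the choice of five points spanning half a coarse step on either side are all assumptions made for illustration; other zooming schemes are equally reasonable.

```python
def refine_grid(best, coarse_step, points=5, lower=None, upper=None):
    """Return evenly spaced values spanning best +/- coarse_step / 2."""
    fine_step = coarse_step / (points - 1)
    start = best - coarse_step / 2
    # Round to suppress floating-point noise such as 0.15000000000000002.
    values = [round(start + i * fine_step, 6) for i in range(points)]
    # Clip to the allowed range, if one is given; clipping may create
    # duplicates, which the set() below removes.
    if lower is not None:
        values = [max(value, lower) for value in values]
    if upper is not None:
        values = [min(value, upper) for value in values]
    return sorted(set(values))


# The coarse dropout grid in this post used steps of 0.1 and found 0.2 best.
fine_dropout = refine_grid(best=0.2, coarse_step=0.1, lower=0.0, upper=0.9)
print(fine_dropout)
```

The returned list could then be used as the candidate values for a second, narrower grid search over the same parameter.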

Summary

In this post, you discovered how you can tune the hyperparameters of your deep learning networks in Python using Keras and scikit-learn.

Specifically, you learned:

How to wrap Keras models for use in scikit-learn and how to use grid search.
How to grid search a suite of different standard neural network parameters for Keras models.
How to design your own hyperparameter optimization experiments.

Do you have any experience tuning hyperparameters of large neural networks? Please share your stories below.

Do you have any questions about hyperparameter optimization of neural networks or about this post? Ask your questions in the comments and I will do my best to answer.