Artificial Neural Network (ANN) 7 – Overfitting & Regularization


Note

Continued from Artificial Neural Network (ANN) 6 – Training via BFGS, where we trained our neural network with the BFGS optimizer.

We saw that our neural network gave pretty good predictions of our test score based on how many hours we slept and how many hours we studied the night before.

In this article, we want to check how well our model reflects real-world data.

We want our model to fit the signal but not the noise, so that we avoid overfitting.

Overfitting.png

picture source: Python Machine Learning by Sebastian Raschka

First, we’ll work on diagnosing overfitting, and then we’ll work on fixing it.

Training inputs

Let's start with the input data for training our neural network:

ANN7-Input.png

Here is the plot for our input data, scores vs hours of sleep/study:

TestScore-SleepStudy.png

To train our model, we need to normalize the training data:

NormalizingData.png
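
In code, the normalization can be sketched like this (the data values below are placeholders for illustration; the real values are in the input table above):

import numpy as np

# (hours slept, hours studied) and the resulting test scores -- placeholder values
X = np.array(([3, 5], [5, 1], [10, 2]), dtype=float)
y = np.array(([75], [82], [93]), dtype=float)

# Scale each input column by its maximum, and the scores by the maximum possible score (100)
X = X / np.amax(X, axis=0)
y = y / 100.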

Training neural network

Let’s start training our network with the normalized data set:

TrainerRun.png
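
A minimal sketch of the training run, assuming the NeuralNetwork and trainer classes from the previous articles in this series:

NN = NeuralNetwork()
T = trainer(NN)
T.train(X, y)

# Predictions from the trained network
yHat = NN.forwardPropagation(X)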

The plot of the cost function ($J$) versus iterations looks like this:

CostFunction-Iterations-Plot.png

More data for the neural network

Now we want to generate more data using numpy.linspace():

Linspace-more-data-set.png
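
A sketch of how that denser grid of inputs might be built with numpy.linspace() and numpy.meshgrid() (the ranges and grid size are assumptions):

import numpy as np

# Dense grids over hours of sleep (0-10) and hours of study (0-5),
# normalized the same way as the training data
hoursSleep = np.linspace(0, 10, 100) / 10.
hoursStudy = np.linspace(0, 5, 100) / 5.

a, b = np.meshgrid(hoursSleep, hoursStudy)

# Stack the grid into an (N, 2) input matrix and run it through the trained network
allInputs = np.zeros((a.size, 2))
allInputs[:, 0] = a.ravel()
allInputs[:, 1] = b.ravel()
allOutputs = NN.forwardPropagation(allInputs)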

The contour plot for the newly generated data looks like this:

ContourPlotNewData2.png

3D-Plot-Code-New.png

3D-Plot-New.png

From the picture, we can see our model is overfitting, but how do we know for sure?

Data split : training and testing

In general, we want to split our data into two portions: training and testing. We won't touch the testing data while training the model; we only use it to see how we're doing, since the testing data is a stand-in for the real world.

DataSplit.png
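
A sketch of the split (the values here are placeholders; the actual numbers appear in the data-set figure further down):

import numpy as np

# Training portion: (hours slept, hours studied) -> test score
trainX = np.array(([3, 5], [5, 1], [10, 2], [6, 1.5]), dtype=float)
trainY = np.array(([75], [82], [93], [70]), dtype=float)

# Testing portion: held out and never used to update the weights
testX = np.array(([4, 5.5], [4.5, 1], [9, 2.5], [6, 2]), dtype=float)
testY = np.array(([70], [89], [85], [75]), dtype=float)

# Normalize both sets with the same scale as the training data
xMax = np.amax(trainX, axis=0)
trainX, testX = trainX / xMax, testX / xMax
trainY, testY = trainY / 100., testY / 100.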

We may want to modify the Trainer class a bit to track the testing error during training:

TrainingClassModified2.png
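
A sketch of what such a modification could look like, based on the BFGS trainer from the previous article; the testJ bookkeeping and the extra train() arguments are the assumed changes:

from scipy import optimize

class trainer(object):
    def __init__(self, N):
        # Keep a local reference to the neural network
        self.N = N

    def costFunctionWrapper(self, params, X, y):
        self.N.setParams(params)
        cost = self.N.costFunction(X, y)
        grad = self.N.computeGradients(X, y)
        return cost, grad

    def callbackF(self, params):
        # Record both the training cost and the testing cost at every iteration
        self.N.setParams(params)
        self.J.append(self.N.costFunction(self.X, self.y))
        self.testJ.append(self.N.costFunction(self.testX, self.testY))

    def train(self, trainX, trainY, testX, testY):
        # Store the data so the callback can see it
        self.X, self.y = trainX, trainY
        self.testX, self.testY = testX, testY
        self.J, self.testJ = [], []

        params0 = self.N.getParams()
        # Options follow the earlier training article; the exact values may differ
        options = {'maxiter': 200, 'disp': True}
        res = optimize.minimize(self.costFunctionWrapper, params0,
                                jac=True, method='BFGS', args=(trainX, trainY),
                                options=options, callback=self.callbackF)
        self.N.setParams(res.x)
        self.optimizationResults = res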

Let’s train our model with the new data:

TrainModelWithNewData2.png

We can plot the error on our training and testing sets as we train our model and identify the exact point at which overfitting begins.
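
With the trainer above storing the two cost histories, the plot is just a couple of matplotlib calls (a sketch, assuming the J and testJ attributes from the trainer sketch):

import matplotlib.pyplot as plt

plt.plot(T.J, label='Training')
plt.plot(T.testJ, label='Testing')
plt.grid(True)
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.legend()
plt.show()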

PlotCostDuringTraining.png

As we can see from the picture above, the cost ($\color{green}{J}$) on the real (testing) data starts to soar at around iteration 125, while the cost ($\color{blue}{J}$) on the training data keeps getting smaller and smaller.

Now we know we have an overfitting issue, but how do we fix it?

A simple rule of thumb is that we should have at least 10 times as many examples as the degrees of freedom in our model. For us, since we have 9 weights that can change, we would need 90 observations, which we certainly don't have.

Regularization

One of the most popular and effective ways of mitigating overfitting is a technique called regularization.

One way to implement regularization is to add a term to our cost function that penalizes overly complex models.

A simple but effective way to do this is to add the sum of the squares of our weights to our cost function, so that models with larger weight magnitudes cost more.

We'll also need to normalize the original part of our cost function by the number of examples, so that the ratio of the two terms does not change with the number of examples.

We're going to introduce a regularization hyperparameter, $\lambda$, that allows us to tune this relative cost: higher values of $\lambda$ impose bigger penalties for high model complexity.
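
Putting these pieces together, with $N$ standing for the number of training examples, the regularized cost function implemented in the code below is

$$J = \frac{1}{2N}\sum \left(y - \hat{y}\right)^2 + \frac{\lambda}{2}\left(\sum \left(W^{(1)}\right)^2 + \sum \left(W^{(2)}\right)^2\right)$$

and its gradients pick up an extra $\lambda W$ term:

$$\frac{\partial J}{\partial W^{(2)}} = \frac{1}{N}\,(a^{(2)})^T \delta^{(3)} + \lambda W^{(2)}, \qquad \frac{\partial J}{\partial W^{(1)}} = \frac{1}{N}\,X^T \delta^{(2)} + \lambda W^{(1)}$$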

We need to make changes to costFunction and costFunctionPrime, as well as __init__():

#New complete class, with changes:
import numpy as np

class NeuralNetwork(object):
    def __init__(self, Lambda=0):        
        #Define Hyperparameters
        self.inputLayerSize = 2
        self.outputLayerSize = 1
        self.hiddenLayerSize = 3
        
        #Weights (parameters)
        self.W1 = np.random.randn(self.inputLayerSize,self.hiddenLayerSize)
        self.W2 = np.random.randn(self.hiddenLayerSize,self.outputLayerSize)
        
        #Regularization Parameter:
        self.Lambda = Lambda
        
    def forwardPropagation(self, X):
        #Propagate inputs through network
        self.z2 = np.dot(X, self.W1)
        self.a2 = self.sigmoid(self.z2)
        self.z3 = np.dot(self.a2, self.W2)
        yHat = self.sigmoid(self.z3) 
        return yHat
        
    def sigmoid(self, z):
        #Apply sigmoid activation function to scalar, vector, or matrix
        return 1/(1+np.exp(-z))
    
    def sigmoidPrime(self,z):
        #Gradient of sigmoid
        return np.exp(-z)/((1+np.exp(-z))**2)
    
    def costFunction(self, X, y):
        #Compute cost for given X,y, use weights already stored in class.
        self.yHat = self.forwardPropagation(X)
        J = 0.5*np.sum((y-self.yHat)**2)/X.shape[0] + (self.Lambda/2)*(np.sum(self.W1**2)+np.sum(self.W2**2))
        return J
        
    def costFunctionPrime(self, X, y):
        #Compute derivative with respect to W1 and W2 for a given X and y:
        self.yHat = self.forwardPropagation(X)
        
        delta3 = np.multiply(-(y-self.yHat), self.sigmoidPrime(self.z3))
        #Add gradient of regularization term:
        dJdW2 = np.dot(self.a2.T, delta3)/X.shape[0] +  self.Lambda*self.W2
        
        delta2 = np.dot(delta3, self.W2.T)*self.sigmoidPrime(self.z2)
        #Add gradient of regularization term:
        dJdW1 = np.dot(X.T, delta2)/X.shape[0] + self.Lambda*self.W1
        
        return dJdW1, dJdW2
    
    #Helper functions for interacting with other methods/classes
    def getParams(self):
        #Get W1 and W2 Rolled into vector:
        params = np.concatenate((self.W1.ravel(), self.W2.ravel()))
        return params
    
    def setParams(self, params):
        #Set W1 and W2 using single parameter vector:
        W1_start = 0
        W1_end = self.hiddenLayerSize*self.inputLayerSize
        self.W1 = np.reshape(params[W1_start:W1_end], \
                             (self.inputLayerSize, self.hiddenLayerSize))
        W2_end = W1_end + self.hiddenLayerSize*self.outputLayerSize
        self.W2 = np.reshape(params[W1_end:W2_end], \
                             (self.hiddenLayerSize, self.outputLayerSize))
        
    def computeGradients(self, X, y):
        dJdW1, dJdW2 = self.costFunctionPrime(X, y)
        return np.concatenate((dJdW1.ravel(), dJdW2.ravel()))

Since we made some changes, let's make sure our gradients are still correct:

GradientsCheckNN.png
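
As a sketch, reusing the numerical-gradient helper idea from the gradient-checking article (the helper below is reproduced under that assumption, and the Lambda value is just an example):

import numpy as np

def computeNumericalGradient(N, X, y):
    # Perturb each weight by a small epsilon and measure the change in cost
    paramsInitial = N.getParams()
    numgrad = np.zeros(paramsInitial.shape)
    perturb = np.zeros(paramsInitial.shape)
    e = 1e-4

    for p in range(len(paramsInitial)):
        perturb[p] = e
        N.setParams(paramsInitial + perturb)
        loss2 = N.costFunction(X, y)
        N.setParams(paramsInitial - perturb)
        loss1 = N.costFunction(X, y)
        numgrad[p] = (loss2 - loss1) / (2 * e)
        perturb[p] = 0

    N.setParams(paramsInitial)
    return numgrad

NN = NeuralNetwork(Lambda=0.0001)
numgrad = computeNumericalGradient(NN, trainX, trainY)
grad = NN.computeGradients(trainX, trainY)

# This ratio should be very small (on the order of 1e-8) if the gradients are correct
print(np.linalg.norm(grad - numgrad) / np.linalg.norm(grad + numgrad))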

Let’s train our model again.

Here is the data set we’re going to use:

training-testing-data-numbers.png

training-testing-data-plot.png
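
Training with regularization is then the same call as before, just with a nonzero $\lambda$ (the value used here is an assumption for illustration):

NN = NeuralNetwork(Lambda=0.0001)
T = trainer(NN)
T.train(trainX, trainY, testX, testY)
# Re-plot T.J and T.testJ as before to compare the training and testing costs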

RegularizedCostFunction.png

Now our training and testing errors are much closer, which indicates that we have succeeded in reducing overfitting on this dataset.

Let’s see our contour plot for test scores against sleep/study hours:

ContourPlotWithRegularization.png

3-D plot:

3D-Plot-With-Regularization.png

We see that the fit is still good, but our model is no longer so interested in exactly fitting our data points.

To reduce the overfitting further, we may want to increase the regularization parameter, $\lambda$.

Here is the plot of the 6 hidden-layer weights ($W^{(1)}$) and the 3 output-layer weights ($W^{(2)}$) of our neural network:

Backpropagation-Weight-Updates2.png

Note: this weight-update plot was made later from a separate run, so it does not correspond exactly to the pictures in the previous section, though it shows the general trend of how the weights are updated over the iterations.

To record the weights $W$ during training, we need to modify the highlighted lines in our Trainer class as shown below:

New-TrainerClass.png
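
As a sketch of that kind of change, building on the trainer sketch above (the class and attribute names here are assumptions), the weights can be captured by storing a copy of the parameter vector at every callback:

class trainerRecordingWeights(trainer):
    # Same trainer as before, but it also records the full weight vector at every iteration
    def train(self, trainX, trainY, testX, testY):
        self.allParams = []
        super().train(trainX, trainY, testX, testY)

    def callbackF(self, params):
        super().callbackF(params)
        self.allParams.append(params.copy())

After training, np.array(T.allParams) has one column per weight (6 for $W^{(1)}$ and 3 for $W^{(2)}$), so passing it straight to matplotlib's plot() draws one curve per weight.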

Github Repository

Please visit Github: Artificial-Neural-Networks-with-Jupyter.

Next:

Artificial Neural Network (ANN) 8 – Deep Learning I : Image Recognition (Image uploading)

Artificial Neural Networks (ANN)

[Note]

1. Introduction

2. Forward Propagation

3. Gradient Descent

4. Backpropagation of Errors

5. Checking gradient

6. Training via BFGS

7. Overfitting & Regularization

8. Deep Learning I : Image Recognition (Image uploading)

9. Deep Learning II : Image Recognition (Image classification)

10. Deep Learning III : Theano, TensorFlow, and Keras

[Note] Sources are available at Github – Jupyter notebook files