Neural Networks from Scratch with Python Code and Math in Detail — I

Building neural networks from scratch. From the math behind them to step-by-step implementation coding samples in Python with Google Colab

Author(s): Pratik Shukla, Roberto Iriondo

Last updated December 1, 2021

Note: In our second tutorial on neural networks, we dive in-depth into the limitations and advantages of using neural networks. We show how to implement neural nets with hidden layers and how these lead to a higher accuracy rate on our predictions, along with implementation samples in Python on Google Colab.

Figure 1: Where neural networks fit in AI, machine learning, and deep learning.

What is a neural network?

Neural networks form the base of deep learning, which is a subfield of machine learning, where the structure of the human brain inspires the algorithms. Neural networks take input data, train themselves to recognize patterns found in the data, and then predict the output for a new set of similar data. Therefore, a neural network can be thought of as the functional unit of deep learning, which mimics the behavior of the human brain to solve complex data-driven problems.

The first thing that comes to our mind when we think of “neural networks” is biology, and indeed, neural nets are inspired by our brains.

Let’s try to understand them:

Figure 2: An image representing a biological neuron | Source: Wikipedia [1]

In machine learning, a neuron’s dendrites serve as the inputs, the nucleus processes the data, and the axon forwards the calculated output. In a biological neural network, the width (thickness) of a dendrite defines the weight associated with it.


1. What is an Artificial Neural Network?

Simply put, an ANN represents interconnected input and output units in which each connection has an associated weight. During the learning phase, the network learns by adjusting these weights in order to be able to predict the correct class for input data.

For instance:

Imagine we are in a deep sleep, and suddenly our environment starts to tremble. Immediately, our brain recognizes that it is an earthquake. At once, we think of what is most valuable to us:

  • Our beloved ones.
  • Essential documents.
  • Jewelry.
  • Laptop.
  • A pencil.

Now we only have a few minutes to get out of the house, and we can only save a few things. What will our priorities be in this case?

Perhaps we are going to save our beloved ones first, and then, if time permits, we can think of the other things. What we did here was assign a weight to each of our valuables. Each valuable at that moment is an input, and its priority is the weight we assigned to it.

The same is the case with neural networks. We assign weights to different inputs and predict the output from them. However, in this case, we do not know the weight associated with each input, so we build an algorithm that calculates those weights by processing lots of input data.

2. Applications of Artificial Neural Networks:

a. Classification of data:

Based on a set of data, our trained neural network predicts whether it is a dog or a cat.

b. Anomaly detection:

Given the details of a person’s transactions, it can tell whether a transaction is fraudulent or not.

c. Speech recognition:

We can train our neural network to recognize speech patterns. Examples: Siri, Alexa, Google Assistant.

d. Audio generation:

Given audio files as input, it can generate new music based on various factors like genre and singer.

e. Time series analysis:

A well-trained neural network can predict the stock price.

f. Spell checking:

We can train a neural network that detects misspelled words and can also suggest similar words. Example: Grammarly.

g. Character recognition:

A well-trained neural network can detect handwritten characters.

h. Machine translation:

We can develop a neural network that translates one language into another language.

i. Image processing:

We can train a neural network to process an image and extract pieces of information from it.

3. General Structure of an Artificial Neural Network (ANN):

Figure 3: An artificial neural network
Figure 4: An artificial neural network with 3 layers
Figure 5: The perceptron by Frank Rosenblatt | Source: Machine Learning Department at Carnegie Mellon University

4. What is a Perceptron?

A perceptron is a neural network without any hidden layer. A perceptron only has an input layer and an output layer.

Figure 6: A perceptron

Where can we use perceptrons?

Perceptrons are useful in many scenarios. While a perceptron is mostly used for simple decision-making, perceptrons can also come together in larger computer programs to solve more complex problems.

For instance:

  1. Give access if a person is a faculty member and deny access if a person is a student.
  2. Provide entry for humans only.
  3. Implementation of logic gates [2].

Steps involved in the implementation of a neural network:

A neural network executes in 2 steps:

1. Feedforward:

In the feedforward step, we have a set of input features and some random weights. Notice that in this case, we are taking random weights, which we will then optimize using backpropagation.

2. Backpropagation:

During backpropagation, we calculate the error between the predicted output and the target output and then use an algorithm (gradient descent) to update the weight values.

Why do we need backpropagation?

While designing a neural network, first, we need to train the model and assign specific weights to each of the inputs. That weight decides how vital that feature is for our prediction. The higher the weight, the greater the importance. However, initially, we do not know the specific weight required by each input. So what we do is assign some random weights to our inputs, and our model calculates the error in prediction. Thereafter, we update our weight values and rerun the code (backpropagation). After several iterations, we get lower error values and higher accuracy.

Summarizing an Artificial Neural Network:

  1. Take inputs.
  2. Add bias (if required).
  3. Assign random weights to the input features.
  4. Run the code for training.
  5. Find the error in prediction.
  6. Update the weights with the gradient descent algorithm.
  7. Repeat the training phase with updated weights.
  8. Make predictions.
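
As a preview, here is a minimal sketch of these eight steps in Python for a one-weight model. The data, learning rate, and update rule below are illustrative assumptions, not the article’s exact values:

```python
import numpy as np

# Illustrative toy data (not the article's dataset): targets are 2x the inputs
inputs = np.array([1.0, 2.0, 3.0, 4.0])
targets = np.array([2.0, 4.0, 6.0, 8.0])

weight = np.random.randn()   # step 3: assign a random weight
bias = 0.0                   # step 2: bias (kept at 0 here)
learning_rate = 0.01

for epoch in range(100):                     # step 7: repeat with updated weights
    predictions = inputs * weight + bias     # steps 1 and 4: take inputs and train
    errors = predictions - targets           # step 5: find the error in prediction
    gradient = np.mean(2 * errors * inputs)  # step 6: gradient of the mean squared error
    weight -= learning_rate * gradient       # gradient descent update

print(weight)  # step 8: make predictions with the trained weight (≈ 2.0)
```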

Flow chart for a simple neural network:

Figure 7: Artificial Neural Network (ANN) Basic Flow Chart

The training phase of a neural network:

Figure 8: Training phase of a neural network

5. Perceptron Example:

Below is a simple perceptron model with four inputs and one output.

Figure 9: A simple perceptron
Figure 10: A set of data

What we have here are the input values and their corresponding target output values. So what we are going to do is assign some weight to the inputs and then calculate their predicted output values.

In this example we are going to calculate the output by the following formula:

Figure 11: Formula to calculate the neural net’s output

For the sake of this example, we are going to take the bias value = 0 for simplicity of calculation.

a. Let’s take W = 3 and check the predicted output.

Figure 12: The output when W = 3

b. After we have found the value of predicted output for W=3, we are going to compare it with our target output, and by doing that, we can find the error in the prediction model. Keep in mind that our goal is to achieve minimum error and maximum accuracy for our model.

Figure 13: The error when W = 3

c. Notice that in the above calculation, there is an error in 3 out of 4 predictions. So we have to change the value of our weight to bring the error down. Now we have two options:

  1. Increase weight
  2. Decrease weight

First, we are going to increase the value of the weight and check whether it leads to a higher error rate or lower error rate. Here we increased the weight value by 1 and changed it to W = 4.

Figure 14: Output when W = 4

d. As we can see in the figure above, the error in prediction is increasing. So we can conclude that increasing the weight value does not help us reduce the error in prediction.

Figure 15: Error when W = 4

e. Since increasing the weight value did not help us, we are going to decrease it instead. By doing that, we can see whether it helps or not.

Figure 16: Output when W = 2

f. Calculate the error in prediction. Here we can see that we have achieved the global minimum.

Figure 17: Error when W = 2

In figure 17, we can see that there is no error in prediction.

Now what we did here:

  1. First, we have our input values and target outputs.
  2. Then we initialized W with some random value and proceeded further.
  3. Last, we calculated the error in prediction for that weight value. Afterward, we updated the weight and predicted the output again. After several trial-and-error epochs, we can reduce the error in prediction.

Figure 18: Illustrating our function

So, we are trying to get the value of weight such that the error becomes minimal. We need to figure out whether we should increase or decrease the weight value. Once we know that, we keep updating the weight value in that direction until the error becomes minimal. We might reach a point where further updates to the weight increase the error; at that point, we stop, and that is our final weight value.

In real-life data, the situation can be a bit more complex. In the example above, we saw that we could try different weight values and get the minimum error manually. However, in real-life data, weight values are often decimal (non-integer). Therefore, we are going to use a gradient descent algorithm with a low learning rate so that we can try different weight values and obtain the best predictions from our model.
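
To mirror this manual search in code, here is a small sketch. Since the data from figure 10 is not reproduced here, the inputs and targets below are assumptions, chosen so that W = 2 gives zero error, as in the example:

```python
# Try candidate weights manually, as in figures 12-17.
# Assumed data: targets are twice the inputs, so W = 2 is the best weight.
inputs = [1, 2, 3, 4]
targets = [2, 4, 6, 8]

for W in [2, 3, 4]:
    predictions = [W * x for x in inputs]
    error = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    print(f"W = {W}: total squared error = {error}")
```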

Figure 19: Formula representing the final

6. Sigmoid Function:

A sigmoid function serves as the activation function in our neural network training. We generally use neural networks for classification, and in binary classification the output should be one of 2 classes: 0 or 1. However, the output value from the equation we used can be any number. To solve that problem, we use the sigmoid function, which converts our output values to lie between 0 and 1.

Let’s have a look at it:

Figure 20: Sigmoid function

Let’s visualize our sigmoid function with Python:

Figure 21: Python code for the sigmoid function
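
The code in figure 21 renders as an image; a minimal sketch along the same lines, using NumPy and Matplotlib, looks like this:

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    # Squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-x))

x = np.linspace(-10, 10, 200)
plt.plot(x, sigmoid(x))
plt.xlabel("x")
plt.ylabel("sigmoid(x)")
plt.title("Sigmoid function")
plt.show()
```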

Output:

Figure 22: Sigmoid function graph

Explanation:

As figures 21 and 22 show, for any input value, the output of the sigmoid function always lies between 0 and 1. Notice that for negative numbers, the output of the sigmoid function is below 0.5, that is, closer to zero, and for positive numbers, the output is above 0.5, closer to 1.

7. Neural Network Implementation from Scratch:

What we are going to do is implement the “OR” logic gate using a perceptron. Keep in mind that here we are not going to use any hidden layers.

What is logical OR Gate?

Straightforwardly, when at least one of the inputs is 1, the output of the OR gate is 1. The output is 0 only when both of the inputs are 0.

Representation:

Figure 23: The OR gate

Truth-Table for OR gate:

Figure 24: Set of truth-table data for the OR gate
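
Since figure 24 renders as an image, here is the OR gate truth table it depicts:

x1 | x2 | target output
0  | 0  | 0
0  | 1  | 1
1  | 0  | 1
1  | 1  | 1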

Perceptron for the OR gate:

Figure 25: A perceptron

Next, we are going to assign some weights to the input values and calculate the predicted output.

Figure 26: A weighted perceptron

Example: (Calculating Manually)

a. Calculate the input for o1:

Figure 27: Formula to calculate the input for o1

b. Calculate the output value:

Figure 28: Formula to calculate the output value
Figure 29: Result output value

Notice from our truth table that we wanted an output of 1, but what we got here is 0.68997. Now we need to calculate the error, backpropagate, and update the weight values.
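
Since the weights in figure 26 are not reproduced here, the values below are assumptions: any pair of weights summing to 0.8 reproduces the 0.68997 output for the input row (1, 1). A minimal sketch of this forward pass:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# OR-gate input row (1, 1); the target output is 1.
x1, x2 = 1, 1
w1, w2 = 0.4, 0.4  # assumed weights; figure 26's values are not shown here

in_o1 = x1 * w1 + x2 * w2   # weighted input, as in figure 27
out_o1 = sigmoid(in_o1)     # activation, as in figure 28
print(out_o1)               # ≈ 0.68997, matching figure 29
```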

c. Error Calculation:

Next, we are going to use the Mean Squared Error (MSE) to calculate the error:

Figure 30: Mean squared error formula

The summation sign (sigma) means that we have to add up the error for all our input sets. Here we are going to see how that works for just one input set.

Figure 31: Result of the MSE

We have to do the same for all the remaining inputs. Now that we have found the error, we have to update the values of weight to make the error minimum. For updating weight values, we are going to use a gradient descent algorithm.
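
For a single input set, the calculation in figure 31 can be checked with a few lines of Python, assuming the 1/2-scaled squared-error form (which matches the derivative used in the derivation below):

```python
target = 1
output = 0.68997

# Squared error for one input set, with the 1/2 factor
error = 0.5 * (target - output) ** 2
print(error)  # ≈ 0.04806
```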

8. What is Gradient Descent?

Gradient descent is a machine learning algorithm that operates iteratively to find the optimal values for its parameters. It takes into account a user-defined learning rate and initial parameter values.

Working: (Iterative)

1. Start with initial values.

2. Calculate cost.

3. Update values using the update function.

4. Return the minimized cost for our cost function.

Why do we need it?

Generally, what we do is derive a formula that gives us the optimal values for our parameters. With this algorithm, however, the algorithm finds the values by itself!

Interesting, isn’t it?

Figure 32: Formula for the gradient descent algorithm

We are going to update our weight with this algorithm. First of all, we need to find the derivative of f(X).
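
As a minimal sketch of how this works, assuming the standard update rule x ← x − learning_rate · f′(x) shown in figure 32, consider minimizing a simple one-parameter cost function:

```python
def f_prime(x):
    # Derivative of the example cost f(x) = (x - 3)^2, minimized at x = 3
    return 2 * (x - 3)

x = 0.0              # initial value
learning_rate = 0.1  # user-defined learning rate

for _ in range(100):
    x = x - learning_rate * f_prime(x)  # update: x <- x - lr * f'(x)

print(x)  # ≈ 3.0, the value that minimizes the cost
```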

9. Derivation of the formula used in a neural network

Next, what we want to find is how a particular weight value affects the error. To find that, we are going to apply the chain rule.

Figure 33: Finding the derivative

Afterward, we have to find the values of these three derivatives.

In the following images, we have tried to show the derivation of each of these derivatives to showcase the math behind gradient descent.
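
Written out in the article’s notation, with w2 as the example weight, the chain-rule decomposition in figure 33 is:

∂E/∂w2 = (∂E/∂outo1) × (∂outo1/∂ino1) × (∂ino1/∂w2)

Sections d, e, and f below work out these three factors one at a time.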

d. Calculating derivatives:

Figure 34: Calculating the derivatives

In our case:

Output = 0.68997
Target = 1

Figure 35: Finding the first derivative
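
Substituting the values above (target = 1, output = 0.68997) into this first factor gives:

∂E/∂outo1 = −(target − output) = −(1 − 0.68997) = −0.31003

(The leading 1/2 in the squared-error term cancels the 2 produced by the power rule.)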

e. Finding the second part of the derivative:

Figure 36: Calculating the second part

To understand it step-by-step:

e.a. Value of outo1:

Figure 37: Value of outo1

e.b. Finding the derivative with respect to ino1:

Figure 38: Derivative of outo1 with respect to ino1

e.c. Simplifying it a bit to find the derivative easily:

Figure 39: Simplification

e.d. Applying chain rule and power rule:

Figure 40: Applying the chain rule, along with power rule

e.e. Applying sum rule:

Figure 41: Applying sum rule to outo1 with respect to ino1

e.f. The derivative of a constant is zero:

Figure 42: Derivative of the constant is zero

e.g. Applying exponential rule and chain rule:

Figure 43: Applying exponential rule and a chain rule

e.h. Simplifying it a bit:

Figure 44: Simplifying the derivative

e.i. Multiplying both negative signs:

Figure 45: Multiplication of both negations

e.j. Put the negative power in the denominator:

Figure 46: Moving the negative power to the denominator

That is it. However, we need to simplify it, as it is a little too complex for our machine learning algorithm to process for a large number of inputs.

e.k. Simplifying it:

Figure 47: Simplifying the algorithm

e.l. Further simplification:

Figure 48: Step two of the simplification

e.m. Adding +1 − 1:

Figure 49: Adding the values

e.n. Separate the parts:

Figure 50: Separating the algorithm

e.o. Simplify:

Figure 51: Simplify the separation

e.p. Now, we know the value of outo1 from equation 1:

Figure 52: Value from outo1

e.q. From that, we can derive the following final derivative:

Figure 53: Deriving the final derivative

e.r. Calculating the value for our input:

Figure 54: Final calculation of the output
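
Plugging outo1 = 0.68997 into the final form outo1 × (1 − outo1):

∂outo1/∂ino1 = 0.68997 × (1 − 0.68997) ≈ 0.21391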

f. Finding the third part of the derivative:

Figure 55: Formula to calculate the third derivative

f.a. Value of ino1:

Figure 56: Value of ino1

f.b. Finding the derivative:

All the other values except w2 will be considered constant here.

Figure 57: Finding the derivative

f.c. Calculating both values for our input:

Figure 58: Calculating both values for the input
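
To make this concrete: with ino1 = x1·w1 + x2·w2 and everything except w2 treated as a constant, the derivative is simply ∂ino1/∂w2 = x2, which equals 1 for the input row (1, 1) assumed earlier.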

f.d. Putting it all together:

Figure 59: Calculating it as a whole

f.e. Putting it in our main equation:

Figure 60: Putting it into the main equation

f.f. We can calculate:

Figure 61: Calculation of the second weight

Notice that the value of the weight has increased here. We can calculate all the values in this way, but as we can see, it is going to be a lengthy process. So now we are going to implement all the steps in Python.
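
To preview that implementation, here is a hedged sketch of a single gradient descent update for w2, reusing the assumed weights and input from the forward-pass sketch above, together with an illustrative learning rate:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Assumed values: OR-gate input (1, 1) with target 1, weights 0.4 each,
# and an illustrative learning rate of 0.5.
x1, x2, target = 1, 1, 1
w1, w2 = 0.4, 0.4
learning_rate = 0.5

out_o1 = sigmoid(x1 * w1 + x2 * w2)   # forward pass: ≈ 0.68997

# Chain rule: dE/dw2 = (dE/dout) * (dout/din) * (din/dw2)
dE_dout = -(target - out_o1)          # ≈ -0.31003
dout_din = out_o1 * (1 - out_o1)      # ≈ 0.21391
din_dw2 = x2                          # = 1

dE_dw2 = dE_dout * dout_din * din_dw2  # ≈ -0.06632
w2 = w2 - learning_rate * dE_dw2       # the gradient is negative, so w2 increases
print(w2)                              # ≈ 0.43316
```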