Neural Network From Scratch

Neural Network From Scratch

Jan 2022

In this edition of Napkin Math, we’ll invoke the spirit of the Napkin Math
series to establish a mental model for how a neural network works by building
one from scratch. In a future issue we will do napkin math on performance, as
establishing the first-principle understanding is plenty of ground to cover for
today!

Neural nets are increasingly dominating the field of machine learning / artificial
intelligence: the most sophisticated models for computer vision (e.g. CLIP),
natural language processing (e.g. GPT-3), translation (e.g. Google Translate),
and more are based on neural nets. When these artificial neural nets reach some
arbitrary threshold of neurons, we call it deep learning.

A visceral example of Deep Learning’s unreasonable effectiveness comes from
this interview with Jeff Dean who leads AI at Google. He explains how
500 lines of Tensorflow outperformed the previous ~500,000 lines of code for
Google Translate’s extremely complicated model. Blew my mind. 1

As a software developer with a predominantly web-related skillset of Ruby,
databases, enough distributed systems knowledge to know to not get fancy, a bit
of hard-earned systems knowledge from debugging incidents, but only high school
level math: neural networks mystify me. How do they work? Why are they so
good? Why are they so slow? Why are GPUs/TPUs used to speed them up? Why do the
biggest models have more neurons than humans, yet still perform worse than the
human brain? 2

In true napkin math fashion, the best course of action to answer those questions
is by implementing a simple neural net from scratch.

Mental Model for a Neural Net: Building one from scratch

The hardest part of napkin math isn’t the calculation itself: it’s acquiring the
conceptual understanding of a system to come up with an equation for its
performance. Presenting and testing mental models of common systems is the crux
of value from the napkin math series!

The simplest neural net we can draw might look something like this:

lol

  • Input layer. This is a representation of the data that we want to feed to
    the neural net. For example, the input layer for a 4×4 pixel grayscale image
    that looks like this [1, 1, 1, 0.2] could be [1, 1, 1, 0.2]. Meaning the first 3 pixels are darkest (1.0) and the last pixel is
    lighter (0.2).
  • Hidden Layer. This is the layer that does a bunch of math on the input
    layer to convert it to our prediction. Training a model refers to changing the
    math of the hidden layer(s) to more often create an output like the training
    data. We will go into more detail with this layer in a moment. The values in the
    hidden layer are called weights.
  • Output Layer. This layer will contain our final prediction. For example,
    if we feed it the rectangle from before [1, 1, 1, 0.2] we
    might want the output layer to be a single number to represent how “dark” a
    rectangle is, e.g.: 0.8.

For example for the image [0.8, 0.7, 1, 1] = [0.8, 0.7, 1, 1] we’d expect a value close to 1 (dark!).

In contrast, for [0.2, 0.5, 0.4, 0.7] = [0.2, 0.5, 0.4, 0.7] we
expect something closer to 0 than to 1.

Let’s implement a neural network from our simple mental model. The goal of this
neural network is to take a grayscale 2×2 image and tell us how “dark” it is
where 0 is completely white [0, 0, 0, 0], and 1 is
completely black [1, 1, 1, 1]. We will initialize the
hidden layer with some random values at first, in Python:

input_layer 

=

[

0.2

,

0.5

,

0.4

,

0.7

]

hidden_layer

=

[

0.98

,

0.4

,

0.86

,

-

0.08

]

output_neuron

=

0

for

index

,

input_neuron

in

enumerate

(

input_layer

)

:

output_neuron

+=

input_neuron

*

hidden_layer

[

index

]

print

(

output_neuron

)

Our neural network is giving us model([0.2, 0.5, 0.4, 0.7]) = 0.7 which is closer to ‘dark’ (1.0) than ‘light’ (0.0). When looking
at this rectangle as a human, we judge it to be more bright than dark, so we
were expecting something below 0.5!

There’s a notebook with the final code available. You can make a copy and execute it there. For early versions of the code, such as the above, you can create a new cell at the beginning of the notebook and build up from there!

The only real thing we can change in our neural network in its current form is
the hidden layer’s values. How do we change the hidden layer values so that the
output neuron is close to 1 when the rectangle is dark, and close to 0 when it’s
light?

We could abandon this approach and just take the average of all the pixels. That
would work well! However, that’s not really the point of a neural net… We’ll
hit an impasse if we one day expand our model to try to implement
recognize_letters_from_picture(img) or is_cat(img).

Fundamentally, a neural network is just a way to approximate any function. It’s
really hard to sit down and write is_cat, but the same technique we’re using
to implement average through a neural network can be used to implement
is_cat. This is called the universal approximation theorem: an
artificial neural network can approximate any function!

So, let’s try to teach our simple neural network to take the average() of the
pixels instead of explicitly telling it that that’s what we want! The idea of
this walkthrough example is to understand a neural net with very few values and
low complexity, otherwise it’s difficult to develop an intuition when we move to
1,000s of values and 10s of layers, as real neural networks have.

We can observe that if we manually modify all the hidden layer attributes to
0.25, our neural network is actually an average function!

input_layer 

=

[

0.2

,

0.5

,

0.4

,

0.7

]

hidden_layer

=

[

0.25

,

0.25

,

0.25

,

0.25

]

output_neuron

=

0

for

index

,

input_neuron

in

enumerate

(

input_layer

)

:

output_neuron

+=

input_neuron

*

hidden_layer

[

index

]

print

(

output_neuron

)

print

(

sum

(

input_layer

)

/

4

)

model([0.2, 0.5, 0.4, 0.7]) = 0.45 sounds about right. The
rectangle is a little lighter than it’s dark.

But that was cheating! We only showed that we can implement average() by
simply changing the hidden layer’s values. But that won’t work if we try to implement
something more complicated. Let’s go back to our original hidden layer
initialized with random values:

hidden_layer 

=

[

0.98

,

0.4

,

0.86

,

-

0.08

]

How can we teach our neural network to implement average?

Training our Neural Network

To teach our model, we need to create some training data. We’ll create some
rectangles and calculate their average:

rectangles 

=

[

]

rectangle_average

=

[

]

for

i

in

range

(

0

,

1000

)

:

rectangle

=

[

round

(

random

.

random

(

)

,

1

)

,

round

(

random

.

random

(

)

,

1

)

,

round

(

random

.

random

(

)

,

1

)

,

round

(

random

.

random

(

)

,

1

)

]

rectangles

.

append

(

rectangle

)

rectangle_average

.

append

(

sum

(

rectangle

)

/

4

)

Brilliant, so we can now feed these to our little neural network and get a
result! Next step is for our neural network to adjust the values in the hidden
layer based on how its output compares with the actual average in the training
data. This is called our loss function: large loss, very wrong model; small
loss, less wrong model. We can use a standard measure called mean squared
error:

 
 
 

def

mean_squared_error

(

actual

,

expected

)

:

error_sum

=

0

for

a

,

b

in

zip

(

actual

,

expected

)

:

error_sum

+=

(

a

-

b

)

**

2

return

error_sum

/

len

(

actual

)

print

(

mean_squared_error

(

[

1.

]

,

[

2.

]

)

)

print

(

mean_squared_error

(

[

1.

]

,

[

3.

]

)

)

Now we can implement train():

def

model

(

rectangle

,

hidden_layer

)

:

output_neuron

=

0.

for

index

,

input_neuron

in

enumerate

(

rectangle

)

:

output_neuron

+=

input_neuron

*

hidden_layer

[

index

]

return

output_neuron

def

train

(

rectangles

,

hidden_layer

)

:

outputs

=

[

]

for

rectangle

in

rectangles

:

output

=

model

(

rectangle

,

hidden_layer

)

outputs

.

append

(

output

)

return

outputs hidden_layer

=

[

0.98

,

0.4

,

0.86

,

-

0.08

]

outputs

=

train

(

rectangles

,

hidden_layer

)

print

(

outputs

[

0

:

10

]

)

print

(

rectangle_average

[

0

:

10

]

)

mean_squared_error

(

outputs

,

rectangle_average

)

A good mean squared error is close to 0. Our model isn’t very good. But! We’ve
got the skeleton of a feedback loop in place for updating the hidden layer.

Updating the Hidden Layer with Gradient Descent

Now what we need is a way to update the hidden layer in response to the mean
squared error / loss. We need to minimize the value of this function:

mean_squared_error

(

train

(

rectangles

,

hidden_layer

)

,

rectangle_average

)

As noted earlier, the only thing we can really change here are the weights in
the hidden layer. How can we possibly know which weights will minimize this
function?

We could randomize the weights, calculate the loss (how wrong the model is,
in our case, with mean squared error), and then save the best ones we see after
some period of time.

We could possibly speed this up. If we have good weights, we could try to add
some random numbers to those. See if loss improves. This could work, but it
sounds slow… and likely to get stuck in some local maxima and not give a very
good result. And it’s trouble scaling this to 1,000s of weights…

Instead of embarking on this ad-hoc randomization mess, it turns out that
there’s a method called gradient descent to minimize the value of a function!
Gradient descent builds on a bit of calculus that you may not have touched on
since high school. We won’t go into depth here, but try to introduce just
enough that you understand the concept. 3

Let’s try to understand gradient descent. Consider some random function whose
graph might look like this:

Graph of a random function with some irregular shapesGraph of a function with an irregular curve with a local and global minimum.

How do we write code to find the minimum, the deepest (second) valley, of this function?

Let’s say that we’re at x=1 and we know the slope of the function at this
point. The slope is “how fast the function grows at this very point.” You may
remember this as the derivative. The slope at x=1 might be -1.5. This
means that every time we increase x += 1, it results in y -= 1.5. We’ll go
into how you figure out the slope in a bit, let’s focus on the concept first.

Graph function with some slope or derivative

The idea of gradient descent is that since we know the value of our function,
y, is decreasing as we increase x, we can increase x proportionally to the
slope. In other words, if we increase x by the slope, we step towards the
valley by 1.5.

Let’s take that step of x += 1.5:

Overshooting in gradient descent

Ugh, turned out that we stepped too far, past this valley! If we repeat the
step, we’ll land somewhere on the left side of the valley, to then bounce back
on the right side. We might never land in the bottom of the valley. Bummer.
Either way, this isn’t the global minimum of the function. We return to that
in a moment!

We can fix the overstepping easily by taking smaller steps. Perhaps we should’ve
stepped by just 0.1∗1.5=0.150.1 * 1.5 = 0.150.1∗1.5=0.15 instead. That would’ve smoothly landed us at
the bottom of the valley. That multiplier, 0.1, is called the learning rate
in gradient descent.

Minimum of function with gradient descent

But hang on, that’s not actually the minimum of the function. See that valley to
the right? That’s the actual global minimum. If our initial x value had been
e.g. 3, we might have found the global minimum instead of our local minimum.

Finding the global minimum of a function is hard. Gradient descent will give
us a minimum, but not the minimum. Unfortunately, it turns out it’s the best
weapon we have at our disposal. Especially when we have big, complicated
functions (like a neural net with millions of neurons). Gradient descent will
not always find the global minimum, but something pretty good.

This method of using the slope/derivative generalizes. For example, consider
optimizing a function in three-dimensions. We can visualize the gradient descent
method here as rolling a ball to the lowest point. A big neural network is
1000s of dimensions, but gradient descent still works to minimize the loss!

Depicts a 3-dimensional graph, if we do gradient descent on this we might imagine it as rolling a ball down the hill.Depicts a 3-dimensional graph, if we do gradient descent on this we might imagine it as rolling a ball down the hill.

Finalizing our Neural Network from scratch

Let’s summarize where we are:

  • We can implement a simple neural net: model().
  • Our neural net can figure out how wrong it is for a training set: loss(train()).
  • We have a method, gradient descent, for tuning our hidden layer’s weights
    for the minimum loss. I.e. we have a method to adjust those four random values
    in our hidden layer to take a better average as we iterate through the
    training data.

Now, let’s implement gradient descent and see if we can make our neural net
learn to take the average grayscale of our small rectangles:

def

model

(

rectangle

,

hidden_layer

)

:

output_neuron

=

0.

for

index

,

input_neuron

in

enumerate

(

rectangle

)

:

output_neuron

+=

input_neuron

*

hidden_layer

[

index

]

return

output_neuron

def

train

(

rectangles

,

hidden_layer

)

:

outputs

=

[

]

for

rectangle

in

rectangles

:

output

=

model

(

rectangle

,

hidden_layer

)

outputs

.

append

(

output

)

mean_squared_error

(

outputs

,

rectangle_average

)

for

index

,

_

in

enumerate

(

hidden_layer

)

:

learning_rate

=

0.1

hidden_layer

[

index

]

-=

learning_rate

*

hidden_layer

[

index

]

.

slope

return

outputs hidden_layer

=

[

0.98

,

0.4

,

0.86

,

-

0.08

]

train

(

rectangles

,

hidden_layer

)

Automagically computing the slope of a function with autograd

The missing piece here is to figure out the slope() after we’ve gone through
our training set. Figuring out the slope/derivative at a certain point is
tricky. It involves a fair bit of math. I am not going to go into the math of
calculating derivatives. Instead, we’ll do what all the machine learning
libraries do: automatically calculate it. 4

Minimizing the loss of a function is absolutely fundamental to machine learning.
The functions (neural networks) are so complicated that manually sitting down
to figure out the derivative like you might’ve done in high school is not
feasible. It’s the mathematical equivalent of writing assembly to implement a
website.

Let’s show one simple example of finding the derivative of a function, before we
let the computers do it all for us. If we have f(x)=x2f(x) = x^2f(x)=x2, then you might
remember from calculus classes that the derivative is f′(x)=2xf'(x) = 2xf′(x)=2x. In other
words, f(x)f(x)f(x)’s slope at any point is 2x, telling us it’s increasing
non-linearly. Well that’s exactly how we understand x2x^2×2, perfect! This means
that for x=2x = 2x=2 the slope is 444.

With the basics in order, we can use an autograd package to avoid the messy
business of computing our own derivatives. autograd is an automatic
differentiation engine. grad stands for gradient, which we can think of as the
derivative/slope of a function with more than one parameter.

It’s best to show how it works by using our example from before:

import

torch x

=

torch

.

tensor

(

2.

,

requires_grad

=

True

)

y

=

x

**

2

y

.

backward

(

)

print

(

x

.

grad

)

autograd is the closest to magic we get. I could do the most ridiculous stuff
with this tensor, and it’ll keep track of all the math operations applied and
have the ability to compute the derivative. We won’t go into how. Partly because
I don’t know how, and this post is long enough.

Just to convince you of this, we can be a little cheeky and do a bunch of random
stuff. I’m trying to really hammer this home, because this is what confused me
the most when learning about neural networks. It wasn’t obvious to me that a
neural network, including executing the loss function on the whole training set,
is just a function, and however complicated, we can still take the derivative
of it and use gradient descent. Even if it’s so many dimensions that it can’t be
neatly visualized as a ball rolling down a hill.

autograd doesn’t complain as we add complexity and will still calculate the
gradients. In this example we’ll even use a matrix/tensor with a few more elements and
calculate an average (like our loss function mean_squared_error), which is the
kind of thing we’ll calculate the gradients for in our neural network:

import

random

import

torch x

=

torch

.

tensor

(

[

0.2

,

0.3

,

0.8

,

0.1

]

,

requires_grad

=

True

)

y

=

x

for

_

in

range

(

3

)

:

choice

=

random

.

randint

(

0

,

2

)

if

choice

==

0

:

y

=

y

**

random

.

randint

(

1

,

10

)

elif

choice

==

1

:

y

=

y

.

sqrt

(

)

elif

choice

==

2

:

y

=

y

.

atanh

(

)

y

=

y

.

mean

(

)

y

.

backward

(

)

print

(

x

.

grad

)

Let’s use autograd for our neural net and then run it against our square from
earlier model([0.2, 0.5, 0.4, 0.7]) = 0.45:

import

torch

as

torch

def

model

(

rectangle

,

hidden_layer

)

:

output_neuron

=

0.

for

index

,

input_neuron

in

enumerate

(

rectangle

)

:

output_neuron

+=

input_neuron

*

hidden_layer

[

index

]

return

output_neuron

def

train

(

rectangles

,

hidden_layer

)

:

outputs

=

[

]

for

rectangle

in

rectangles

:

output

=

model

(

rectangle

,

hidden_layer

)

outputs

.

append

(

output

)

error

=

mean_squared_error

(

outputs

,

rectangle_average

)

error

.

backward

(

)

for

index

,

_

in

enumerate

(

hidden_layer

)

:

learning_rate

=

0.1

hidden_layer

.

data

[

index

]

-=

learning_rate

*

hidden_layer

.

grad

.

data

[

index

]

hidden_layer

.

grad

.

zero_

(

)

return

error hidden_layer

=

torch

.

tensor

(

[

0.98

,

0.4

,

0.86

,

-

0.08

]

,

requires_grad

=

True

)

print

(

model

(

[

0.2

,

0.5

,

0.4

,

0.7

]

,

hidden_layer

)

)

train

(

rectangles

,

hidden_layer

)

print

(

f"After:

{

model

(

[

0.2

,

0.5

,

0.4

,

0.7

]

,

hidden_layer

)

}

"

)

This blew my mind the first time I did this. Look at that. It’s optimizing the
hidden layer for all weights in the right direction! We’re expecting them all
to nudge towards 0.250.250.25 to implement average(). We haven’t told it anything
about average, we’ve just told it how wrong it is through the loss.

It’s important to understand how hidden_layer.grad is set here. The
hidden layer is instantiated as a tensor with an argument telling Pytorch to
keep track of all operations made to it. This allows us to later call backward() on a future tensor that derives from the hidden layer,
in this case, the error tensor, which is further derived from the
outputs tensor. You can read more in the documentation

But, the hidden layer isn’t all 0.250.250.25 quite yet, as we expect for it to
implement average. So how do we get them to that? Well, let’s try to repeat
the gradient descent process 100 times and see if we’re getting even better!

 

for

epoch

in

range

(

100

)

:

error

=

train

(

rectangles

,

hidden_layer

)

print

(

f"Epoch:

{

epoch

}

, Error:

{

error

}

, Layer:

{

hidden_layer

.

data

}

\n\n"

)

print

(

model

(

[

0.2

,

0.5

,

0.4

,

0.7

]

,

hidden_layer

)

.

item

(

)

)

Pretty close, but not quite there. I ran it for 300300300 times (an iteration over
the full training set is referred to as an epoch, so 300 epochs) instead, and
then I got:

print

(

model

(

[

0.2

,

0.5

,

0.4

,

0.7

]

,

hidden_layer

)

.

item

(

)

)

Boom! Our neural net has almost learned to take the average, off by just a
scanty 0.0020.0020.002. If we fine-tuned the learning rate and number of epochs we could
probably get it there, but I’m happy with this. model([0.2, 0.5, 0.4, 0.7]) = 0.448:

That’s it. That’s your first neural net:

model(rectangle)≈avg(rectangle)model(rectangle) \approx avg(rectangle)

m

o

d

e

l

(

rec

t

an

g

l

e

)

a

vg

(

rec

t

an

g

l

e

)

OK, so you just implemented the most complicated average function I’ve ever seen…

Sure did. The thing is, that if we adjusted it for looking for cats, it’s the
least complicated is_cat you’ll ever see. Because our neural network could
implement that too by changing the training data. Remember, a neural network
with enough neurons can approximate any function. You’ve just learned all the
building blocks to do it. We just started with the simplest possible example.

If you give the hidden layer some more neurons, this neural net will be able to
recognize handwritten numbers with decent accuracy (possible fun
exercise for you, see bottom of article), like this one:

An upscaled version of a handdrawn 3 from the 28x28 MNIST dataset.An upscaled version of a handdrawn 3 from the 28×28 MNIST dataset.

Activation Functions

To be truly powerful, there is one paramount modification we have to make to our
neural net. Above, we were implementing the averageaverageaverage function. However, were
our neural net to implement which_digit(png) or is_cat(jpg) then it wouldn’t work.

Recognizing handwritten digits isn’t a linear function, like average(). It’s
non-linear. It’s a crazy function, with a crazy shape (unlike a linear
function). To create crazy functions with crazy shapes, we have to introduce a
non-linear component to our neural network. This is called an activation
function. It can be e.g. ReLu(x)=max(0,x)ReLu(x) = max(0, x)ReLu(x)=max(0,x). There are many kinds of
activation functions that are good for different things.
5

We can apply this simple operation to our neural net:

def

model

(

rectangle

,

hidden_layer

)

:

output_neuron

=

0.

for

index

,

input_neuron

in

enumerate

(

rectangle

)

:

output_neuron

+=

input_neuron

*

hidden_layer

[

index

]

return

max

(

0

,

output_neuron

)

Now, we only have a single neuron/weight… that isn’t much. Good models have
100s, and the biggest models like GPT-3 have billions. So this won’t recognize
many digits or cats, but you can easily add more weights!

Matrices

The core operation in our model, the for loop, is matrix multiplication. We could
rewrite it to use them instead, e.g. rectangle @ hidden_layer. PyTorch will
then do the exact same thing. Except, it’ll now execute in C-land. And if you
have a GPU and pass some extra weights, it’ll execute on a GPU, which is even
faster. When doing any kind of deep learning, you want to avoid writing any
Python loops. They’re just too slow. If you ran the code above for the 300
epochs, you’ll see that it takes minutes to complete. I left matrices out of it
to simplify the explanation as much as possible. There’s plenty going on without
them.

Next steps to implement your own neural net from scratch

Even if you’ve carefully read through this article, you won’t fully grasp it
yet until you’ve had your own hands on it. Here are some suggestions on where to
go from here, if you’d like to move beyond the basic understanding you have now:

  1. Get the notebook running and study the code
  2. Change it to far larger rectangles, e.g. 100×100
  3. Add biases in addition to the weights. A model doesn’t just have
    weights that are multiplied onto the inputs, but also biases that are added
    (+) onto the inputs in each layer.
  4. Rewrite the model to use PyTorch tensors for matrix operations, as
    described in the previous section.
  5. Add 1-2 more layers to the model. Try to have them have different sizes.
  6. Change the tensors to run on GPU (see the PyTorch
    documentation) and see the
    performance speed up! Increase the size of the training set and rectangles to
    really be able to tell the difference. Make sure you change Runtime > Change Runtime Type in Collab to run on a GPU.
  7. This is a difficult step that will likely take a while, but it’ll be well
    worth it: Adapt the code to recognize handwritten letters from the MNIST
    dataset dataset. You’ll need to use pillow to turn
    the pixels into a large 1-dimensional tensor as the input layer, as well as a
    non-linear activation function like Sigmoid or ReLU. Use Nielsen’s
    book as a reference if you get stuck, which does exactly this.

I thoroughly hope you enjoyed this walkthrough of a neural net from scratch! In
a future issue we’ll use the mental model we’ve built up here to do some napkin
math on expected performance on training and using neural nets.

Thanks to Vegard Stikbakke, Andrew Bugera and Jonathan
Belotti for providing valuable feedback on drafts of this article.

Subscribe through email, RSS or Twitter to new articles!

2,907 subscribers

You might also like…