Our loss_function is what calculates “how far off” our classifications are from reality. As humans, we tend to think of things as either right or wrong. With a neural network, and arguably with humans too, correctness is really a sliding scale rather than a binary outcome.
For example, you might be highly confident that something is the case, but be wrong. Compare that to a time when you really aren’t certain either way, lean toward one answer, and are wrong. In both cases the choice itself is simply wrong, but in terms of learning, the degree to which you were wrong matters.
For a machine that learns by tweaking lots of little parameters to slowly get closer and closer to fitting the data, it definitely matters how wrong each prediction is.
For this, we use loss, which is a measurement of how far off the neural network is from the targeted output. There are a few types of loss calculations. A popular one is mean squared error, but our targets here are scalar-valued classes, so we need something suited to that.
In general, you’re going to have two types of classes. One will just be a scalar value; the other is what’s called a one_hot array/vector.
In our case, a zero might be classified as:
0
or [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0] is a one_hot array, where quite literally only one element is a 1 and the rest are zeros. The index that is hot is the classification.
A one_hot vector for a 3 would be:
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
I tend to use one_hot vectors, but this dataset specifies targets as scalar classes, so 0, or 1, or 2…and so on.
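If you ever want to move between the two representations, recent versions of PyTorch have a helper for it. This is just an illustrative aside, not something the tutorial’s code requires:

import torch
import torch.nn.functional as F

# A scalar class label for the digit 3...
label = torch.tensor(3)

# ...and its one_hot equivalent over the 10 digit classes.
print(F.one_hot(label, num_classes=10))  # tensor([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])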
Depending on what your targets look like, you will need a specific loss. For one_hot vectors, I tend to use mean squared error. For scalar classifications like these, I use cross entropy.
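As a quick sketch of what that choice looks like in PyTorch (assuming torch.nn is imported as nn, as earlier in this series):

import torch.nn as nn

# For scalar class targets (0, 1, 2, ...), cross entropy is the usual choice.
loss_function = nn.CrossEntropyLoss()

# If the targets were one_hot style instead, mean squared error is an option:
# loss_function = nn.MSELoss()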
Next, we have our optimizer. This is the thing that adjusts our model’s adjustable parameters, like the weights, to slowly, over time, fit our data. I am going to have us use Adam, short for Adaptive Moment Estimation. This is usually the standard go-to optimizer. There’s a newer one called Rectified Adam that is gaining steam. I haven’t had the chance yet to make use of it in any project, and I do not think it’s available as just an importable function in Pytorch yet, but keep your eyes peeled for it! For now, Adam will do just fine, I’m sure. The other thing here is lr, which is the learning rate. A good number to start with here is 0.001, or 1e-3. The learning rate dictates the magnitude of the changes that the optimizer can make at a time. The larger the lr, the quicker the model can learn, but you might also find that the steps you allow the optimizer to make are too big, and the optimizer gets stuck bouncing around rather than improving. Too small, and the model can take much longer to learn, and can also get stuck.
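Putting that together, a minimal sketch (assuming a network object called net, as built earlier in this series):

import torch.optim as optim

# Adam optimizer over all of the network's trainable parameters,
# with a learning rate of 0.001 (1e-3).
optimizer = optim.Adam(net.parameters(), lr=0.001)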
Imagine the learning rate as the “size of steps” that the optimizer can take as it searches for the bottom of a mountain, where the path to the bottom isn’t necessarily a simple straight path down. Here’s some lovely imagery to help explain learning rate:
The black line is the “path” to the bottom of the optimization curve. When it comes to optimizing, sometimes you have to get worse in order to get beyond some local optimum. The optimizer doesn’t know where the absolute best spot is; it just takes steps and sees whether it can find it. Thus, as you can see in the image, if the steps are too big, it will never reach the lower points. If the steps are too small (learning rate too small), it can also get stuck long before it reaches a bottom. The goal is for something more like:
For simpler tasks, a learning rate of 0.001 is usually more than fine. For more complex tasks, you will often see a learning rate with what’s called a decay. Basically, you start the learning rate at something like 0.001, or 0.01…etc., and then, over time, the learning rate gets smaller and smaller. The idea is that you can train quickly at first and then take smaller and smaller steps, hopefully getting the best of both worlds:
More on learning rates and decay later. For now, 0.001 will work just fine, and you just need to think of the learning rate as what it sounds like: how quickly we ask the optimizer to optimize things.
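If you’re curious what decay can look like in code, PyTorch ships learning rate schedulers. The StepLR scheduler and the numbers below are purely illustrative choices, not part of this tutorial’s model:

from torch.optim.lr_scheduler import StepLR

# Multiply the learning rate by 0.1 every 5 epochs.
scheduler = StepLR(optimizer, step_size=5, gamma=0.1)

# You would then call scheduler.step() once per epoch, after training.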
Now we can iterate over our data. In general, you will make more than just 1 pass through your entire training dataset.
Each full pass through your dataset is referred to as an epoch. In general, you will probably have somewhere between 3 and 10 epochs, but there’s no hard rule here.
Too few epochs, and your model won’t learn everything it could have.
Too many epochs, and your model will overfit to your in-sample data (basically memorizing the in-sample data and performing poorly on out-of-sample data).
Let’s go with 3 epochs for now. So we will loop over epochs, and each epoch will loop over our data. Something like:
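Here is a minimal sketch of that loop. It assumes the net and trainset objects built earlier in this series, the loss_function and optimizer from above, and that the inputs are 28x28 images that need flattening before the forward pass:

EPOCHS = 3  # three full passes through the training data

for epoch in range(EPOCHS):
    for data in trainset:  # each `data` is a batch of features and labels
        X, y = data
        net.zero_grad()  # reset gradients from the previous step
        output = net(X.view(-1, 28*28))  # flatten the 28x28 images for the forward pass
        loss = loss_function(output, y)  # how far off were we on this batch?
        loss.backward()  # backpropagate the loss through the network's parameters
        optimizer.step()  # adjust the weights to reduce the loss
    print(loss)  # loss on the final batch of the epoch; we hope to see it shrink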