Why do we use a sigmoid activation function in artificial neural networks (ANNs)?

After working through myriad intro-level tutorials on Machine Learning (ML) in fits and starts over a period of two years, I finally have some basic understanding of how neural networks work. Running an ML algorithm wasn’t that difficult with Jupyter notebooks on a rented GPU server; the hard part was wrapping my head around what happens under the hood and why. I could always follow the explanation from perceptron to neuron, but it was at the introduction of the sigmoid activation function that my mind would go blank, and everything afterwards about backpropagation and gradient descent was hazy.

Perceptron / Neuron

The perceptron is easy. A perceptron gets inputs, say (x1, x2, x3), which are multiplied by weights (w) and summed up. The sum is compared with a threshold (b). If the weighted sum exceeds the threshold, the perceptron fires a signal.

In neural networks, the threshold is called a bias, and the firing rule can be re-written as

output = 1 if w.x + b > 0, and output = 0 otherwise

where w.x is the dot product of the input and weight vectors and b is the bias.
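As a quick sketch of that rule (the function name and the numbers below are mine, chosen purely for illustration, not taken from any library or tutorial), a perceptron fits in a few lines of NumPy:

```python
import numpy as np

def perceptron(x, w, b):
    """Fire (return 1) if the weighted sum plus bias is positive, else 0."""
    z = np.dot(w, x) + b          # w.x + b
    return 1 if z > 0 else 0

# Toy example: two inputs with hand-picked weights and bias
x = np.array([1.0, 0.5])
w = np.array([0.6, -0.4])
b = -0.1
print(perceptron(x, w, b))        # prints 1, since 0.6*1.0 - 0.4*0.5 - 0.1 = 0.3 > 0
```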

Sigmoid Activation Function

The next step is to apply an activation function to the output. In all the tutorials, a sigmoid function is applied:

output = σ(w.x + b)

where the sigmoid function is defined as

σ(z) = 1 / (1 + e^(-z))

with z = w.x + b.
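A minimal sketch of such a sigmoid neuron (again with illustrative names and made-up numbers of my own, not a library API):

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_neuron(x, w, b):
    """Weighted sum plus bias, passed through the sigmoid."""
    z = np.dot(w, x) + b
    return sigmoid(z)

x = np.array([1.0, 0.5])
w = np.array([0.6, -0.4])
b = -0.1
print(sigmoid_neuron(x, w, b))    # ~0.574, instead of a hard 0 or 1 decision
```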

What I always had trouble understanding was: where does this sigmoid function come from? Why is it needed?

The answer, it seems, is that the perceptron rule is a linear equation. If we have multiple layers of neurons, as in a deep neural network, with the output of one layer acting as the input of the next, then without activation functions all we get is a linear transformation. It doesn’t matter how many layers the network has: composing linear functions gives you another linear function, so the whole stack effectively behaves like a single perceptron. That is fine if we are modeling a linear function, but the real world we are trying to model has non-linearity in it. Hence we introduce a non-linear activation function, i.e. to bring non-linearity into the model.
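A small numerical sketch of that collapse (the weight matrices here are arbitrary values generated only for illustration): two linear layers applied in sequence give exactly the same result as one combined linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation function, just weights and biases
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Forward pass through both layers
h = W1 @ x + b1
y_two_layers = W2 @ h + b2

# Exactly equivalent single linear layer: W = W2 @ W1, b = W2 @ b1 + b2
W, b = W2 @ W1, W2 @ b1 + b2
y_one_layer = W @ x + b

print(np.allclose(y_two_layers, y_one_layer))   # True
```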

Consider a multi-layered network: when the model doesn’t produce the desired output, we need to modify the weights and biases. Without an activation function, a slight change in the weights can flip the output of a neuron from 0 to 1, which is a huge change. With each neuron feeding into multiple neurons in the next layer, a few more neurons flip, and so on, with the result that a slight change in weights and biases leads to an unpredictable change in the final output. Hence, an activation function is applied to the output of the neuron so that a small change in weights and biases results in a small change in the output. The sigmoid function is one such function: its input can be any value from –infinity to +infinity, yet its output is always between 0 and 1.

In addition, it is similar to the step function but a lot smoother.

This smoothness is what enables us to make small changes in weights and biases to get small changes in the final output.
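To make that concrete (a toy comparison I added, not from the original tutorials): nudging the weighted sum z by a tiny amount barely moves the sigmoid output, while the step function can jump all the way from 0 to 1.

```python
import numpy as np

def step(z):
    return 1.0 if z > 0 else 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A weighted sum just below the threshold, then nudged just above it
z_before, z_after = -0.001, 0.001

print(step(z_before), step(z_after))        # 0.0 1.0  -> the neuron flips completely
print(sigmoid(z_before), sigmoid(z_after))  # ~0.49975 ~0.50025 -> barely changes
```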

An additional advantage of the sigmoid function, which matters when studying backpropagation later, is that its derivative is easy to calculate because of the exponential: σ'(z) = σ(z)(1 − σ(z)).
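A quick numerical sanity check of that identity (something I added myself, comparing a finite-difference derivative against σ(z)(1 − σ(z))):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
eps = 1e-6

# Finite-difference approximation of the derivative
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

# The closed-form derivative that makes backpropagation cheap
analytic = sigmoid(z) * (1 - sigmoid(z))

print(np.allclose(numeric, analytic, atol=1e-8))   # True
```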

Once I understood these aspects, I got a peek under the hood of how a simple artificial neural network works. Needless to mention, the sigmoid function isn’t without its drawbacks, such as the vanishing (diminishing) gradient problem: its derivative is always less than 1 (at most 0.25), so gradients shrink as they are multiplied back through many layers.
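As a final illustration of that drawback (a deliberately simplified chain of neurons, with the best-case input z = 0 chosen by me for illustration), multiplying sigmoid derivatives layer after layer quickly shrinks the gradient toward zero:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)      # never larger than 0.25

# A chain of 10 sigmoid neurons; each layer contributes a factor of at most 0.25
grad = 1.0
for layer in range(10):
    grad *= sigmoid_prime(0.0)   # multiply by 0.25 at every layer

print(grad)   # 0.25**10 ≈ 9.5e-07, the gradient has all but vanished
```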