Activation Functions: Why does “tanh” outperform “logistic sigmoid”?
Recently, I was building a LeNet-style CNN on the MNIST dataset using the sigmoid activation function, and I was really surprised by the results it gave. I then tried the same network with the tanh activation function, and it performed far better than sigmoid. That made me go back to a plain MLP to understand why tanh outperforms sigmoid. This blog is written for my future reference and may help anyone with the same doubts.
I assume that readers of this blog already have some idea about activation functions and Multi-Layered Perceptrons. If you are completely new to these concepts, you may not understand this blog; kindly revisit it after you learn the basics of deep learning.
In this blog, I will refer to the logistic sigmoid function simply as sigmoid (the logistic function is the standard choice of sigmoid function).
Main Objective: To understand the performance difference between the tanh and logistic sigmoid activation functions, without using any advanced techniques such as Dropout, Batch Normalization, etc.
Why are activation functions needed?
Purpose of activation function
In simple terms, activation functions introduce non-linearity into the output of neurons. The activation function should be differentiable, otherwise the idea of updating weights via backpropagation, which is the core of deep learning, fails.
Why do we need non-linearity?
(Figure: plots taken from the Laerd Statistics site, used to illustrate the importance of non-linearity.)
Non-linearity, in simple words, can be thought of as "the outcome does not change in proportion to a change in any of the inputs". Let's assume we have data arranged in non-linear shapes (circular or elliptical), as seen in the figure above, and the task is to classify whether a data point belongs to the positive class or the negative class. In such cases, we can't use any linear model.
The activation function applies a non-linear transformation to the input, making the network capable of learning and performing more complex tasks.
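As a quick illustration (a minimal NumPy sketch, separate from the experiments below), both activations and the derivatives that backpropagation relies on can be written in a few lines:

```python
import numpy as np

def sigmoid(x):
    # logistic sigmoid: maps any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    # derivative of the sigmoid, used during backpropagation
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_prime(x):
    # derivative of tanh (np.tanh maps any real input into (-1, 1))
    return 1.0 - np.tanh(x) ** 2

x = np.linspace(-6, 6, 13)              # the integers -6 .. 6
print(sigmoid(x), np.tanh(x))           # the two non-linear outputs
print(sigmoid_prime(x), tanh_prime(x))  # the gradients backpropagation multiplies by
```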
PERFORMANCE OF “sigmoid” AND “tanh” on MNIST DATASET
About the dataset: The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is one of the most used and basic datasets to start learning various techniques in Deep Learning.
The objective of using this dataset here is to analyze the performance of tanh and sigmoid on various networks.
Epoch: one epoch is completed when the whole training set has been passed through the network once.
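Throughout the experiments below, assume the data is loaded and prepared roughly as follows (a minimal Keras sketch; the post does not show the exact preprocessing used):

```python
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0  # flatten 28x28 images, scale to [0, 1]
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0
y_train = to_categorical(y_train, 10)  # one-hot labels for categorical cross entropy
y_test = to_categorical(y_test, 10)
```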
On a 6-Layered network with 4 hidden layers:
It is a 6-layer network with 4 fully connected hidden layers plus 1 input and 1 output layer. No Dropout or Batch Normalization is used in this network, and all other settings, such as the weight initializers, are left at their default values. A minimal sketch of such a network follows the settings below.
Optimizer: SGD
Loss: Categorical cross entropy (Multi-class log-loss)
Metric: Accuracy
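A minimal sketch of this network in Keras (the hidden-layer widths are illustrative assumptions; the post does not state the exact sizes):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def build_model(activation):
    # 4 fully connected hidden layers plus a 10-way softmax output, no Dropout or BatchNorm
    model = Sequential([
        Dense(512, activation=activation, input_shape=(784,)),
        Dense(256, activation=activation),
        Dense(128, activation=activation),
        Dense(64, activation=activation),
        Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="sgd",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model_sigmoid = build_model("sigmoid")
model_tanh = build_model("tanh")
```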
Plots of Loss function on this network (using tanh and Sigmoid as activation units)
Loss curves on the 6-layer network with tanh and sigmoid as activation units.
When the sigmoid activation function is used on this network, the loss does not start converging until around the 35th epoch, and it takes 100 epochs to reach a loss of 0.51, whereas the ideal loss would be 0.
Conversely, when tanh is used instead of sigmoid, the loss is reduced to 0.14 by the 20th epoch.
Plots of accuracy on this network (using tanh and Sigmoid as activation units)
Accuracy plots on the 6-layer network (with tanh and sigmoid as activation units).
Similar to the loss, accuracy hasn’t improved till the 35th epoch when the sigmoid is used as an activation function, moreover, it took 100 epochs to reach an accuracy of 85%.
But, when tanh is used as an activation unit instead of sigmoid, 95% accuracy is achieved by the end of the 20th epoch.
On a 4-Layered network with 2 hidden layers:
It is a 4-layer network with 2 fully connected hidden layers plus 1 input and 1 output layer. No Dropout or Batch Normalization is used in this network, and all parameters such as the weight initializers are left at their default values. A sketch of this smaller network, trained with both activations, follows the settings below.
Optimizer: SGD
Loss: Categorical cross entropy (Multi-class log-loss)
Metric: Accuracy
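A sketch of the smaller network, trained with both activations for 20 epochs (reusing the data from the loading sketch above; the widths and batch size are again illustrative assumptions):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def build_small_model(activation):
    # 2 fully connected hidden layers plus a 10-way softmax output
    model = Sequential([
        Dense(256, activation=activation, input_shape=(784,)),
        Dense(128, activation=activation),
        Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="sgd",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

histories = {}
for act in ("sigmoid", "tanh"):
    histories[act] = build_small_model(act).fit(
        x_train, y_train,
        validation_data=(x_test, y_test),
        epochs=20, batch_size=128, verbose=0)
# histories[act].history["loss"] / ["val_loss"] give the curves plotted below
```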
Plots of Loss function on this network (using tanh and Sigmoid as activation units)
Loss curves on the 4-layer network with tanh and sigmoid as activation units.
When sigmoid is used as the activation function on this network, the loss is reduced to 0.27 by the end of the 20th epoch.
When tanh is used instead of sigmoid, the loss is reduced to 0.10 by the 20th epoch. In fact, the ideal number of epochs here is around 10, because the graph shows the train and test losses slowly starting to diverge, i.e. we are over-training the network.
Plots of accuracy on this network (using tanh and Sigmoid as activation units)
Accuracy plots on the 4-layer network (with tanh and sigmoid as activation units).
Similar to the loss, the accuracy improves with every epoch, which is a good sign, but it improves slowly when sigmoid is used as the activation function, reaching 92% accuracy by the end of the 20th epoch.
But when tanh is used as the activation unit instead of sigmoid, 96.8% accuracy is reached by the end of the 20th epoch.
What is the main reason behind tanh's performance, and why does it converge faster?
Asymptote: a line that a curve approaches as it heads towards infinity.
tanh and sigmoid are both monotonically increasing functions that asymptote to finite values as +inf and -inf are approached. In fact, tanh belongs to the same broad family of S-shaped (sigmoidal) functions and is also known as the hyperbolic tangent function.
Both sigmoid and tanh are S-shaped curves; the only difference is that the sigmoid's output lies between 0 and 1, whereas tanh's output lies between -1 and 1.
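Written out, the two functions, their ranges, and the simple identity relating them (tanh is just a rescaled and shifted sigmoid) are:

```latex
\sigma(x) = \frac{1}{1 + e^{-x}} \in (0, 1), \qquad
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \in (-1, 1), \qquad
\tanh(x) = 2\,\sigma(2x) - 1
```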
Graph of tanh and sigmoid
Mean of sigmoid, tanh, and their derivatives over the set of integers [-6, 6]. The exact mean values below change if a different set of inputs is used, but the mean of tanh is always closer to zero than the mean of sigmoid.
These are the values of sigmoid and its derivative, as well as tanh and its derivative, for the integer values in the range [-6, 6].
The tanh function is symmetric about the origin, so its outputs (which become the inputs to the next layer) are effectively normalized: on average they are close to zero.
In other words, the data coming out of a tanh layer is centered around zero (centered around zero simply means that the mean of the data is approximately zero).
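A quick NumPy check of the table above (the exact numbers depend on the chosen inputs, here the integers -6 to 6):

```python
import numpy as np

x = np.arange(-6, 7)                     # the integers -6 .. 6
sig = 1.0 / (1.0 + np.exp(-x))
tan = np.tanh(x)

print("mean sigmoid :", sig.mean())                # 0.5 -> never centered at zero
print("mean tanh    :", tan.mean())                # 0.0 -> outputs centered around zero
print("mean sigmoid':", (sig * (1 - sig)).mean())  # mean of the sigmoid derivative
print("mean tanh'   :", (1 - tan ** 2).mean())     # mean of the tanh derivative
```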
These are the main reasons Why tanh is preferred and performs better than sigmoid (logistic).
Why should you normalize?
Assume all the inputs are positive. The weights into a particular node in the first weight layer are updated by an amount proportional to δx, where δ is the (scalar) error at that node and x is the input vector. When all components of the input vector are positive, all the updates of the weights feeding into that node have the same sign (the sign of δ). As a result, these weights can only all decrease or all increase together for a given input pattern. So if the weight vector has to change direction, it can only do so by zigzagging, which is inefficient and very slow. To prevent this, the inputs should be normalized so that their average is zero.
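A tiny numerical illustration of this point (a sketch with made-up numbers): the update for every weight feeding a node is proportional to δ·x, so with an all-positive x every component of the update shares the sign of δ.

```python
import numpy as np

x = np.array([0.2, 0.9, 0.4, 0.7])   # all-positive input vector (e.g. raw pixel intensities)
delta = -0.3                         # scalar error signal at the node

grad = delta * x                     # per-weight update direction, proportional to delta * x
print(grad)                          # every component is negative
print(np.sign(grad))                 # [-1. -1. -1. -1.] -> the weights can only move together
```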
This approach (normalization) should be applied at every layer of the network, which means the average output of each node should be close to zero, because these outputs are the inputs to the next layer.
Convergence is usually faster if the average of each input variable over the training set is close to zero.
The network training converges faster if its inputs are whitened — i.e., linearly transformed to have zero means and unit variances and decorrelated. As each layer observes the inputs produced by their previous layers, it would be advantageous to achieve the same whitening of the inputs of each layer.
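A minimal sketch of what whitening means here (zero mean, unit variance, decorrelated features), using plain NumPy PCA whitening on made-up data; in practice, per-feature standardization is the usual cheaper approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 1))
# correlated dummy "inputs" with a non-zero mean
X = 3.0 + base @ rng.normal(size=(1, 5)) + 0.5 * rng.normal(size=(1000, 5))

def whiten(X, eps=1e-5):
    # center, rotate onto the principal axes, and rescale each axis to unit variance
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return (Xc @ eigvecs) / np.sqrt(eigvals + eps)

Xw = whiten(X)
print(Xw.mean(axis=0).round(3))           # ~ all zeros (zero mean)
print(np.cov(Xw, rowvar=False).round(3))  # ~ identity matrix (unit variance, decorrelated)
```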
Getting stuck during training (train time):
Not converging at train time.
- The logistic sigmoid can cause a neural network to get "stuck" during training. When a strongly negative input is fed to the logistic sigmoid, its output is very close to zero. Since these outputs become the inputs to the next layer, the corresponding weight updates computed by backpropagation are very small, so the weights change slowly and less regularly.
- In contrast, tanh has outputs in the range (-1, 1), so strongly negative inputs map to outputs near -1, and only inputs close to zero map to near-zero outputs. These properties make the network less likely to get "stuck" during training (see the quick check below).
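Numerically (a quick check): sigmoid squashes strongly negative inputs to outputs near zero, while tanh maps them to outputs near -1, and only inputs near zero give near-zero outputs.

```python
import numpy as np

x = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
print(1 / (1 + np.exp(-x)))   # sigmoid: ~[0.002, 0.119, 0.5, 0.881, 0.998]
print(np.tanh(x))             # tanh:    ~[-1.0, -0.964, 0.0, 0.964, 1.0]
```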
The problems of tanh and the advantages of the logistic sigmoid:
- With tanh, the error surface can be very flat near the origin, so initializing the weights with very small values should be avoided.
- Error surfaces can also be flat far from the origin because of the saturation of sigmoids (saturation simply means the output cannot go beyond its limits; for instance, the logistic sigmoid can never output a value above 1 or below 0).
Both of these are problems of the tanh and sigmoid functions.
- The logistic sigmoid has a beautiful probabilistic interpretation, which made it more popular: rather than classifying a hard 0 or 1, it can give the probability that a particular data point belongs to class 0 or 1 (a small sketch follows below).
- The tanh function lacks such an interpretation.
This is the main advantage of the logistic sigmoid.
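For example (a minimal, hypothetical sketch, not the networks above), a sigmoid output can be read directly as P(y = 1 | x) for a binary classifier, which is exactly how logistic regression works:

```python
import numpy as np

def predict_proba(x, w, b):
    # logistic-regression-style score: the sigmoid output is read as P(y = 1 | x)
    return 1 / (1 + np.exp(-(np.dot(w, x) + b)))

p = predict_proba(x=np.array([1.2, -0.7]), w=np.array([0.8, 0.5]), b=-0.1)
print(p)               # ~0.62 -> "about a 62% chance this point belongs to class 1"
label = int(p >= 0.5)  # threshold at 0.5 for a hard 0/1 decision
```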
Popular concepts that aren't discussed here but are helpful in the scenarios above:
- Batch Normalization is an amazing technique that normalizes the outputs of each layer before passing them to the next layer as inputs. It is one of the best solutions to the problem we saw above (a sketch follows this list).
- Dropout should be used when training deep networks in order to control over-fitting, and weight initialization is also very important before training a neural network.
- There are many hyperparameters to consider before training any neural network; much of the work is about finding the best combination of them (number of layers, optimizer, activation function, weight initializer, etc.).
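As a rough illustration of the first two points (a sketch only; the layer sizes, dropout rate, and choice of tanh are illustrative assumptions, not the configuration used in the experiments above), Batch Normalization and Dropout could be added to the same kind of Keras model like this:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Dropout

model = Sequential([
    Dense(256, activation="tanh", input_shape=(784,)),
    BatchNormalization(),   # re-centers and re-scales the layer's outputs before the next layer
    Dropout(0.3),           # randomly drops 30% of the units during training to fight over-fitting
    Dense(128, activation="tanh"),
    BatchNormalization(),
    Dropout(0.3),
    Dense(10, activation="softmax"),
])
model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
```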
Making neural networks work well, or sometimes making them work at all, seems like more of an art than a science.
Conclusion:
- tanh and the logistic sigmoid were the most popular activation functions in the '90s, but because of the vanishing gradient problem (and sometimes the exploding gradient problem, caused by the weights), they are rarely used now.
- These days the ReLU activation function is widely used. Even though it can run into vanishing-gradient-like problems of its own (dying ReLUs, where negative inputs give zero gradient), variants of ReLU help solve such cases.
- tanh is preferred over sigmoid for faster convergence, but this can change depending on the data; the data also plays an important role in deciding which activation function to choose.
References:
- Yann LeCun, Léon Bottou, Genevieve Orr, Klaus-Robert Müller, "Efficient BackProp": http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf
- The Clever Machine, "Derivation: Derivatives for Common Neural Network Activation Functions": https://theclevermachine.wordpress.com/2014/09/08/derivation-derivatives-for-common-neural-network-activation-functions/