Sigmoid Activation Function - InsideAIML
1) When the input moves away from the origin, the gradient of the sigmoid
function becomes very small, approaching zero. During neural network
backpropagation, the chain rule of differentiation is used to calculate the
gradient of each weight w. When backpropagation passes through a sigmoid
function, the derivative contributed at that step of the chain is very small.
Moreover, the signal may pass through many sigmoid functions in succession, so
these small factors multiply together and the weight w ends up having almost
no effect on the loss function, which is not conducive to optimizing the
weights. This problem is called gradient saturation or gradient vanishing
(sometimes referred to as gradient dispersion).
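The sketch below is a minimal NumPy illustration of this point (not code from the article); the function names sigmoid and sigmoid_grad are chosen here for clarity. It shows that the sigmoid derivative peaks at 0.25 at the origin and decays rapidly as the input grows, and that chaining many such factors during backpropagation shrinks the gradient toward zero.

import numpy as np

def sigmoid(x):
    # Squash any real input into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x)); its maximum is 0.25 at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient is largest at the origin and shrinks quickly as |x| grows.
for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}   sigmoid'(x) = {sigmoid_grad(x):.6f}")

# Backpropagating through n sigmoid layers multiplies n such factors,
# so the gradient reaching early layers is bounded above by 0.25 ** n.
n_layers = 10
print("upper bound on chained gradient:", 0.25 ** n_layers)

Running this prints derivatives near zero for inputs of magnitude 5 or more, and an upper bound of about 1e-6 for ten chained layers, which is why early-layer weights barely move during training when sigmoids saturate.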