Introduction to FeedForward Neural Networks

Deep feedforward networks, also known as multilayer perceptrons (MLPs), are the foundation of most deep learning models. CNNs are a specialized kind of feedforward network, and RNNs extend feedforward networks with feedback connections. These networks are mostly used for supervised machine learning tasks, where we already know the target values, i.e. the results we want the network to produce. They are extremely important in machine learning practice and form the basis of many commercial applications; areas such as computer vision and NLP have been greatly affected by them.

The main goal of a feedforward network is to approximate some function f*. For example, a regression function y = f*(x) maps an input x to a value y. A feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that result in the best function approximation.

These networks are called feedforward because information flows in the forward direction only: x is used to compute the intermediate functions in the hidden layers, which in turn are used to compute y. If we add feedback connections, so that outputs of a layer are fed back into the model, the result is a recurrent neural network.

These networks are represented by a composition of many different functions. Each model is associated with a directed acyclic graph describing how the functions are composed together. For example, we might have three functions f(1), f(2), and f(3) connected in a chain to form f(x) = f(3)(f(2)(f(1)(x))). Here f(1) is the first layer, f(2) is the second layer, and f(3) is the output layer.
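
The chain structure can be sketched in a few lines of Python. This is only an illustration: the layer functions f1, f2, f3 and their weights are made up for the example, and each layer is written by hand rather than learned.

```python
def f1(x):
    # First layer: an affine map (2*x + 1) followed by a ReLU non-linearity.
    return [max(0.0, 2.0 * xi + 1.0) for xi in x]


def f2(h):
    # Second layer: shift by -1, then ReLU again.
    return [max(0.0, hi - 1.0) for hi in h]


def f3(h):
    # Output layer: a linear unit that sums the hidden activations.
    return sum(h)


def f(x):
    # The whole network is the composition f3(f2(f1(x))).
    return f3(f2(f1(x)))


y = f([1.0, -2.0])
```

Information only ever moves from f1 to f2 to f3, which is exactly what makes the network feedforward.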

The layers between the input layer and the output layer are known as hidden layers, because the training data does not show the desired output for these layers. A network can contain any number of hidden layers with any number of hidden units. A unit resembles a neuron: it takes input from the units of the previous layer and computes its own activation value.

Why do we need feedforward networks when we already have linear machine learning models? Linear models are limited to linear functions, whereas neural networks are not. When the data is not linearly separable, linear models struggle to approximate the target function, while neural networks can handle it. The hidden layers introduce non-linearity and change the representation of the data so that the model can fit the function and generalize better.
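
A classic case of data that is not linearly separable is XOR. The sketch below uses a network with one hidden layer of two ReLU units and hand-picked weights (assumed here for illustration, not learned by training) to show that a hidden layer is enough to represent it, which no purely linear model can do.

```python
def relu(z):
    return max(0.0, z)


def xor_net(x1, x2):
    # Hidden layer: two ReLU units that both see x1 + x2, with biases 0 and -1.
    h1 = relu(x1 + x2)
    h2 = relu(x1 + x2 - 1.0)
    # Linear output unit with weights 1 and -2 and no bias.
    return h1 - 2.0 * h2


inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [xor_net(a, b) for a, b in inputs]
```

The hidden layer maps the inputs (0, 1) and (1, 0) to the same point, so in the new representation a linear output unit is sufficient.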

Designing a feedforward neural network requires a few decisions. Most networks need the same ingredients, some of which are shared with other machine learning algorithms.

Optimizer

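An optimizer, or optimization algorithm, is used to minimize the cost function. It updates the values of the weights and biases after every training cycle (a batch or an epoch, depending on the algorithm) until the cost function stops improving, ideally near a minimum. For non-convex cost functions, reaching the global optimum is not guaranteed.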

Optimization algorithms are of two types:

First Order Optimization Algorithms

These algorithms minimize or maximize a cost function using its gradient with respect to the parameters. The first-order derivative tells us whether the function is decreasing or increasing at a particular point; in short, it gives the line that is tangent to the error surface at that point.
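
As a minimal sketch of a first-order method, here is plain gradient descent on a toy one-parameter cost C(w) = (w - 3)^2. The cost, the learning rate of 0.1, and the 100 steps are all assumptions chosen to keep the example small.

```python
def cost(w):
    # Toy cost with its minimum at w = 3.
    return (w - 3.0) ** 2


def grad(w):
    # First derivative of the cost with respect to w.
    return 2.0 * (w - 3.0)


w = 0.0             # initial parameter value
learning_rate = 0.1

for step in range(100):
    # Move against the gradient: downhill on the cost surface.
    w -= learning_rate * grad(w)
```

Each step only uses the slope at the current point, which is all a first-order method has access to.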

Second Order Optimization Algorithms

These algorithms use second-order derivatives to minimize the cost function; the matrix of second-order derivatives is called the Hessian. Since the second derivative is costly to compute, second-order methods are not used much. The second-order derivative tells us whether the first derivative is increasing or decreasing, which indicates the function's curvature. It provides a quadratic surface that matches the curvature of the error surface.
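
Newton's method is one well-known second-order algorithm. The sketch below applies it to a toy one-parameter cost C(w) = exp(w) - 2w, chosen here only because its derivatives are easy to write down; in one dimension the Hessian reduces to the ordinary second derivative.

```python
import math


def d_cost(w):
    # First derivative of C(w) = exp(w) - 2w.
    return math.exp(w) - 2.0


def d2_cost(w):
    # Second derivative (the 1-D "Hessian"); always positive here.
    return math.exp(w)


w = 0.0
for step in range(10):
    # Newton update: rescale the gradient by the inverse curvature.
    w -= d_cost(w) / d2_cost(w)
```

The minimum is at w = ln 2, and the curvature information lets the method get there in a handful of steps. The price is computing (and, in many dimensions, inverting) the Hessian.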

Many algorithms are used for optimization, for example stochastic gradient descent.

The Architecture of the Network

The architecture of a network refers to its structure, i.e. the number of hidden layers and the number of hidden units in each layer. According to the universal approximation theorem, a feedforward network with a linear output layer and at least one hidden layer with any “squashing” activation function can approximate any Borel measurable function from one finite-dimensional space to another with any desired non-zero amount of error, provided that the network is given enough hidden units. Put simply, no matter what function we are trying to learn, there is a sufficiently large MLP that can represent it.

We now know that some MLP can solve our problem, but there is no general method to determine its architecture. No one can say in advance that n layers with M hidden units each will solve a given problem. Finding such a configuration without trial and error is still an active area of research; for now it is done by trial and error.

Finding the right architecture is hard because we might have to try many different configurations. Even with the correct MLP architecture, training might still fail to learn the target function, for two reasons. First, the optimization algorithm might not find the parameter values that correspond to the desired function. Second, the training algorithm might choose the wrong function because of overfitting.

Cost function

The cost function shows, at any point during training, the difference between the approximation made by our model and the actual target values we are trying to reach. It is always a single value, because its job is to evaluate how well the network performs as a whole. Like other machine learning algorithms, feedforward networks are trained with gradient-based learning, in which an algorithm such as stochastic gradient descent is used to minimize the cost function.

The whole training process depends heavily on the choice of cost function, and this choice is more or less the same as for other parametric models.

In cases where our parametric model defines a distribution p(y | x; θ), we simply use the cross-entropy between the training data and the model’s predictions as the cost function. Alternatively, we can predict some statistic of y conditioned on x rather than a complete probability distribution over y.

For using a function as a cost function with the backpropagation algorithm, it must satisfy two properties:

The cost function must be expressible as an average over the costs of individual training examples.

The cost function must not depend on any activation values of the network other than those of the output layer.

A cost function mostly has the form C(W, B, Sr, Er), where W is the weights of the neural network, B is the biases of the network, Sr is the input of a single training sample, and Er is the desired output of that training sample.

Some possible cost functions are:

Quadratic cost

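C = (1/2) Σ_j (a_j − E_j)², where a_j is the activation of the j-th output unit and E_j is the j-th component of the desired output Er.

This function is also known as the mean squared error or sum squared error; minimizing it corresponds to maximum likelihood under a Gaussian noise model.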

Cross-entropy cost

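C = −Σ_j [E_j ln(a_j) + (1 − E_j) ln(1 − a_j)], where a_j is the activation of the j-th output unit and E_j is the j-th component of the desired output Er.

This function is also known as the Bernoulli negative log-likelihood and binary cross-entropy.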

Exponential cost

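C = τ exp((1/τ) Σ_j (a_j − E_j)²), where τ is a tunable parameter, a_j is the activation of the j-th output unit, and E_j is the j-th component of the desired output Er.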

Hellinger distance

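C = (1/√2) Σ_j (√a_j − √E_j)², where a_j is the activation of the j-th output unit and E_j is the j-th component of the desired output Er; both must be non-negative.

This function is also referred to as a statistical distance.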
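
The four cost functions can be sketched for a single training sample as follows. The function names, the argument names a (output activations) and e (desired outputs), and the default tau=1.0 are assumptions made for this example; the formulas follow the common textbook forms of each cost.

```python
import math


def quadratic_cost(a, e):
    # Half the sum of squared differences between outputs and targets.
    return 0.5 * sum((aj - ej) ** 2 for aj, ej in zip(a, e))


def cross_entropy_cost(a, e):
    # Bernoulli negative log-likelihood; requires 0 < a_j < 1.
    return -sum(ej * math.log(aj) + (1.0 - ej) * math.log(1.0 - aj)
                for aj, ej in zip(a, e))


def exponential_cost(a, e, tau=1.0):
    # tau is a tunable parameter that scales the squared error.
    return tau * math.exp(sum((aj - ej) ** 2 for aj, ej in zip(a, e)) / tau)


def hellinger_cost(a, e):
    # Requires non-negative activations and targets.
    return sum((math.sqrt(aj) - math.sqrt(ej)) ** 2
               for aj, ej in zip(a, e)) / math.sqrt(2.0)


a = [0.8, 0.2]   # activations of the output layer
e = [1.0, 0.0]   # desired output for this sample
costs = {
    "quadratic": quadratic_cost(a, e),
    "cross_entropy": cross_entropy_cost(a, e),
    "exponential": exponential_cost(a, e),
    "hellinger": hellinger_cost(a, e),
}
```

Note that every function reads only the output-layer activations, in line with the second property listed above; averaging these per-sample values over the training set gives the overall cost.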

Output units

Output units are the units in the output layer; their job is to give us the desired output or prediction, and thus complete the task the neural network must perform. The choice of output units is tightly coupled with the choice of cost function. Any unit that can be used as an output unit can also be used as a hidden unit.

The choices of output units are:

Linear units

The simplest kind of output units are linear output units, which are used for Gaussian output distributions. These units are based on an affine transformation that adds no nonlinearity to the output layer. Given features h, a layer of linear output units produces a vector:

ŷ = W^T h + b

For linear output layers, maximizing the log-likelihood is equivalent to minimizing the mean squared error. The maximum likelihood framework also makes it straightforward to learn the covariance of the Gaussian distribution.
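
The equivalence between log-likelihood and squared error can be seen in a short derivation. It assumes that the linear output ŷ is the mean of a Gaussian with identity covariance and that y has n dimensions; the identity covariance is a simplifying assumption made for this sketch.

```latex
\begin{aligned}
-\log p(y \mid x)
  &= -\log \mathcal{N}(y;\ \hat{y},\ I) \\
  &= \tfrac{1}{2}\,\lVert y - \hat{y} \rVert^{2} + \tfrac{n}{2}\log(2\pi)
\end{aligned}
```

The second term does not depend on the parameters, so minimizing the negative log-likelihood over the training set is the same as minimizing the squared error.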

The advantage of linear units is that they do not saturate, i.e. their gradient remains constant and never approaches zero, so these units pose no difficulty for gradient-based optimization algorithms.

Sigmoid units

ŷ = σ(w^T h + b), where σ is the logistic sigmoid function.

For solving a binary classification problem, we combine sigmoid output units with maximum likelihood. A sigmoid output unit has two components: first it uses a linear layer to compute z = w^T h + b, and then it uses the sigmoid activation function to convert z into a probability. When other loss functions are used, such as mean squared error, the loss can saturate whenever the sigmoid saturates, i.e. the gradients can shrink too small to be useful for learning. For this reason, maximum likelihood is preferred.
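
The saturation problem is easy to check numerically. The sketch below compares the gradient with respect to z of the two losses for a badly wrong prediction (z = -10 while the target is 1). The helper names grad_nll and grad_mse are made up for this example, and the squared error is written with a factor of 1/2.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_nll(z, y):
    # d/dz of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(z).
    return sigmoid(z) - y


def grad_mse(z, y):
    # d/dz of 0.5*(s - y)^2; the extra factor s*(1-s) vanishes
    # when the sigmoid saturates.
    s = sigmoid(z)
    return (s - y) * s * (1.0 - s)


z, y = -10.0, 1.0
g_nll = grad_nll(z, y)   # close to -1: a strong learning signal
g_mse = grad_mse(z, y)   # tiny: almost no learning signal
```

With maximum likelihood the gradient stays large exactly when the unit is confidently wrong, while with mean squared error it nearly disappears.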

Softmax Units

Softmax units are used for multinoulli output distributions, i.e. a probability distribution over a discrete variable with n possible values. This can be seen as a generalization of the sigmoid function, which represents a probability distribution over a binary variable. The softmax function is defined by:

softmax(z)_i = exp(z_i) / Σ_j exp(z_j)

Like the sigmoid function, the softmax function can also saturate, i.e. the gradients can shrink too small to be useful for learning. Because softmax has multiple output units, it saturates when the differences between input values become extreme.

These units follow a winner-take-all principle: the total probability is always 1, so as the output of one unit gets closer to 1, the outputs of the other units must approach 0.
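
A minimal softmax sketch in plain Python is shown below. Subtracting the maximum input before exponentiating is a common numerical-stability trick added here as an implementation detail; it does not change the result, because softmax only depends on differences between inputs.

```python
import math


def softmax(z):
    # Shift by the maximum so that exp() never overflows.
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [ex / total for ex in exps]


p = softmax([1.0, 2.0, 3.0])        # a proper distribution over 3 values
saturated = softmax([1000.0, 0.0])  # extreme difference between inputs
```

The second call illustrates both saturation and winner-take-all behaviour: one output is pushed to 1 and the other to 0.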

Hidden Units

Selecting the type of hidden unit is also an active area of research, and no particular unit is guaranteed to outperform all others on every problem. Still, some units are good default choices. For example, the rectified linear unit, popularly known as ReLU, is the most commonly used, a choice motivated more by intuition than by experimental guarantees. In reality, it is usually impossible to predict in advance which unit will work best. Selecting a hidden unit therefore involves trial and error: intuiting that a kind of hidden unit may work well, and then testing it.

Possible choices for hidden units are:

Rectified linear units

These units use the activation function g(z) = max{0, z}.

ReLUs are easy to optimize because they are similar to linear units; the only difference is that a ReLU outputs 0 across half of its domain. The reason they are so popular is that their gradient stays large and constant whenever the unit is active. The gradient direction is therefore far more useful for learning than it would be with activation functions that introduce second-order effects.

A drawback of ReLUs is that they cannot learn via gradient-based methods on examples for which their activation is zero.

There are several generalizations of ReLU:

Absolute value rectification

Leaky ReLU

Parametric ReLU
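
The three variants listed above can be written with one formula, g(z) = max(0, z) + α min(0, z), where only the slope α for negative inputs differs. The sketch below assumes this common formulation; the function names and the leaky slope of 0.01 are choices made for the example.

```python
def generalized_relu(z, alpha):
    # Positive inputs pass through; negative inputs are scaled by alpha.
    return max(0.0, z) + alpha * min(0.0, z)


def relu(z):
    # alpha = 0: negative inputs are cut to zero.
    return generalized_relu(z, 0.0)


def absolute_value_rectification(z):
    # alpha = -1: the unit computes |z|.
    return generalized_relu(z, -1.0)


def leaky_relu(z, alpha=0.01):
    # Small fixed slope keeps a non-zero gradient for negative inputs.
    return generalized_relu(z, alpha)


def parametric_relu(z, alpha):
    # Same formula, but alpha is treated as a parameter learned in training.
    return generalized_relu(z, alpha)
```

Because the slope for negative inputs is non-zero in these variants, they still receive a gradient where a plain ReLU would be inactive.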

Maxout units

Instead of applying an element-wise function g(z), maxout units divide z into groups of k values. Each maxout unit then outputs the maximum element of one of these groups. Maxout units are regarded as a strong generalization of ReLU. Because each unit is driven by multiple filters, maxout units have some redundancy that helps them resist catastrophic forgetting, in which a neural network forgets how to perform tasks it was previously trained on.
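
The grouping step can be sketched as follows. The function name maxout and the choice to group consecutive entries of z are assumptions for this illustration; z stands for the pre-activation values already computed by the layer.

```python
def maxout(z, k):
    # Split z into consecutive groups of k values and keep each group's maximum.
    if len(z) % k != 0:
        raise ValueError("len(z) must be a multiple of k")
    return [max(z[i:i + k]) for i in range(0, len(z), k)]


z = [1.0, -2.0, 0.5, 3.0, -1.0, -4.0]
h = maxout(z, k=2)   # three maxout units, each fed by two filters
```

With k = 2 the six pre-activations produce three outputs, so each unit's value comes from whichever of its filters responds most strongly.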

Logistic sigmoid and Hyperbolic tangent

Logistic sigmoid is given by:

σ(z) = 1 / (1 + exp(−z))

Hyperbolic tangent is given by:

tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z))

These units are closely related:

tanh(z) = 2σ(2z) − 1

Before ReLU, these were the most popular choices for hidden units, but their use is now discouraged because they saturate to a high value when z is very positive, saturate to a low value when z is very negative, and are strongly sensitive to their input only when z is near 0. This widespread saturation of sigmoidal units can make gradient-based learning very difficult.
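
Both the relation between the two functions and the saturation behaviour can be checked numerically. This is a small sketch using only the standard library; the helper name sigmoid_grad is made up for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    # Derivative of the logistic sigmoid: s * (1 - s).
    s = sigmoid(z)
    return s * (1.0 - s)


# tanh(z) = 2*sigmoid(2z) - 1 should hold for any z.
max_gap = max(abs(math.tanh(z) - (2.0 * sigmoid(2.0 * z) - 1.0))
              for z in (-3.0, 0.0, 2.5))

grad_near_zero = sigmoid_grad(0.0)    # largest possible slope
grad_saturated = sigmoid_grad(10.0)   # nearly flat
```

The slope is at its maximum of 0.25 at z = 0 and is already tiny at z = 10, which is why deep stacks of sigmoidal units pass back very small gradients.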