
Deep Learning — Deep Belief Network (DBN)

In this post we will explore the features of a Deep Belief Network (DBN), its architecture, how a DBN is trained, and its uses.

What is Deep Belief Network?

  • A DBN is an unsupervised, probabilistic deep learning algorithm.
  • A DBN is composed of multiple layers of stochastic latent variables. The latent variables are binary and are also called feature detectors or hidden units.
  • A DBN is a hybrid generative graphical model. The top two layers are undirected; the lower layers have directed connections from the layers above.

Architecture of DBN

Deep Belief Network

  • It is a stack of Restricted Boltzmann Machines (RBMs) or autoencoders.
  • The top two layers of a DBN have undirected, symmetric connections between them and form an associative memory.
  • The connections between all lower layers are directed, with the arrows pointing toward the layer closest to the data. The lower layers have directed acyclic connections that convert the associative memory into observed variables. The lowest layer, the visible units, receives the input data, which can be binary or real-valued. (The joint distribution written out after this list makes the structure precise.)
  • There are no intra-layer connections, just like in an RBM.
  • The hidden units represent features that capture the correlations present in the data.
  • Two adjacent layers are connected by a matrix of symmetric weights W.
  • Every unit in each layer is connected to every unit in each neighboring layer.
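
The architecture figure is not reproduced here. As a point of reference (this formula is an addition, not part of the original post), the joint distribution that a DBN with hidden layers h^1, …, h^ℓ defines, following Hinton, Osindero and Teh (2006, see References), can be written as:

```latex
P(v, h^1, \dots, h^{\ell}) \;=\; P(h^{\ell-1}, h^{\ell})
\Bigg( \prod_{k=1}^{\ell-2} P(h^{k} \mid h^{k+1}) \Bigg) P(v \mid h^{1})
```

Here P(h^{ℓ-1}, h^{ℓ}) is the undirected RBM formed by the top two layers (the associative memory), and each conditional P(h^{k} | h^{k+1}) and P(v | h^{1}) is a directed, sigmoid layer pointing toward the data.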

How does DBN work?

  • DBNs are pre-trained using a greedy learning algorithm. The greedy learning algorithm uses a layer-by-layer approach for learning the top-down, generative weights. These generative weights determine how variables in one layer depend on the variables in the layer above.
  • In DBN we run several steps of Gibbs sampling on the top two hidden layers. This stage is essentially drawing a sample from the RBM defined by the top two hidden layers.
  • Then use a single pass of ancestral sampling through the rest of the model to draw a sample from the visible units.
  • After learning, the values of the latent variables in every layer can be inferred by a single, bottom-up pass. Greedy pre-training starts with an observed data vector in the bottom layer; the generative weights are then used in the reverse direction and are refined later during fine tuning. (A sketch of the sampling procedure from the two bullets above follows this list.)
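
As a rough illustration of those two sampling steps, here is a minimal NumPy sketch (not code from the post; the function names and the way the weights are stored are assumptions). It runs Gibbs sampling in the top-level RBM and then makes a single top-down ancestral pass to produce a sample of the visible units:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p, rng):
    # Draw binary samples with the given Bernoulli probabilities.
    return (rng.random(p.shape) < p).astype(float)

def sample_from_dbn(weights, biases, top_rbm, n_gibbs=200, rng=None):
    """Draw one sample from a DBN.

    weights[k], biases[k] : directed generative parameters from layer k+1 down to
                            layer k (weights[0], biases[0] generate the visible layer).
    top_rbm               : (W_top, top_bias, penultimate_bias) of the undirected top RBM.
    """
    rng = rng or np.random.default_rng()
    W_top, b_top, b_pen = top_rbm

    # 1) Gibbs sampling in the top-level RBM (the associative memory).
    h_top = sample_bernoulli(np.full(W_top.shape[1], 0.5), rng)
    for _ in range(n_gibbs):
        pen = sample_bernoulli(sigmoid(W_top @ h_top + b_pen), rng)   # penultimate | top
        h_top = sample_bernoulli(sigmoid(W_top.T @ pen + b_top), rng) # top | penultimate

    # 2) A single ancestral (top-down) pass through the directed layers.
    state = pen
    for W, b in zip(reversed(weights), reversed(biases)):
        state = sample_bernoulli(sigmoid(W @ state + b), rng)
    return state  # a sample of the visible units
```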

Let’s understand this step by step

What is Greedy Layer-wise learning?

  • The greedy layer-wise training algorithm was proposed by Geoffrey Hinton; in it we train a DBN one layer at a time, in an unsupervised manner.
  • An easy way to learn anything complex is to divide the complex problem into easy, manageable chunks. We take a multi-layer DBN and divide it into simpler models (RBMs) that are learned sequentially. It is easier to train a shallow network than to train a deeper one.
  • The idea behind our greedy algorithm is to allow each model in the sequence to receive a different representation of the data.

How does the Greedy Layer-wise training algorithm work?

  • The first layer is trained on the training data greedily, while all other layers are frozen. We derive the individual activation probabilities for the first hidden layer. All the hidden units of the first hidden layer are updated in parallel. This is called the positive phase.

b1 and b2 are the biases associated with the hidden units
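
The equation in the original post is an image and is not reproduced here. For a binary RBM, the positive-phase activation probability it refers to takes the standard form:

```latex
p(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_i v_i \, w_{ij}\Big),
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}
```

where b_j is the bias of hidden unit j and w_{ij} is the weight connecting visible unit i to hidden unit j.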

  • We then reconstruct the visible units in the negative phase, using a technique similar to the positive phase.

a1, a2 and a3 are biases. Reconstructing visible unit from hidden unit
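
Again the original equation is an image; the standard form of the reconstruction it describes is:

```latex
p(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_j h_j \, w_{ij}\Big)
```

where a_i is the bias of visible unit i.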

  • The final step in greedy layer-wise learning is to update all the associated weights. L is the learning rate; we multiply it by the difference between the positive-phase and negative-phase values and add the result to the initial value of the weight.

Weight update for one of the weights. L is the Learning rate
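
Written out, the contrastive-divergence weight update the caption refers to has the standard form:

```latex
w_{ij} \leftarrow w_{ij} + L \left( \langle v_i h_j \rangle_{\text{positive}} - \langle v_i h_j \rangle_{\text{negative}} \right)
```

where the angle brackets denote averages computed from the data (positive phase) and from the reconstructions (negative phase).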

  • This process is repeated until the weight updates fall below a required threshold.
  • We then take the first hidden layer, which now acts as the input for the second hidden layer, and so on.
  • Each layer takes the output of the previous layer as its input and produces a new output. The output generated is a new representation of the data, one whose distribution is simpler.
  • The weights for the second RBM are initialized as the transpose of the weights for the first RBM.
  • We again use the Contrastive Divergence method using Gibbs sampling just like we did for the first RBM.

Training the next RBM: the hidden units (the output of RBM1) become the input for RBM2

  • We calculate the positive phase, negative phase and update all the associated weights.
  • This process is repeated until the weight updates fall below the required threshold.

a and b are biases associated with the nodes

  • We can again add another RBM and calculate the contrastive divergence using Gibbs sampling. (A minimal code sketch of the whole layer-wise procedure follows.)
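
Below is a minimal NumPy sketch of greedy layer-wise pre-training with CD-1 (a single step of Gibbs sampling). It is an illustration only: the layer sizes, learning rate, and helper names are assumptions, not code from the article, and it does not reuse the transpose initialization mentioned above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def train_rbm(data, n_hidden, lr=0.05, epochs=10, rng=None):
    """Train one RBM with CD-1 and return (W, visible_bias, hidden_bias)."""
    rng = rng or np.random.default_rng(0)
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    a = np.zeros(n_visible)   # visible biases
    b = np.zeros(n_hidden)    # hidden biases

    for _ in range(epochs):
        for v0 in data:
            # Positive phase: hidden probabilities given the data, updated in parallel.
            ph0 = sigmoid(v0 @ W + b)
            h0 = sample(ph0, rng)
            # Negative phase: reconstruct the visible units, then the hidden units.
            pv1 = sigmoid(h0 @ W.T + a)
            ph1 = sigmoid(pv1 @ W + b)
            # Contrastive-divergence updates (positive minus negative statistics).
            W += lr * (np.outer(v0, ph0) - np.outer(pv1, ph1))
            a += lr * (v0 - pv1)
            b += lr * (ph0 - ph1)
    return W, a, b

def pretrain_dbn(data, layer_sizes, **kwargs):
    """Greedy layer-wise pre-training: each RBM is trained on the
    hidden activations produced by the RBM below it."""
    rbms, x = [], data
    for n_hidden in layer_sizes:
        W, a, b = train_rbm(x, n_hidden, **kwargs)
        rbms.append((W, a, b))
        x = sigmoid(x @ W + b)  # simpler representation fed to the next RBM
    return rbms

# Example usage on random binary data with two hidden layers.
rng = np.random.default_rng(0)
toy = (rng.random((100, 20)) < 0.3).astype(float)
dbn = pretrain_dbn(toy, layer_sizes=[16, 8], epochs=5)
```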

Why does a DBN use Greedy Layer-wise learning for pre-training?

  • Pre training helps in optimization by better initializing the weights of all the layers.
  • Greedy learning algorithm is fast, efficient and learns one layer at a time.
  • It trains layers sequentially, starting from the bottom layer.
  • Each layer learns a higher-level representation of the data from the layer below it.

Why do we need fine tuning?

Greedy layer-wise pre-training identifies the feature detectors.

Fine tuning modifies the features slightly to get the category boundaries right.

Adding fine tuning helps the network discriminate between different classes better. Adjusting the weights during the fine-tuning process gives them optimal values, which helps increase the accuracy of the model.

How can we achieve fine tuning?

Fine tuning can be achieved by

  • Wake Sleep algorithm
  • Back propagation

We will discuss back propagation here

Fine Tuning using Back Propagation

Back propagation works better after greedy layer-wise training. We do not start back propagation until we have identified sensible feature detectors that will be useful for the discrimination task.

The objective of fine tuning is not to discover new features. The objective is to improve the accuracy of the model by finding the optimal values of the weights between layers.

Once we have identified the sensible feature detectors, back propagation only needs to perform a local search.

Unlabelled data helps discover good features. We may also get features that are not very helpful for the discriminative task, but that is not an issue; we still get useful features from the raw input.

Input vectors generally contain a lot more information than the labels. The precious information in the labels is used only for fine tuning.

A labelled dataset helps associate patterns and features with the dataset. A small labelled dataset is used for fine tuning with back propagation. (A sketch of this step follows.)
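
As a rough illustration (not the author's code), here is a minimal NumPy sketch of that step: the hidden layers of a classifier are initialized from pre-trained RBM weights, a softmax output layer is added on top, and ordinary back propagation adjusts all weights using a small labelled set. The function names and hyper-parameters are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fine_tune(rbm_weights, rbm_hidden_biases, X, y, n_classes,
              lr=0.1, epochs=20, rng=None):
    """Fine-tune a classifier whose hidden layers start from pre-trained RBM weights."""
    rng = rng or np.random.default_rng(0)
    Ws = [W.copy() for W in rbm_weights]          # recognition (bottom-up) weights
    bs = [b.copy() for b in rbm_hidden_biases]
    W_out = 0.01 * rng.standard_normal((Ws[-1].shape[1], n_classes))
    b_out = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                      # one-hot targets

    for _ in range(epochs):
        # Forward pass through the pre-trained layers.
        activations = [X]
        for W, b in zip(Ws, bs):
            activations.append(sigmoid(activations[-1] @ W + b))
        probs = softmax(activations[-1] @ W_out + b_out)

        # Backward pass: cross-entropy gradient at the output, then layer by layer.
        delta = (probs - Y) / len(X)
        grad_W_out = activations[-1].T @ delta
        grad_b_out = delta.sum(axis=0)
        delta = (delta @ W_out.T) * activations[-1] * (1 - activations[-1])
        W_out -= lr * grad_W_out
        b_out -= lr * grad_b_out
        for k in range(len(Ws) - 1, -1, -1):
            grad_W = activations[k].T @ delta
            grad_b = delta.sum(axis=0)
            if k > 0:
                delta = (delta @ Ws[k].T) * activations[k] * (1 - activations[k])
            Ws[k] -= lr * grad_W
            bs[k] -= lr * grad_b
    return Ws, bs, W_out, b_out
```

Here rbm_weights and rbm_hidden_biases would typically be the W and b matrices produced by greedy pre-training; only the small labelled set (X, y) is needed at this stage.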

Advantages of backpropagation for fine tuning

  • Back Propagation fine tunes the model to be better at discrimination
  • Overcomes many limitations of standard backward propagation.
  • Makes it easier to learn a deep network
  • Makes network generalize better

How do we apply Fine Tuning Process?

  • Apply a stochastic bottom up pass and adjust the top down weights.
  • When we reach the top, we apply recursion to the top-level layer; these are the top two layers of the DBN, which are undirected. The top layer is our output.
  • To fine tune further we do a stochastic top down pass and adjust the bottom up weights.

Usage of DBN

  • Image recognition
  • Video sequences
  • Motion capture data
  • Speech recognition

References:

http://www.cs.toronto.edu/~hinton/absps/fastnc.pdf

http://www.scholarpedia.org/article/Deep_belief_networks

https://www.cs.toronto.edu/~hinton/nipstutorial/nipstut3.pdf

Share it and Clap if you liked the article!