4 General Fully Connected Neural Networks

Learning outcomes from this chapter

4.1 The full neural network

4.1.1 The architecture of neural networks

The ideal network architecture for a task must be found via experimentation guided by monitoring the validation set error.

Let’s consider the following architecture.

  • The leftmost layer is called the input layer, and the neurons within it are called input neurons. A neuron of this layer is of a special kind: it has no input and only outputs \(x_j\), the value of the \(j\)th feature.

  • The rightmost, or output, layer contains the output neurons (just one here). The middle layer is called a hidden layer, since the neurons in this layer are neither inputs nor outputs. We do not (directly) observe what comes out of this layer.

Somewhat confusingly, and for historical reasons, such multiple layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons.

The design of the input and output layers in a network is often straightforward: as many neurons in the input layer as there are explanatory variables/features; as many neurons in the output layer as there are possible values for the response variable (if it is qualitative).

But there can be quite an art to the design of the hidden layers. Rectified linear units are an excellent default choice of hidden unit; they are easy to optimize because they are so similar to linear units.
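For concreteness, here is a minimal NumPy sketch of these two kinds of activation functions (the function names `relu` and `sigmoid` are ours, chosen for illustration):

```python
import numpy as np

def relu(z):
    # Rectified linear unit: max(0, z), applied element-wise.
    return np.maximum(0.0, z)

def sigmoid(z):
    # Logistic sigmoid: squashes each element into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))
```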

When the output from one layer is used as input to the next layer (with no loops), we speak of feedforward neural networks. Models with feedback loops are called recurrent neural networks.

Deep feedforward networks, also often called feedforward neural networks, or multilayer perceptrons (MLPs), are the quintessential deep learning models.

4.1.2 Goals

  • The goal of a feedforward network is to approximate some function \(f^*\). A feedforward network defines a mapping \(y = f(x;\theta)\) and learns the value of the parameters \(\theta\) that result in the best function approximation.

  • The function \(f\) is composed of a chain of functions: \[f=f^{(k)}(f^{(k-1)}(\ldots f^{(1)})),\] where \(f^{(1)}\) is called the first layer, and so on. The depth of the network is \(k\). The final layer of a feedforward network is called the output layer.

Rather than thinking of the layer as representing a single vector-to-vector function, we can also think of the layer as consisting of many units that act in parallel, each representing a vector-to-scalar function.

Designing the architecture of the network is an art:

  • how many layers the network should contain

  • how these layers should be connected to each other

  • how many units should be in each layer.

4.1.3 One Layer NN

In this shallow neural network, we have:

  • \(x_1,\ x_2,\ x_3\) are the inputs of the neural network. These elements are scalars and are stacked vertically; together they represent the input layer.

  • The variables in the hidden layer are not observed in the input data.

  • The output layer consists of a single neuron only; \(\hat{y}\) is the output of the neural network.

Let us make precise some notation that we will use:

\[\left\{
\begin{eqnarray*}
\color{Green} {z_1^{[1]} } &=& \color{Orange} {w_1^{[1]}} ^T \color{Red}x + \color{Blue} {b_1^{[1]} } \hspace{2cm}\color{Purple} {a_1^{[1]}} = \sigma( \color{Green} {z_1^{[1]}} )\\
\color{Green} {z_2^{[1]} } &=& \color{Orange} {w_2^{[1]}} ^T \color{Red}x + \color{Blue} {b_2^{[1]} } \hspace{2cm} \color{Purple} {a_2^{[1]}} = \sigma( \color{Green} {z_2^{[1]}} )\\
\color{Green} {z_3^{[1]} } &=& \color{Orange} {w_3^{[1]}} ^T \color{Red}x + \color{Blue} {b_3^{[1]} } \hspace{2cm} \color{Purple} {a_3^{[1]}} = \sigma( \color{Green} {z_3^{[1]}} )\\
\color{Green} {z_4^{[1]} } &=& \color{Orange} {w_4^{[1]}} ^T \color{Red}x + \color{Blue} {b_4^{[1]} } \hspace{2cm} \color{Purple} {a_4^{[1]}} = \sigma( \color{Green} {z_4^{[1]}} )
\end{eqnarray*}\right.\]

where \(x=(x_1,x_2,x_3)^T\) and \(w_j^{[1]}=(w_{j,1}^{[1]},w_{j,2}^{[1]},w_{j,3}^{[1]})^T\) (for \(j=1,\ldots,4\)).
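As a sketch of what these four per-neuron computations look like numerically, the following NumPy snippet loops over the hidden units, assuming a sigmoid activation and placeholder values for \(x\), the weights and the biases:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])     # input vector (3 features), placeholder values
W1 = np.random.randn(4, 3)         # rows are w_1^{[1]}, ..., w_4^{[1]}
b1 = np.random.randn(4)            # biases b_1^{[1]}, ..., b_4^{[1]}

# One computation per hidden neuron, exactly as in the equations above.
a1 = np.empty(4)
for j in range(4):
    z_j = W1[j] @ x + b1[j]        # z_j^{[1]} = w_j^{[1]T} x + b_j^{[1]}
    a1[j] = sigmoid(z_j)           # a_j^{[1]} = sigma(z_j^{[1]})
```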

Then, the output layer is defined by:

\[\begin{eqnarray*}
\color{Green} {z_1^{[2]} } &=& \color{Orange} {w_1^{[2]}} ^T \color{purple}a^{[1]} + \color{Blue} {b_1^{[2]} } \hspace{2cm}\color{Purple} {a_1^{[2]}} = \sigma( \color{Green} {z_1^{[2]}} )\\
\end{eqnarray*}\]

where \(a^{[1]}=(a^{[1]}_1,\ldots,a^{[1]}_4)^T\) and \(w_1^{[2]}=(w_{1,1}^{[2]},w_{1,2}^{[2]},w_{1,3}^{[2]},w_{1,4}^{[2]})^T\).

One can use a matrix representation for efficient computation:

\[\begin{equation} \begin{bmatrix} \color{Orange}- & \color{Orange} {w_1^{[1]} }^T & \color{Orange}-\\ \color{Orange}- & \color{Orange} {w_2^{[1] } } ^T & \color{Orange}- \\ \color{Orange}- & \color{Orange} {w_3^{[1]} }^T & \color{Orange}- \\ \color{Orange}- & \color{Orange} {w_4^{[1]} }^T & \color{Orange}- \end{bmatrix} \begin{bmatrix} \color{Red}{x_1} \\ \color{Red}{x_2} \\ \color{Red}{x_3} \end{bmatrix} + \begin{bmatrix} \color{Blue} {b_1^{[1]} } \\ \color{Blue} {b_2^{[1]} } \\ \color{Blue} {b_3^{[1]} } \\ \color{Blue} {b_4^{[1]} } \end{bmatrix} = \begin{bmatrix} \color{Orange} {w_1^{[1]} }^T \color{Red}x + \color{Blue} {b_1^{[1]} } \\ \color{Orange} {w_2^{[1] } } ^T \color{Red}x +\color{Blue} {b_2^{[1]} } \\ \color{Orange} {w_3^{[1]} }^T \color{Red}x +\color{Blue} {b_3^{[1]} } \\ \color{Orange} {w_4^{[1]} }^T \color{Red}x + \color{Blue} {b_4^{[1]} } \end{bmatrix} = \begin{bmatrix} \color{Green} {z_1^{[1]} } \\ \color{Green} {z_2^{[1]} } \\ \color{Green} {z_3^{[1]} } \\ \color{Green} {z_4^{[1]} } \end{bmatrix} \end{equation}\]

and by defining
\[\color{Orange}{W^{[1]}} = \begin{bmatrix} \color{Orange}- & \color{Orange} {w_1^{[1]} }^T & \color{Orange}-\\ \color{Orange}- & \color{Orange} {w_2^{[1] } } ^T & \color{Orange}- \\ \color{Orange}- & \color{Orange} {w_3^{[1]} }^T & \color{Orange}- \\ \color{Orange}- & \color{Orange} {w_4^{[1]} }^T & \color{Orange}- \end{bmatrix} \hspace{2cm} \color{Blue} {b^{[1]}} = \begin{bmatrix} \color{Blue} {b_1^{[1]} } \\ \color{Blue} {b_2^{[1]} } \\ \color{Blue} {b_3^{[1]} } \\ \color{Blue} {b_4^{[1]} } \end{bmatrix} \hspace{2cm} \color{Green} {z^{[1]} } = \begin{bmatrix} \color{Green} {z_1^{[1]} } \\ \color{Green} {z_2^{[1]} } \\ \color{Green} {z_3^{[1]} } \\ \color{Green} {z_4^{[1]} } \end{bmatrix} \hspace{2cm} \color{Purple} {a^{[1]} } = \begin{bmatrix} \color{Purple} {a_1^{[1]} } \\ \color{Purple} {a_2^{[1]} } \\ \color{Purple} {a_3^{[1]} } \\ \color{Purple} {a_4^{[1]} } \end{bmatrix}\]
we can write
\[\color{Green}{z^{[1]} } = W^{[1]} x + b ^{[1]}\]
and then by applying Element-wise Independent activation function \(\sigma(\cdot)\) to the vector \(z^{[1]}\) (meaning that \(\sigma(\cdot)\) are applied independently to each element of the input vector \(z^{[1]}\)) we get:

\[\color{Purple}{a^{[1]}} = \sigma (\color{Green}{ z^{[1]} }).\]
The output layer can be computed in a similar way:

\[\color{YellowGreen}{z^{[2]} } = W^{[2]} a^{[1]} + b ^{[2]}\]

where

\[\color{Orange}{W^{[2]}} = \begin{bmatrix} \color{Orange} {w_{1,1}^{[2]} } & \color{Orange} {w_{1,2}^{[2]} } & \color{Orange} {w_{1,3}^{[2]} } & \color{Orange} {w_{1,4}^{[2]} } \end{bmatrix} \hspace{2cm} \color{Blue} {b^{[2]}} = \color{Blue} {b_1^{[2]} } \]

Since the output layer has a single neuron, \(W^{[2]}\in\Re^{1\times 4}\) and \(b^{[2]}\) is a scalar.

and finally:

\[\color{Pink}{a^{[2]}} = \sigma ( \color{LimeGreen}{z^{[2]} })\longrightarrow \color{red}{\hat{y}}\]

So far we have used:

  • The superscript \(^{[i]}\) denotes the layer number and the subscript \(_j\) denotes the neuron number within a particular layer.

  • \(x\) is the input vector consisting of 3 features.

  • \(w^{[i]}_j\) is the weight vector associated with neuron \(j\) in layer \(i\).

  • \(b^{[i]}_j\) is the bias associated with neuron \(j\) in layer \(i\).

  • \(z^{[i]}_j\) is the intermediate output associated with neuron \(j\) in layer \(i\).

  • \(a^{[i]}_j\) is the final output (activation) associated with neuron \(j\) in layer \(i\).

  • \(\sigma(\cdot)\) is the activation function; the sigmoid function is used here as an example.

Using this notation the forward-propagation equations are:

\[\left\{
\begin{eqnarray*}
{z^{[1]} } &=& W^{[1]}x +b^{[1]} \\
{a^{[1]} } &=& \sigma(z^{[1]} )\\
{z^{[2]} } &=& W^{[2]}a^{[1]} +b^{[2]}\\
\hat{y}&=&a^{[2]}=\sigma(z^{[2]})
\end{eqnarray*}\right.\]
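A minimal NumPy sketch of these forward-propagation equations for the 3-4-1 network above; the parameter values are random placeholders and \(\sigma\) is taken to be the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))   # hidden layer
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=(1, 1))   # output layer

x = rng.normal(size=(3, 1))       # one input vector, stored as a column

z1 = W1 @ x + b1                  # z^{[1]} = W^{[1]} x + b^{[1]}
a1 = sigmoid(z1)                  # a^{[1]} = sigma(z^{[1]})
z2 = W2 @ a1 + b2                 # z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}
y_hat = sigmoid(z2)               # y_hat = a^{[2]} = sigma(z^{[2]})
```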

Why is a non-linear activation important? Consider this neural network without activation functions:

\[\left\{
\begin{eqnarray*}
{z^{[1]} } &=& W^{[1]}x +b^{[1]} \\
\hat{y}&=&z^{[2]}=W^{[2]}z^{[1]} +b^{[2]}
\end{eqnarray*}\right.\]
Then, it follows

\[\left\{
\begin{eqnarray*}
{z^{[1]} } &=& W^{[1]}x +b^{[1]} \\
\hat{y}&=&z^{[2]}=W^{[2]}W^{[1]}x+ W^{[2]}b^{[1]}+b^{[2]}\\
\hat{y}&=&z^{[2]}=W_{new}x+ b_{new}\\
\end{eqnarray*}\right.\]
The output is then a linear function of the input, with a new weight matrix \(W_{new}=W^{[2]}W^{[1]}\) and a new bias \(b_{new}=W^{[2]}b^{[1]}+b^{[2]}\). Thus, if we use an identity activation function, the neural network outputs a linear function of the input: a composition of two linear functions is itself linear, and so we lose the representational power of a neural network. However, a linear activation function is commonly used in the output layer in the case of regression.
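This collapse into a single affine map can be checked numerically; here is a small sketch with arbitrary placeholder parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=(1, 1))
x = rng.normal(size=(3, 1))

# Two "layers" with identity activation...
y_two_layers = W2 @ (W1 @ x + b1) + b2

# ...are exactly one affine map W_new x + b_new.
W_new = W2 @ W1
b_new = W2 @ b1 + b2
y_one_layer = W_new @ x + b_new

assert np.allclose(y_two_layers, y_one_layer)
```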

4.1.4 N-Layer Neural Network

We can generalize the previous simple neural network to a multi-layer fully connected neural network by stacking more layers, obtaining a deeper fully connected neural network defined by the following equations:

\[\left\{
\begin{eqnarray*}
{a^{[1]} } &=& g^{[1]}(W^{[1]}x +b^{[1]}) \\
{a^{[2]} } &=& g^{[2]}(W^{[2]}a^{[1]} +b^{[2]}) \\
\ldots &=& \ldots \\
{a^{[r-1]} } &=& g^{[r-1]}(W^{[r-1]}a^{[r-2]} +b^{[r-1]}) \\
\hat{y}=a^{[r]}&=& g^{[r]}(W^{[r]}a^{[r-1]} +b^{[r]})
\end{eqnarray*}\right.\]

This neural network is composed of \(r\) layers, based on \(r\) weight matrices \(W^{[1]},\ldots,W^{[r]}\) and \(r\) bias vectors \(b^{[1]},\ldots,b^{[r]}\). The activation functions, denoted \(g^{[k]}\), may differ from layer to layer. The number of neurons may also differ from layer to layer and is denoted \(m_k\) for layer \(k\). Using this notation:

  • The total number of neurons in the network is \(m_1+\ldots+m_r\).

  • The total number of parameters in this network is \((d+1)m_1+(m_1+1)m_2+\ldots+(m_{r-1}+1)m_r\) (see the sketch after this list).

  • \(W^{[1]}\in\Re^{m_1\times d}\), \(W^{[k]}\in\Re^{m_k\times m_{k-1}}\) (\(k=2,\ldots,r-1\)) and \(W^{[r]}\in\Re^{1\times m_{r-1}}\).
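The following sketch illustrates both the generic \(r\)-layer forward pass and the parameter count above, assuming (for illustration only) that every \(g^{[k]}\) is the sigmoid and using placeholder layer widths:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d = 3                          # input dimension
widths = [4, 5, 1]             # m_1, ..., m_r (placeholder values)
rng = np.random.default_rng(2)

# W^{[k]} has shape (m_k, m_{k-1}), with m_0 = d.
dims = [d] + widths
params = [(rng.normal(size=(m, m_prev)), rng.normal(size=(m, 1)))
          for m_prev, m in zip(dims[:-1], dims[1:])]

# Total parameter count: (d+1)m_1 + (m_1+1)m_2 + ... + (m_{r-1}+1)m_r
n_params = sum((m_prev + 1) * m for m_prev, m in zip(dims[:-1], dims[1:]))

def forward(x):
    a = x
    for W, b in params:
        a = sigmoid(W @ a + b)   # a^{[k]} = g^{[k]}(W^{[k]} a^{[k-1]} + b^{[k]})
    return a

y_hat = forward(rng.normal(size=(d, 1)))
```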

How do we count layers in a neural network?

When counting layers in a neural network we count hidden layers as well as the output layer, but we don’t count an input layer.


For example, a network with three hidden layers and one output layer is a four-layer neural network.

4.1.5 Vectorizing Across Multiple Training Examples

So far we have defined our neural network using only one input feature vector \(x\) to generate a prediction \(\hat{y}\):
\[x\longrightarrow a^{[2]}=\hat{y}\]
Let us consider \(m\) training examples \(x^{(1)},\ldots,x^{(m)}\); we can use our NN to obtain, for each training example \(x^{(i)}\), a prediction

\[x^{(i)}\longrightarrow a^{[2](i)}=\hat{y}^{(i)}, \ \ \ \ i=1,\ldots, m.\]
One could use a loop to obtain all the predictions. However, a matrix representation lets us avoid the computational cost of the loop-based strategy.

Let us first define the matrix \(\textbf{X}\) whose columns are the feature vectors of the individual training examples:

\[\textbf{X} = \begin{bmatrix} \vert & \vert & \dots & \vert \\ x^{(1)} & x^{(2)} & \dots & x^{(m)} \\ \vert & \vert & \dots & \vert \end{bmatrix}.\]

Then, we define the matrix \(\textbf{Z}^{[1]}\) with columns \(z^{[1](1)}, \ldots, z^{[1](m)}\):

\[\textbf{Z}^{[1]} = \begin{bmatrix} \vert & \vert & \dots & \vert \\ z^{[1](1)} & z^{[1](2)} & \dots & z^{[1](m)} \\ \vert & \vert & \dots & \vert \end{bmatrix}.\]
The activation matrix for the first hidden layer, \(\textbf{A}^{[1]}\), is defined similarly by:

\[\textbf{A}^{[1]}=\begin{bmatrix} \vert & \vert & \dots & \vert \\ a^{[1](1)} & a^{[1](2)} & \dots & a^{[1](m)} \\ \vert & \vert & \dots & \vert\end{bmatrix},\]
where, for example, the element in the first row and first column of the matrix \(\textbf{A}^{[1]}\) is the activation of the first hidden unit for the first training example. For clarification:

\[\textbf{A}^{[1]} = \begin{bmatrix} \text{1st unit of 1st tr. example} & \text{1st unit of 2nd tr. example} & \dots & \text{1st unit of } m\text{th tr. example} \\ \text{2nd unit of 1st tr. example} & \text{2nd unit of 2nd tr. example} & \dots & \text{2nd unit of } m\text{th tr. example} \\ \vdots & \vdots & \ddots & \vdots \\ \text{last unit of 1st tr. example} & \text{last unit of 2nd tr. example} & \dots & \text{last unit of } m\text{th tr. example} \end{bmatrix}\]
Based on this matrix representation we get:

\[\left\{
\begin{eqnarray*}
{Z^{[1]} } &=& W^{[1]}\textbf{X} +b^{[1]} \\
{A^{[1]} } &=& \sigma({Z^{[1]} }) \\
{Z^{[2]} } &=& W^{[2]}{A^{[1]} } +b^{[2]} \\
{A^{[2]} } &=& \sigma({Z^{[2]} }) \\
\end{eqnarray*}\right.\]

One may notice that we add \(b^{[1]}\in \Re^{4\times 1}\) to \(W^{[1]}\textbf{X}\in \Re^{4\times m}\), which is, strictly speaking, not allowed by the rules of linear algebra. In practice, however, this addition is performed using broadcasting. By defining

\[\tilde{b}^{[1]} = \begin{bmatrix} \vert & \vert & \dots & \vert \\ b^{[1]} & b^{[1]} & \dots & b^{[1]} \\ \vert & \vert & \dots & \vert \end{bmatrix},\]
we can compute:

\[{Z^{[1]} } = W^{[1]}\textbf{X} +\tilde{b}^{[1]}\]
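In NumPy, this broadcasting happens automatically when \(b^{[1]}\) is stored as a column vector, so \(\tilde{b}^{[1]}\) never needs to be formed explicitly. A minimal sketch with placeholder values, for \(m\) training examples stacked as the columns of \(\textbf{X}\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
m = 5                                        # number of training examples
X = rng.normal(size=(3, m))                  # each column is one x^{(i)}
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=(1, 1))

Z1 = W1 @ X + b1        # (4, m); b1 is broadcast across the m columns
A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2       # (1, m)
A2 = sigmoid(Z2)        # one prediction per column, i.e. per training example
```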