Recurrent Neural Network | Brilliant Math & Science Wiki

Consider an application that needs to predict an output sequence \(y = \left(y_1, y_2, \dots, y_n\right)\) for a given input sequence \(x = \left(x_1, x_2, \dots, x_m\right)\). For example, in an application for translating English to Spanish, the input \(x\) might be the English sentence \(\text{“i like pizza”}\) and the associated output sequence \(y\) would be the Spanish sentence \(\text{“me gusta comer pizza”}\). Thus, if the sequence was broken up by character, then \(x_1=\text{“i”}\), \(x_2=\text{“ ”}\), \(x_3=\text{“l”}\), \(x_4=\text{“i”}\), \(x_5=\text{“k”}\), all the way up to \(x_{12}=\text{“a”}\). Similarly, \(y_1=\text{“m”}\), \(y_2=\text{“e”}\), \(y_3=\text{“ ”}\), \(y_4=\text{“g”}\), all the way up to \(y_{20}=\text{“a”}\). Obviously, other input-output pair sentences are possible, such as \((\text{“it is hot today”}, \text{“hoy hace calor”})\) and \((\text{“my dog is hungry”}, \text{“mi perro tiene hambre”})\).
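The character indexing above can be checked directly. A quick sketch in Python (note that the mathematical notation is 1-based, so \(x_1\) corresponds to index \(0\) in a Python list):

```python
# Break the example input and output sentences into character sequences.
x = list("i like pizza")           # x_1, ..., x_12
y = list("me gusta comer pizza")   # y_1, ..., y_20

# Math notation is 1-based; Python lists are 0-based,
# so x_1 is x[0] and x_12 is x[11] (equivalently x[-1]).
print(x[0])    # x_1  = "i"
print(x[1])    # x_2  = " "
print(x[-1])   # x_12 = "a"
print(y[0])    # y_1  = "m"
print(y[-1])   # y_20 = "a"
print(len(x), len(y))  # 12 20
```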

It might be tempting to try to solve this problem using feedforward neural networks, but two problems become apparent upon investigation. The first issue is that the sizes of an input \(x\) and an output \(y\) are different for different input-output pairs. In the example above, the input-output pair \((\text{“it is hot today”}, \text{“hoy hace calor”})\) has an input of length \(15\) and an output of length \(14\) while the input-output pair \((\text{“my dog is hungry”}, \text{“mi perro tiene hambre”})\) has an input of length \(16\) and an output of length \(21\). Feedforward neural networks have fixed-size inputs and outputs, and thus cannot be automatically applied to temporal sequences of arbitrary length.

The second issue is a bit more subtle. One can imagine trying to circumvent the above issue by specifying a max input-output size, and then padding inputs and outputs that are shorter than this maximum size with some special null character. Then, a feedforward neural network could be trained that learns to produce \(y_i\) on input \(x_i\). Thus, in the example \((\text{“it is hot today”}, \text{“hoy hace calor”})\), the training pairs would be

\[\big\{(x_1=\text{“i”}, y_1=\text{“h”}), (x_2=\text{“t”}, y_2=\text{“o”}), \dots, (x_{14}=\text{“a”}, y_{14}=\text{“r”}), (x_{15}=\text{“y”}, y_{15}=\text{“*”})\big\},\]

where the maximum size is \(15\) and the padding character is \(\text{“*”}\), used to pad the output, which at length \(14\) is one short of the maximum length \(15\).
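The padding scheme just described can be sketched in a few lines (a minimal illustration; the maximum size of \(15\) and the pad character \(\text{“*”}\) follow the example above):

```python
MAX_LEN = 15
PAD = "*"

def pad(s, max_len=MAX_LEN, pad_char=PAD):
    """Right-pad a string to max_len with the pad character."""
    return s + pad_char * (max_len - len(s))

x = pad("it is hot today")  # already length 15, unchanged
y = pad("hoy hace calor")   # length 14, padded to "hoy hace calor*"

# The feedforward model would be trained on positionwise pairs (x_i, y_i).
pairs = list(zip(x, y))
print(pairs[0])    # ('i', 'h')
print(pairs[14])   # ('y', '*')
```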

The problem with this approach is that there is no reason to believe that \(x_1\) has anything to do with \(y_1\). The order of the words (and thus characters) in a Spanish sentence often differs from their order in the English original. Thus, if the first word in an English sentence is the last word in its Spanish translation, it stands to reason that any network hoping to perform the translation will need to remember that first word (or some representation of it) until it outputs the end of the Spanish sentence. Any neural network that computes sequences needs a way to remember past inputs and computations, since they might be needed for computing later parts of the output sequence. One might say that the neural network needs a way to remember its context, i.e. the relation between its past and its present.
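This kind of memory is exactly what a recurrent cell provides: a hidden state \(h_t\) carried from step to step, updated by the standard vanilla-RNN rule \(h_t = \tanh\left(W_{xh} x_t + W_{hh} h_{t-1} + b\right)\). A minimal sketch (the dimensions and random weights here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: input vectors of dimension 4, hidden state of dimension 3.
W_xh = rng.normal(size=(3, 4))  # input-to-hidden weights
W_hh = rng.normal(size=(3, 3))  # hidden-to-hidden weights (the "memory")
b = np.zeros(3)

def rnn_step(x_t, h_prev):
    """One recurrent step: the new hidden state mixes the current input
    with the previous hidden state, so information from earlier inputs
    can persist across the whole sequence."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

# Run a short sequence of 5 input vectors through the cell.
h = np.zeros(3)  # initial context: empty
for x_t in rng.normal(size=(5, 4)):
    h = rnn_step(x_t, h)

print(h.shape)  # (3,) - a fixed-size summary of everything seen so far
```

Because the same `rnn_step` is applied at every position, the cell handles sequences of any length with a fixed number of weights, addressing both problems raised above.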