Generative Adversarial Network

What is a Generative Adversarial Network?

A generative adversarial network, or GAN, is a deep neural network framework which is able to learn from a set of training data and generate new data with the same characteristics as the training data. For example, a generative adversarial network trained on photographs of human faces can generate realistic-looking faces which are entirely fictitious.

Generative adversarial networks consist of two neural networks, the generator and the discriminator, which compete against each other. The generator is trained to produce fake data, and the discriminator is trained to distinguish the generator’s fake data from real examples. If the generator produces fake data that the discriminator can easily recognize as implausible, such as an image that is clearly not a face, the generator is penalized. Over time, the generator learns to generate more plausible examples.

Generative Adversarial Network Architecture

A generative adversarial network is made up of two neural networks:

the generator, which learns to produce realistic fake data from a random seed. The fake examples produced by the generator are used as negative examples for training the discriminator.

the discriminator, which learns to distinguish the fake data from realistic data. If the generator produces implausible results, the discriminator penalizes the generator.

The generator’s fake examples and the real examples from the training set are fed into the discriminator network in random order. The discriminator is not told whether a particular input originated from the generator or from the training set.

Initially, before training has begun, the generator’s fake output is very easy for the discriminator to recognize.

Because the output of the generator is fed directly into the discriminator as input, the two networks form one connected system. When the discriminator classifies an output of the generator, we can apply the backpropagation algorithm through the discriminator and into the generator, and update the generator’s weights.

Over time, the generator’s output becomes more realistic and the generator gets better at fooling the discriminator. Eventually, the generator’s outputs are so realistic that the discriminator is unable to distinguish them from the real examples.

The Discriminator in a Generative Adversarial Network

The discriminator is simply a binary classifier, ending with a suitable function such as the softmax function. It outputs an array of two numbers, which indicate the discriminator’s estimate of the probability of the input example being real or fake.
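As a minimal sketch of what such an output could look like, the snippet below applies a softmax to two raw scores. The scores and the ordering [real, fake] are made up for illustration and are not taken from any particular model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the two classes, in the order [real, fake].
p_real, p_fake = softmax([2.0, 0.0])
print(p_real, p_fake)
```

With only two classes, the softmax is equivalent to a sigmoid applied to the difference between the two scores, which is why many implementations use a single sigmoid output instead.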

The discriminator’s input may come from two sources:

the training set, such as real photos of faces, or real audio recordings.

the generator, such as generated synthetic faces, or fake audio recordings.

While we are training the discriminator, we do not train the generator, but hold the generator’s weights constant and use it to produce negative examples for the discriminator.

The process for training the discriminator in a GAN

Pass some real examples, and some fake examples from the generator, into the discriminator as input.

The discriminator classifies them into real and fake.

Calculate the discriminator loss using a suitable function such as the cross-entropy loss.

Update the discriminator’s weights through backpropagation.

In essence, this process is the same as the process for training any other kind of binary classifier, such as a convolutional neural network in the case of computer vision.
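The steps above can be sketched on a toy one-dimensional problem. This is an illustration under stated assumptions rather than a reference implementation: the discriminator is assumed to be a single logistic unit D(x) = sigmoid(w*x + b), the "real" and "fake" samples are made-up Gaussian draws standing in for a dataset and a frozen generator, and the name discriminator_step is hypothetical.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def discriminator_step(w, b, real, fake, lr=0.1):
    """One gradient step on the cross-entropy loss for D(x) = sigmoid(w*x + b)."""
    batch = [(x, 1.0) for x in real] + [(x, 0.0) for x in fake]  # label 1 = real, 0 = fake
    grad_w = grad_b = loss = 0.0
    for x, y in batch:
        p = sigmoid(w * x + b)                                   # classify the example
        loss += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))  # cross-entropy
        grad_w += (p - y) * x                                    # d(loss)/dw
        grad_b += (p - y)                                        # d(loss)/db
    n = len(batch)
    return w - lr * grad_w / n, b - lr * grad_b / n, loss / n

rng = random.Random(0)
real = [rng.gauss(4.0, 0.5) for _ in range(64)]  # stand-in for real training data
fake = [rng.gauss(0.0, 0.5) for _ in range(64)]  # stand-in for a frozen generator's output

w, b = 0.0, 0.0
losses = []
for _ in range(200):
    w, b, loss = discriminator_step(w, b, real, fake)
    losses.append(loss)
```

Only w and b change in this loop. The fake samples are generated once and never updated, which mirrors holding the generator's weights constant while the discriminator is trained.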

The Generator in a Generative Adversarial Network

The generator network is a feedforward neural network that learns over time to produce plausible fake data, such as fake faces. It uses feedback from the discriminator to gradually improve its output until, ideally, the discriminator is unable to distinguish its output from real data.

The process of training the generator in a GAN

At the start of training, we initialize both the generator and discriminator with random weights.

For each training iteration, we pass random noise (the random seed) into the generator as input. The noise is propagated through the generator, which outputs a synthetic example, such as an image.

The generator output is then passed as input into the discriminator network, and the discriminator classifies the example as ‘real’ or ‘fake’. 

We calculate a loss function for the generator. The generator’s loss function represents how good the generator was at tricking the discriminator.

We use the backpropagation algorithm through both the discriminator and the generator to determine how to adjust only the generator’s weights in order to improve the generator loss function.

Note that at this point we do not adjust the discriminator’s weights, because the discriminator needs to stay static while we are training the generator. If we did not do this, training the generator would be like trying to hit a moving target.
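A generator update of this kind can be sketched on a toy one-dimensional problem. The following is a hedged illustration, not a definitive implementation: the generator is assumed to be G(z) = a*z + c, the frozen discriminator is assumed to be D(x) = sigmoid(w*x + b) with made-up weights, the generator loss is taken to be the average of log(1 - D(G(z))), and generator_step is a hypothetical name.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def generator_step(a, c, w, b, noise, lr=0.1):
    """One gradient step for G(z) = a*z + c against a frozen D(x) = sigmoid(w*x + b)."""
    grad_a = grad_c = loss = 0.0
    for z in noise:
        x = a * z + c               # generator forward pass
        p = sigmoid(w * x + b)      # frozen discriminator's estimate that x is real
        loss += math.log(1.0 - p)   # low when the discriminator is fooled (p near 1)
        dloss_dx = -p * w           # backpropagate through the discriminator
        grad_a += dloss_dx * z      # ...and on into the generator's weights
        grad_c += dloss_dx
    n = len(noise)
    return a - lr * grad_a / n, c - lr * grad_c / n, loss / n

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(64)]  # fixed batch, for a repeatable demo
w, b = 1.0, -2.0   # frozen discriminator weights (made-up values)
a, c = 1.0, 0.0    # generator weights
losses = []
for _ in range(50):
    a, c, loss = generator_step(a, c, w, b, noise)
    losses.append(loss)
```

The gradient flows through the discriminator (the factor w in dloss_dx), but only a and c are returned and updated. The discriminator's w and b are read and never written, which is the "static target" described above.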

How does training a generative adversarial network work?

There are two aspects that make generative adversarial networks more complex to train than a standard feedforward neural network:

The generator and the discriminator are really two neural networks which must be trained separately, but they also interact so they cannot be trained completely independently of each other.

It is hard to identify exactly when a generative adversarial network has converged.

Since the generator and discriminator have their own separate loss functions, we have to train them separately. We can do this by alternating between the two:

We train the discriminator for one or more epochs, keeping the generator weights constant.

We train the generator for one or more epochs, keeping the discriminator weights constant.

We repeat these two steps until we determine that the network has converged.
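The alternating schedule can be sketched as a small loop. This is only a skeleton under assumptions: the two callbacks are hypothetical placeholders for "one update of the discriminator" and "one update of the generator", the unit of work is a single update rather than an epoch, and a fixed number of rounds stands in for a real convergence check.

```python
def train_alternating(train_discriminator, train_generator, rounds,
                      d_steps=1, g_steps=1):
    """Alternate between the two phases. Each callback performs one update of
    its own network while treating the other network as fixed."""
    for _ in range(rounds):
        for _ in range(d_steps):
            train_discriminator()   # generator weights held constant
        for _ in range(g_steps):
            train_generator()       # discriminator weights held constant

# Stand-in callbacks that only record the order in which updates happen.
schedule = []
train_alternating(lambda: schedule.append("D"),
                  lambda: schedule.append("G"),
                  rounds=2, d_steps=2, g_steps=1)
print(schedule)
```

The ratio of d_steps to g_steps is a tuning choice. Giving the discriminator more updates per round keeps its feedback informative, at the cost of extra computation.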

Convergence in a Generative Adversarial Network

Once the generator is able to produce fakes that are indistinguishable from real examples, the discriminator has a much more difficult task. In fact, for a perfect generator, the discriminator will have only 50% accuracy in distinguishing fakes from genuine examples.

This means that the discriminator feedback, which we are using to train the generator, becomes less meaningful over time, and eventually becomes completely random, like a coin toss.

If we continue to train the network beyond this point, then the discriminator’s feedback can actually cause the generator’s quality to go down. For this reason, it is important to monitor the quality of the generated output and stop training once the discriminator has ‘lost’ the game to the generator.
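One simple signal to monitor is the discriminator's accuracy on a held-out mix of real and generated examples. The sketch below assumes the discriminator's outputs are available as lists of "probability real" scores. The scores are made up, and looks_converged with its tolerance of 0.05 is a hypothetical heuristic, not an established stopping rule. In practice it would complement, not replace, inspecting the generated output.

```python
def discriminator_accuracy(scores_on_real, scores_on_fake, threshold=0.5):
    """Fraction of examples classified correctly; scores are D's P(real) estimates."""
    correct = sum(1 for s in scores_on_real if s >= threshold)
    correct += sum(1 for s in scores_on_fake if s < threshold)
    return correct / (len(scores_on_real) + len(scores_on_fake))

def looks_converged(accuracy, tolerance=0.05):
    """Hypothetical heuristic: accuracy is within `tolerance` of chance (0.5)."""
    return abs(accuracy - 0.5) <= tolerance

# Early in training: the discriminator separates real from fake easily.
early = discriminator_accuracy([0.9, 0.8, 0.95, 0.7], [0.1, 0.2, 0.05, 0.3])
# Late in training: its scores hover around 0.5 for both kinds of input.
late = discriminator_accuracy([0.52, 0.48, 0.51, 0.49], [0.51, 0.47, 0.53, 0.49])
print(early, looks_converged(early))
print(late, looks_converged(late))
```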

Loss Function of a Generative Adversarial Network

The loss function used by Ian Goodfellow and his colleagues in their 2014 paper that introduced generative adversarial networks is as follows:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x}[\log D(x)] + \mathbb{E}_{z}[\log(1 - D(G(z)))]$$

The generator tries to minimize the output of the above loss function and the discriminator tries to maximize it. This way a single loss function can be used for both the generator and discriminator.

Loss Function Symbols Explained

D(x): The estimate by the discriminator of the probability that an input example x is real.

E_x: The expected value over all genuine examples.

G(z): The fake example produced by the generator for random seed z.

D(G(z)): The estimate by the discriminator of the probability that a fake input example, G(z), from the generator is real.

E_z: The expected value over all random inputs to the generator.

The generator can only influence the second term in the loss function, since the first term depends only on the discriminator.
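As a rough illustration, the loss can be estimated from samples by averaging over a batch of discriminator outputs. The function name gan_value and the example scores below are assumptions made for this sketch.

```python
import math

def gan_value(d_on_real, d_on_fake):
    """Sample estimate of E_x[log D(x)] + E_z[log(1 - D(G(z)))].

    d_on_real: D(x) for real examples; d_on_fake: D(G(z)) for generated examples.
    """
    real_term = sum(math.log(p) for p in d_on_real) / len(d_on_real)
    fake_term = sum(math.log(1.0 - p) for p in d_on_fake) / len(d_on_fake)
    return real_term + fake_term

# A confident, correct discriminator pushes the value up towards 0.
confident = gan_value([0.9, 0.9], [0.1, 0.1])
# A discriminator that outputs 0.5 everywhere gives log(1/2) + log(1/2) = -log(4).
fooled = gan_value([0.5, 0.5], [0.5, 0.5])
print(confident, fooled)
```

Only the second argument depends on the generator, which matches the observation that the generator can affect only the second term.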

Example of Training a Generative Adversarial Network

Let us take the example of training a generative adversarial network to synthesize handwritten digits. Below is a sample handwritten number 5 from the MNIST dataset. The MNIST dataset is a database of handwritten digits 0 to 9, with 60,000 training images and 10,000 test images, each 28×28 pixels. It is widely used for testing algorithms in computer vision.

When we first initialize the generative adversarial network, the images produced by the generator are pure noise, like this:

Because this noise is very different from a handwritten digit, the discriminator immediately learns to tell the real and fake data apart.

The generator then begins to learn how to fool the discriminator. After four epochs (passing the whole MNIST dataset through the generative adversarial network four times, which takes a minute or so on a GPU), the generator starts producing random images that begin to resemble numbers. The discriminator’s task is getting trickier.

After a further 20 epochs, the generator’s output begins to look recognizable:

Below is the generator’s output at 45 epochs. We can see that it is hard even for a human to recognize that this image is artificial, and at this point the discriminator’s ability to recognize the fake examples is no better than chance.

Generative Adversarial Networks vs Variational Autoencoders

Both generative adversarial networks and variational autoencoders are deep generative models, which means that they model the distribution of the training data, such as images, sound, or text, instead of trying to model the probability of a label given an input example, which is what a discriminative model does.

A variational autoencoder learns a low-dimensional representation of the important information in its training data. It is able to learn a function that maps a set of 256×256-pixel face images, for example, to a vector of length 100, and also the inverse function that transforms the vector back into a face image.

Both generative adversarial networks and variational autoencoders are able to generate examples that are recognizably similar to the training set, such as digits or faces. However, the output of a GAN is more realistic and visually similar to the training set. In the case of image generation, variational autoencoders tend to generate distorted and blurred images.

The main difference between the two is how they are trained. Generative adversarial networks have two competing objectives, one for the generator and one for the discriminator, and are ultimately a kind of unsupervised model. On the other hand, variational autoencoders are trained to minimize a single loss function while reproducing each image in the training set. Because the training target is the input itself and no labels are needed, this can be seen as a kind of self-supervised learning.

Because of the more complex design with two networks and two loss functions, generative adversarial networks are much slower to train than variational autoencoders, although, as noted above, the output of the generator is more realistic than that of a variational autoencoder.

Applications of Generative Adversarial Networks

Generative Adversarial Networks for Synthetic Training Data

Generative adversarial networks can be used to generate synthetic training data for machine learning applications where training data is scarce. It is often time consuming and costly to gather training data for many machine learning applications, so using a generative adversarial network to generate synthetic examples, such as random faces, is sometimes an attractive alternative.

Three synthetic faces generated by the generative adversarial network StyleGAN, developed by NVIDIA. These are not real people. StyleGAN was trained on the Flickr-Faces-HQ faces dataset.

Generative Adversarial Networks for Image Style Transfer

In addition to the examples noted above of generating random images resembling a training dataset, generative adversarial networks can be used for style transfer. In 2019 a team at NVIDIA led by Tero Karras published a generative adversarial network architecture called StyleGAN which can be used to transform images from one style to another. 

This network can be used to morph a human face from one gender to another, or change facial orientation. Other generative adversarial network architectures, such as CycleGAN, are capable of replacing all horses in a photograph with zebras, for example, or rendering a photograph in the style of Monet.


StyleGAN allows us to vary parameters within the network to control aspects of the faces that are generated. The fourth face from the left is the ‘mean face’ from the training set, and the faces on either side result from adjusting values within the network which are correlated with age and gender.

Generative Adversarial Networks for Audio Style Transfer

It is even possible to apply generative adversarial networks to audio data. To do this, an audio signal needs to be converted into a spectrogram, where time is on the x-axis, frequency is on the y-axis, and the intensity of sound at a given time point and frequency is represented by ‘color’. Since audio recordings can be of varying lengths, the spectrogram is cut into chunks of constant length. With this preprocessing, an audio signal can be converted into a number of fixed-size images. The generator architecture used can be a convolutional neural network similar to that used in image generation.

With this technique, it is possible to morph audio from one speaker’s voice to another, or to ‘transfer’ a piece of music from classical into a jazz style.
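The chunking step described above can be sketched as follows. The spectrogram is assumed here to be a plain list of frequency rows, each a list of intensities over time. The toy values are made up, and dropping an incomplete final chunk is an assumption of this sketch (padding it would be another option).

```python
def chunk_spectrogram(spectrogram, chunk_width):
    """Split a spectrogram into fixed-width pieces along the time axis.

    spectrogram: list of rows (one per frequency bin), each a list over time frames.
    Any incomplete final chunk is dropped.
    """
    n_frames = len(spectrogram[0])
    chunks = []
    for start in range(0, n_frames - chunk_width + 1, chunk_width):
        chunks.append([row[start:start + chunk_width] for row in spectrogram])
    return chunks

# Toy "spectrogram": 3 frequency bins by 10 time frames of made-up intensities.
toy = [[f * 10 + t for t in range(10)] for f in range(3)]
chunks = chunk_spectrogram(toy, chunk_width=4)
print(len(chunks), len(chunks[0]), len(chunks[0][0]))
```

Each chunk has the same height and width, so it can be treated like a fixed-size single-channel image by a convolutional network.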

Part of a spectrogram of Beethoven’s 9th Symphony. Time is on the x-axis. This can be passed into a generative adversarial network as if it were an image.

Generative Adversarial Networks for Deepfakes

Since generative adversarial networks can generate photorealistic face images and videos, and can also transform audio recordings into another speaker’s voice, they have become widely known for their use in the phenomenon of ‘deepfakes’. These are hyper-realistic fake videos of celebrities and politicians speaking where both the voice and images are AI-generated. This has generated a degree of controversy in recent years due to the potential for unethical uses of the technology.

Generative Adversarial Network History

Generative adversarial networks were first proposed by the American Ian Goodfellow and his colleagues in 2014. In his PhD at the University of Montréal, Goodfellow had studied noise-contrastive estimation, which is a way of learning a data distribution by comparing it with a noise distribution. Noise-contrastive estimation uses a similar loss function to the one used in generative adversarial networks, and Goodfellow developed the loss function further after his PhD and eventually came up with the idea of a generative adversarial network.

From 2016 onwards, generative adversarial networks began to feature in news articles and entered into the public consciousness thanks to their ability to generate realistic and professional-looking artwork.

Below is the Google N-gram frequency of mentions of the term since 2012, showing the sharp uptick in attention that the topic has received in recent years.


In 2018, a group of three Parisian artists called Obvious used a generative adversarial network to generate a painting on canvas called Edmond de Belamy. The network implementation was written by AI artist Robbie Barrat, and trained on 15,000 real portrait paintings. Edmond de Belamy was sold at the auction house Christie’s for $432,500, making headlines around the world and bringing AI art into the public eye.

The AI painting ‘Edmond de Belamy’. Note that the ‘signature’ in the bottom right of the painting is the GAN loss function given above.

In 2019, NVIDIA published the source code and models of their StyleGAN network, allowing the public to generate fake faces on demand.

In recent years generative adversarial networks have again received attention due to their potential to generate convincing ‘deepfakes’. A deepfake is a video of a person talking where the face has been swapped for someone else’s. Together with the ability of generative adversarial networks to generate fake audio recordings, deepfakes have made it possible to generate convincing fake videos of political figures saying things that they never said in reality. This has raised concerns about the rising problem of fake news and media manipulation.

References

[1] Langr and Bok, GANs in Action: Deep learning with Generative Adversarial Networks (2019)

[2] Goodfellow et al., Generative Adversarial Networks (2014)

[3] Google, Generative Adversarial Networks https://developers.google.com/machine-learning/gan

[4] Karras et al., Analyzing and Improving the Image Quality of StyleGAN (2019)

[5] Kingma and Welling, Auto-Encoding Variational Bayes (2013)

[6] Karras et al., A Style-Based Generator Architecture for Generative Adversarial Networks (2019)

[7] Marco Pasini, MelGAN-VC: Voice Conversion and Audio Style Transfer on arbitrarily long samples using Spectrograms (2019)