Update Log
2021-07-14: Initial Upload
Generative Adversarial Network
- A generator G and a discriminator D
- G's goal: minimise the objective so that $D(G(z))$ is close to 1 (the discriminator is fooled into thinking the generated sample is real)
- D's goal: maximise the objective so that $D(x)$ is close to 1 (real) and $D(G(z))$ is close to 0 (fake)
- G and D are both neural networks, and G is differentiable
- Alternate between "gradient ascent on D" and "gradient descent on G"
- Optimisation target:
  $$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
- The following figure shows the optimisation process. We are essentially looking for a map between $z$ and $x$. $p_z(z)$ could be any distribution, but normally a normal distribution is used. The purpose of a GAN is to find a way to transform the distribution from which $z$ is sampled into the target $p_{data}(x)$. Ideally, at the last step, the distribution of the generated data $p_g(x)$ is exactly the same as $p_{data}(x)$; D is then unable to discriminate the generated data from the real data and hence outputs 0.5.
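The value function above can be estimated by Monte Carlo sampling. Below is a minimal numpy sketch with a hand-crafted 1-D generator and discriminator (all distributions and function shapes here are illustrative choices, not from the text); D would maximise this quantity and G would minimise it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D setup: real data x ~ N(2, 1); the generator
# G(z) = z + 0.5 maps noise z ~ N(0, 1) toward the data.
def G(z):
    return z + 0.5

# A hand-crafted discriminator: logistic in x, higher = "more real".
def D(x):
    return 1.0 / (1.0 + np.exp(-(x - 1.0)))

# Monte Carlo estimate of
# V(D, G) = E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))]
x = rng.normal(2.0, 1.0, 100_000)
z = rng.normal(0.0, 1.0, 100_000)
V = np.mean(np.log(D(x))) + np.mean(np.log(1.0 - D(G(z))))
print(V)  # both terms are logs of values in (0, 1), so V < 0
```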

Background Knowledge
- KL divergence - measures the similarity of two distributions; a smaller value means they are closer.
  - discrete: $D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$
  - continuous: $D_{KL}(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$
  - Note that both can be expressed as the expectation $\mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right]$.
- JS divergence - another distribution similarity measurement derived from the KL divergence: $JS(P \| Q) = \frac{1}{2} D_{KL}\!\left(P \,\middle\|\, \frac{P+Q}{2}\right) + \frac{1}{2} D_{KL}\!\left(Q \,\middle\|\, \frac{P+Q}{2}\right)$
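Both divergences are easy to compute for discrete distributions. A small numpy sketch (the probability vectors are illustrative values):

```python
import numpy as np

# Two discrete distributions over the same three-point support.
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])

def kl(p, q):
    # D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x))
    return np.sum(p * np.log(p / q))

def js(p, q):
    # JS(P || Q) = 1/2 KL(P || M) + 1/2 KL(Q || M), with M = (P + Q) / 2
    m = (p + q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(kl(P, Q))             # asymmetric: kl(P, Q) != kl(Q, P) in general
print(js(P, Q))             # symmetric, and bounded above by log 2
print(kl(P, P), js(P, P))   # both are 0 when the distributions coincide
```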
Detailed Analysis
- Consider the optimisation target mentioned above:
  $$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
- Train D while fixing G to maximise $V(D, G)$
  - If one real sample is misclassified, i.e. $D(x)$ outputs some value near 0 instead of 1, it pushes the expectation of the first term toward $-\infty$.
  - Similarly, if one generated sample is misclassified, i.e. $D(G(z))$ outputs some value near 1, the second term is pushed toward $-\infty$.
- Train G to minimise $V(D, G)$
  - Ignore the first term since it does not contain $G$; the target becomes $\min_G \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$
  - The optimal value of D is $D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$
  - Under this condition, the optimal situation for G is that the distribution of the generated data is exactly the same as that of the real data, i.e. $p_g(x) = p_{data}(x)$, which makes $D^*(x) = \frac{1}{2}$.
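The form of $D^*$ can be checked pointwise: at a fixed $x$, writing $a = p_{data}(x)$ and $b = p_g(x)$, the discriminator maximises $a \log d + b \log(1-d)$ over $d \in (0, 1)$, whose maximiser is $a/(a+b)$. A grid-search sketch (the values of $a$ and $b$ are illustrative):

```python
import numpy as np

# At a fixed point x, write a = p_data(x), b = p_g(x).
a, b = 0.7, 0.2

# The discriminator maximises  a*log(d) + b*log(1-d)  pointwise in d.
d = np.linspace(1e-6, 1 - 1e-6, 1_000_000)
objective = a * np.log(d) + b * np.log(1 - d)
d_star = d[np.argmax(objective)]

print(d_star)        # numerical maximiser
print(a / (a + b))   # closed form: p_data(x) / (p_data(x) + p_g(x))
```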
- Let's go back to check the optimal G. Let
  $$C(G) = \max_D V(D, G)$$
  Here we replace $\mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$ with $\mathbb{E}_{x \sim p_g}[\log(1 - D(x))]$ (assuming $G$ is invertible). Plugging in the optimal $D^*$, we have
  $$C(G) = \mathbb{E}_{x \sim p_{data}}\!\left[\log \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}\right] + \mathbb{E}_{x \sim p_g}\!\left[\log \frac{p_g(x)}{p_{data}(x) + p_g(x)}\right]$$
  As the optimal $D^*$ has been plugged in, we can remove the $\max$ from the formula. The result can be rewritten as two KL divergences minus $\log 4$:
  $$C(G) = -\log 4 + D_{KL}\!\left(p_{data} \,\middle\|\, \frac{p_{data} + p_g}{2}\right) + D_{KL}\!\left(p_g \,\middle\|\, \frac{p_{data} + p_g}{2}\right) = -\log 4 + 2 \cdot JS(p_{data} \| p_g)$$
  Because a KL divergence is never negative, the minimum value of $C(G)$ is $-\log 4$, attained when and only when $p_g = p_{data}$,
  which proves that $C(G)$ reaches its minimum (optimum) exactly when the generator reproduces the real data distribution.
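The identity $C(G) = -\log 4 + 2 \cdot JS(p_{data} \| p_g)$ can be verified numerically for discrete stand-ins for the two distributions (the probability vectors below are illustrative):

```python
import numpy as np

# Discrete stand-ins for p_data and p_g.
p_data = np.array([0.6, 0.3, 0.1])
p_g    = np.array([0.2, 0.5, 0.3])

def kl(p, q):
    return np.sum(p * np.log(p / q))

m = (p_data + p_g) / 2
js = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)

# C(G) with the optimal discriminator D* plugged in:
d_star = p_data / (p_data + p_g)
C = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1 - d_star))

print(C)                        # matches -log(4) + 2 * JS(p_data || p_g)
print(-np.log(4) + 2 * js)
```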
Deep Convolutional GANs (DCGANs)
DCGAN modifies the original GAN mainly in the network structure: it uses CNNs instead of fully connected layers in both the generator and the discriminator, see the figure below.

The primary improvements are listed below:
- Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator).
- Use batch normalisation in both the generator and the discriminator.
- Remove fully connected hidden layers for deeper architectures, using 1×1 convolutional layers instead.
- Use ReLU activation in generator for all layers except for the output, which uses Tanh.
- Use LeakyReLU activation in the discriminator for all layers.
- Transposed convolution is used in the generator so that it can produce outputs larger than its inputs, upsampling a small latent feature map into a full-size image.
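The upsampling effect of the fractional-strided (transposed) convolutions can be seen from the standard output-size formula. A small sketch, with layer settings (kernel 4, stride 2, padding 1) chosen for illustration because they exactly double the spatial size each layer:

```python
def conv_transpose_output_size(n_in, kernel, stride, padding):
    # Standard transposed-convolution size formula (no output_padding):
    # n_out = (n_in - 1) * stride - 2 * padding + kernel
    return (n_in - 1) * stride - 2 * padding + kernel

# A chain of transposed convs upsampling a 4x4 feature map, as in a
# DCGAN-style generator.
size = 4
for _ in range(4):
    size = conv_transpose_output_size(size, kernel=4, stride=2, padding=1)
    print(size)  # 8, 16, 32, 64
```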
Wasserstein GAN
Difficulty of GANs
- Gradient vanishing
  - At the very early phase of training, it is very easy for D to become confident in detecting G's samples, so D outputs values near 0 for almost all generated data
  - In a GAN, a better discriminator leads to worse gradient vanishing in its generator
JS divergence
- $JS(P \| Q) = \log 2$ whenever the two distributions have no overlap
- The probability that the supports of $p_{data}$ and $p_g$ have almost zero overlap is 1 (when both lie on low-dimensional manifolds of the data space).
Wasserstein (Earth-Mover) distance
- $$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}[\|x - y\|]$$
- $\Pi(P_r, P_g)$ is the set of all possible joint distributions of $P_r$ and $P_g$, i.e. the marginal distributions of $\gamma$ are $P_r$ and $P_g$
- To understand the Wasserstein distance intuitively, you can imagine the distributions $P_r$ and $P_g$ as two piles of earth; the W-distance between them is the minimum energy you need to move one pile to match the other.
- Compared to the KL and JS divergences, it has the advantage that the distance between two distributions is smoothly reflected even when they do not overlap.
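The contrast is easy to show for two point masses that never overlap: JS saturates at $\log 2$ no matter how far apart they are, while the Wasserstein distance keeps growing with the separation. A numpy sketch (the grid size and positions are illustrative):

```python
import numpy as np

def kl(p, q):
    # Convention: terms with p(x) = 0 contribute 0 to the sum.
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q):
    m = (p + q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Point masses on a 1-D grid: P at position 0, Q at position theta.
js_vals, w_vals = [], []
for theta in (1, 5, 9):
    p = np.zeros(10); p[0] = 1.0
    q = np.zeros(10); q[theta] = 1.0
    js_vals.append(js(p, q))
    # For two point masses, the Wasserstein distance is simply the
    # distance between the two mass locations.
    w_vals.append(abs(0 - theta))

print(js_vals)  # always log 2 ~ 0.693, no matter how far apart
print(w_vals)   # 1, 5, 9: grows smoothly with the separation
```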
Loss Function in WGAN
- Why not directly use the W-distance as the optimisation target? The infimum over $\Pi(P_r, P_g)$ is highly intractable.
- Convert the W-distance to its dual form (Kantorovich-Rubinstein duality):
  $$W(P_r, P_g) = \frac{1}{K} \sup_{\|f\|_L \le K} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_g}[f(x)]$$
- Lipschitz continuity - a constraint on a continuous function $f$ requiring that there exists a constant $K \ge 0$ making any two elements $x_1$ and $x_2$ satisfy $|f(x_1) - f(x_2)| \le K |x_1 - x_2|$. $K$ is the Lipschitz constant of the function $f$. For example, if the domain of definition is $\mathbb{R}$, Lipschitz continuity limits the absolute value of the derivative to no greater than $K$. In other words, Lipschitz continuity restricts the maximum local fluctuation range of a continuous function.
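The constant can be estimated empirically as the largest slope between sampled points; this is a lower bound on the true constant that becomes tight on a fine grid. A small sketch (the interval and test functions are illustrative):

```python
import numpy as np

# Estimate the Lipschitz constant of f on [lo, hi] as the largest
# slope |f(x1) - f(x2)| / |x1 - x2| between adjacent grid points.
def lipschitz_estimate(f, lo=-3.0, hi=3.0, n=20_000):
    x = np.linspace(lo, hi, n)
    y = f(x)
    return np.max(np.abs(np.diff(y) / np.diff(x)))

print(lipschitz_estimate(np.sin))            # ~1, since |cos x| <= 1
print(lipschitz_estimate(np.tanh))           # ~1, since 0 < sech^2 x <= 1
print(lipschitz_estimate(lambda x: 3 * x))   # ~3
```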
- This dual form takes the supremum under the condition that the Lipschitz constant of the function $f$ is no greater than $K$.
- Represent the function $f$ as a neural network $f_w$ parameterised with $w$
- Meanwhile, $f_w$ needs to satisfy $\|f_w\|_L \le K$, but we don't really care about the value of $K$, since it only scales the gradient and does not affect its direction. We can clip the parameters $w$ to a fixed range $[-c, c]$; the derivatives of $f_w$ are then also limited within a certain range, so there must exist an unknown constant $K$ larger than the fluctuation range of $f_w$, satisfying Lipschitz continuity.
- The loss of the discriminator (with the last sigmoid layer removed):
  $$L_D = \mathbb{E}_{x \sim P_g}[f_w(x)] - \mathbb{E}_{x \sim P_r}[f_w(x)]$$
- D's goal: maximise $\mathbb{E}_{x \sim P_r}[f_w(x)] - \mathbb{E}_{x \sim P_g}[f_w(x)]$, the estimated Wasserstein distance between the real data distribution and the generative distribution
- The loss of the generator:
  $$L_G = -\mathbb{E}_{x \sim P_g}[f_w(x)]$$
- G's goal: minimise the Wasserstein distance between the real data distribution and the generative distribution
Compare to GAN
- Remove the last sigmoid layer in the discriminator
- No log is applied to the loss
- Limit the parameters within a certain range (weight clipping)
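The three changes can be sketched in a single critic update. This is a minimal numpy illustration with a linear "critic" on toy 2-D data (all shapes, data, and the learning rate are illustrative; only the clipping threshold c = 0.01 is the value used in the WGAN paper):

```python
import numpy as np

rng = np.random.default_rng(0)
c = 0.01  # weight-clipping threshold from the WGAN paper

# A tiny linear critic f_w(x) = x @ w + b: no sigmoid at the output.
w = rng.normal(size=2)
b = 0.0

def critic(x):
    return x @ w + b

real = rng.normal(loc=1.0, size=(64, 2))
fake = rng.normal(loc=-1.0, size=(64, 2))

# Critic loss: L_D = E_fake[f_w] - E_real[f_w]  (no log, no sigmoid);
# minimising it maximises the Wasserstein estimate E_real - E_fake.
loss = critic(fake).mean() - critic(real).mean()

# One gradient step on w, followed by clipping to [-c, c].
grad_w = fake.mean(axis=0) - real.mean(axis=0)
w = w - 0.05 * grad_w
w = np.clip(w, -c, c)  # crude enforcement of the Lipschitz constraint

print(loss)  # scalar critic loss before the update
print(w)     # every entry now lies in [-c, c]
```

In a full training loop this critic step is repeated several times per generator step, and the generator is then updated on $L_G = -\mathbb{E}_{x \sim P_g}[f_w(x)]$.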