Update Log
2021-07-14: Initial Upload

Generative Adversarial Network

  • A generator G and a discriminator D

    • G’s goal: minimise the objective so that D(G(z)) is close to 1 (the discriminator is fooled into thinking the generated G(z) is real)
    • D’s goal: maximise the objective so that D(x) is close to 1 (real) and D(G(z)) is close to 0 (fake)
    • G and D are both neural networks, G is differentiable
    • Alternate between “Gradient Ascent on D” and “Gradient Descent on G”
    • Optimisation target:

    \min _{G} \max _{D} V(D, G)=\mathbb{E}_{\boldsymbol{x} \sim p_{\text {data }}(\boldsymbol{x})}[\log D(\boldsymbol{x})]+\mathbb{E}_{\boldsymbol{z} \sim p_{\boldsymbol{z}}(\boldsymbol{z})}[\log (1-D(G(\boldsymbol{z})))]

    • The following figure shows the process of optimisation. In effect, we are looking for a map between z and x. z could follow any distribution, but normally we use a normal distribution. The purpose of a GAN is to find a way to transform the distribution from which z is sampled into the target distribution (p_{data}). Ideally, at the last step, the distribution of the generated data (p_g) is exactly the same as p_{data}, so D is not able to discriminate the generated data from the real data and hence outputs 0.5.
      [Figure: evolution of p_g toward p_{data} during training]
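The value function above can be sketched numerically. Below is a minimal Monte-Carlo estimate of V(D, G) in plain Python; the functions passed in for D and G are hypothetical placeholders, not trained networks. With a maximally confused discriminator that outputs 0.5 everywhere, the estimate equals -log 4, the theoretical optimum derived further down in these notes.

```python
import math
import random

def value_fn(d, g, x_real, z_batch):
    """Monte-Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]."""
    term_real = sum(math.log(d(x)) for x in x_real) / len(x_real)
    term_fake = sum(math.log(1.0 - d(g(z))) for z in z_batch) / len(z_batch)
    return term_real + term_fake

# A maximally confused discriminator outputs 0.5 for every input, giving
# V = log(0.5) + log(0.5) = -log 4, the optimum at p_g = p_data.
d_confused = lambda x: 0.5
g_identity = lambda z: z  # placeholder generator
xs = [random.gauss(0, 1) for _ in range(100)]
zs = [random.gauss(0, 1) for _ in range(100)]
v = value_fn(d_confused, g_identity, xs, zs)  # -> -log 4 exactly
```

Since `d_confused` ignores its input, the estimate is exact regardless of the samples drawn.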
  • Background Knowledge

    • KL divergence - measures the similarity of two distributions; a smaller value means they are closer.
      • discrete

      D_{KL}(P \| Q)=\sum_{i} P(i) \log \frac{P(i)}{Q(i)}=\mathbb{E}\left[\log\frac{P(i)}{Q(i)}\right]

      • continuous

      D_{KL}(P \| Q)=\int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} d x=\mathbb{E}\left[\log\frac{p(x)}{q(x)}\right]

      • Note that they can be expressed as expectations.
    • JS divergence - another measure of distribution similarity, derived from the KL divergence.

      D_{JS}(P\|Q)=\frac{1}{2} \mathrm{KL}(p \| m)+\frac{1}{2} \mathrm{KL}(q \| m),\ \text{where}\ m(x)=\frac{1}{2}(p(x)+q(x))
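The two divergences above are easy to compute for discrete distributions. A small pure-Python sketch, assuming each distribution is given as a probability list over the same support:

```python
import math

def kl(p, q):
    """Discrete KL divergence D_KL(P || Q); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """JS divergence via the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# JS is zero for identical distributions and saturates at log 2
# for distributions with disjoint support.
js_same = js([0.5, 0.5], [0.5, 0.5])      # -> 0.0
js_disjoint = js([1.0, 0.0], [0.0, 1.0])  # -> log 2
```

Note that unlike KL, JS is symmetric and always finite, since the mixture m covers the support of both p and q.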

  • Detailed Analysis

    • Consider the aforementioned optimisation target

    • Train D while fixing G to maximise V

      • If a real sample is misclassified, i.e. D(x) outputs a value near 0, then log(D(x)) << 0, which drives the first term of the expectation toward -\infty.
      • Similarly, if a generated sample is misclassified, i.e. D(G(z)) outputs a value near 1, the second term goes toward -\infty.
    • Train G to minimise V

      • Ignore the first term since it does not contain G, and the target becomes

      \min\limits_G\mathbb{E}_{\boldsymbol{z} \sim p_{\boldsymbol{z}}(\boldsymbol{z})}[\log (1-D(G(\boldsymbol{z})))] \Leftrightarrow \max\limits_G\mathbb{E}_{\boldsymbol{z} \sim p_{\boldsymbol{z}}(\boldsymbol{z})}[\log (D(G(\boldsymbol{z})))]

      (both forms push D(G(z)) toward 1; the non-saturating form on the right gives stronger gradients early in training)

      • The optimal value of D is

      D_{G}^{*}(\boldsymbol{x})=\frac{p_{\text {data }}(\boldsymbol{x})}{p_{\text {data }}(\boldsymbol{x})+p_{g}(\boldsymbol{x})}

      • Under this condition, the optimal situation for G is that the distribution of the generated data is exactly the same as that of the real data, i.e. p_{data}(x)=p_g(x), which makes D_G^*(x)=1/2.
      • Now go back and check the optimal G: let

      \begin{aligned}C(G)&=\max _{D} V(G, D) \\&=\max _{D} \int_{x} p_{\text {data }}(x) \log (D(x))+p_{g}(x) \log (1-D(x)) d x\end{aligned}

      Here we change variables from G(z) to x (assuming G is invertible). Plugging in the optimal D_G^*(x), we have

      \begin{aligned}
      C(G)&=\max_{D}V(G,D)\\
      &=\max_{D}\int_{x}p_{\text{data}}(x)\log(D(x))+p_{g}(x)\log(1-D(x))dx\\
      &=\int_{x}p_{\text{data}}(x)\log\left(D_{G}^{*}(x)\right)+p_{g}(x)\log\left(1-D_{G}^{*}(x)\right)dx\\
      &=\int_{x}p_{\text{data}}(x)\log\left(\frac{p_{\text{data}}(x)}{p_{\text{data}}(x)+p_{g}(x)}\right)+p_{g}(x)\log\left(\frac{p_{g}(x)}{p_{\text{data}}(x)+p_{g}(x)}\right)dx\\
      &=\int_{x}p_{\text{data}}(x)\log\left(\frac{p_{\text{data}}(x)}{\frac{p_{\text{data}}(x)+p_{g}(x)}{2}}\right)+p_{g}(x)\log\left(\frac{p_{g}(x)}{\frac{p_{\text{data}}(x)+p_{g}(x)}{2}}\right)dx-\log(4)\\
      &=KL\left[p_{\text{data}}(x)\middle\|\frac{p_{\text{data}}(x)+p_{g}(x)}{2}\right]+KL\left[p_{g}(x)\middle\|\frac{p_{\text{data}}(x)+p_{g}(x)}{2}\right]-\log(4)\\
      &=2\,JS(p_{\text{data}}\|p_{g})-\log(4)
      \end{aligned}

      Once the optimal D_G^*(x) has been plugged in, we can drop the max from the formula. The result reduces to two KL divergences minus log(4). Since KL divergence is non-negative, the minimum value of C(G) is -log(4), attained if and only if

      p_{data}(x)=\frac{p_{data}(x)+p_{g}(x)}{2}\Rightarrow p_{data}(x)=p_{g}(x)

      which proves that C(G) reaches its minimum (optimum) exactly when the generator produces the same distribution as the real data.
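The derivation can be sanity-checked numerically. The sketch below (pure Python, using hypothetical two-bin distributions) evaluates C(G) term by term with the optimal discriminator D* = p_data / (p_data + p_g) plugged in, and confirms it agrees with 2 JS(p_data || p_g) - log 4 and reaches -log 4 when the distributions coincide. The divergence helpers are restated so the block is self-contained.

```python
import math

def kl(p, q):
    """Discrete KL divergence; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """JS divergence via the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def c_of_g(p_data, p_g):
    """C(G) with the optimal discriminator D* = p_data / (p_data + p_g)."""
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        if pd > 0:
            total += pd * math.log(pd / (pd + pg))
        if pg > 0:
            total += pg * math.log(pg / (pd + pg))
    return total

p_data = [0.7, 0.3]
p_g = [0.4, 0.6]
# c_of_g(p_data, p_g) == 2 * js(p_data, p_g) - log 4 for any pair,
# and equals -log 4 exactly when p_g == p_data.
```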

Deep Convolutional GANs (DCGANs)

DCGAN modifies the original GAN mainly in the network structure: it uses CNNs instead of fully connected layers in both the generator and the discriminator, see the figure below.
[Figure: DCGAN generator architecture]
The primary improvements are listed below:

  • Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator).
  • Use batch normalisation in both the generator and the discriminator.
  • Remove fully connected hidden layers for deeper architectures, and use 1×1 conv layers instead.
  • Use ReLU activation in generator for all layers except for the output, which uses Tanh.
  • Use LeakyReLU activation in the discriminator for all layers.
  • Transposed convolution is used in the generator so that it can progressively upsample its feature maps into full-size images.
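To illustrate the upsampling path: the spatial output size of a transposed convolution (with no output padding) is out = (in - 1)·stride - 2·padding + kernel. With the common DCGAN setting of kernel 4, stride 2, padding 1, each layer exactly doubles the resolution. A framework-agnostic sketch:

```python
def conv_transpose_out(size_in, kernel, stride, padding):
    """Spatial output size of a transposed convolution (no output_padding)."""
    return (size_in - 1) * stride - 2 * padding + kernel

# Following the DCGAN generator: 4 -> 8 -> 16 -> 32 -> 64 with k=4, s=2, p=1.
sizes = [4]
for _ in range(4):
    sizes.append(conv_transpose_out(sizes[-1], kernel=4, stride=2, padding=1))
# sizes == [4, 8, 16, 32, 64]
```

This is why the generator can start from a small spatial projection of z and finish at image resolution after a fixed number of layers.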

Wasserstein GAN

  • Difficulty of GANs

    • Vanishing gradients
      • At the very early phase of training, it is easy for D to confidently detect G's samples, so D outputs values close to 0 for almost all generated data
      • In a GAN, a better discriminator leads to worse gradient vanishing in its generator
  • JS divergence

    • JS(P_{data}\|P_g)=\log 2 when the two distributions have no overlap
    • When p_{data} and p_g are supported on low-dimensional manifolds (as is typical for high-dimensional data such as images), the probability that their supports have negligible overlap is 1.
  • Wasserstein (Earth-Mover) distance

    W(P_{data}, P_{g})=\inf_{\gamma\in\Pi(P_{data}, P_{g})} \mathbb{E}_{(x, y)\sim\gamma}[\|x-y\|]

    • \Pi(P_{data},P_g) is the set of all possible joint distributions whose marginals are P_{data} and P_g, i.e. each \gamma\in\Pi(P_{data},P_g) has marginal distributions P_{data} and P_g
    • To intuitively understand the Wasserstein distance, imagine the distributions P_{data} and P_g as two piles of earth; the W-distance between them is the minimum cost needed to move one pile so that it forms the other.
    • Compared to the KL & JS divergences, it has the advantage that the distance between two distributions is smoothly reflected even if they do not overlap.
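A toy comparison makes this concrete: represent two point masses on a grid, one fixed and one shifted. JS stays at log 2 for every nonzero shift, while the 1-D Wasserstein distance (computed from the CDF difference, here with a hypothetical unit grid spacing) shrinks smoothly as the shift decreases:

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def w1(p, q):
    """1-D Wasserstein-1 distance on a unit-spaced grid: sum of |CDF_p - CDF_q|."""
    cdf_p = cdf_q = 0.0
    total = 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q)
    return total

def point_mass(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

# Moving the generated mass from position 3 to position 1 (target at 0):
# JS is stuck at log 2 for both, while W1 drops from 3 to 1, so only the
# W-distance gives the generator a usable training signal.
p_data = point_mass(0, 5)
far, near = point_mass(3, 5), point_mass(1, 5)
```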
  • Loss Function in WGAN

    • Why not directly use the W-distance as the optimisation target - the infimum is highly intractable

    • Convert the W-distance to its dual form

      \begin{aligned}W(P_{data}, P_{g})&=\sup_{\|f\|_{L}\le 1}\mathbb{E}_{x\sim P_{data}}[f(x)]-\mathbb{E}_{x\sim P_{g}}[f(x)] \\&=\frac{1}{K}\sup_{\|f\|_{L}\le K}\mathbb{E}_{x\sim P_{data}}[f(x)]-\mathbb{E}_{x\sim P_{g}}[f(x)]\end{aligned}

    • Lipschitz continuity - a constraint on a continuous function f: there exists a constant K\ge0 such that any two elements x_1 and x_2 satisfy |f(x_{1})-f(x_{2})| \leq K|x_{1}-x_{2}|. K is the Lipschitz constant of f. For example, if the domain is \mathbb{R} and f is differentiable, Lipschitz continuity limits the absolute value of the derivative to at most K. In other words, Lipschitz continuity restricts the maximum local fluctuation of a continuous function.

    • This dual form requires taking the supremum under the condition that the Lipschitz constant of the function f is no greater than K.

    • Represent the function f as a neural network parameterised by w

      W(P_{data}, P_{g})=\max _{w \in W} \mathbb{E}_{x \sim P_{data}}[f_{w}(x)]-\mathbb{E}_{x \sim P_{g}}[f_{w}(x)]

    • Meanwhile, w needs to satisfy the constraint \|f_w\|_L\le K, but we don’t really care about the value of K, since it only scales the gradient and does not affect its direction. We can clamp each parameter in w to a range [-c,c]; consequently the derivative \frac{\partial f_w}{\partial x} is also limited within a certain range, so there must exist an unknown constant K bounding the fluctuation of f_w, satisfying Lipschitz continuity.

    • The loss of the discriminator (remove the last sigmoid layer)

      L_d=\mathbb{E}_{x \sim P_{data}}[f_{w}(x)]-\mathbb{E}_{x \sim P_{g}}[f_{w}(x)]

    • f_w's goal: maximise L_d, so that it tightly estimates the Wasserstein distance between the real data distribution and the generative distribution

    • The loss of the generator

      L_g=-\mathbb{E}_{x \sim P_{g}}[f_{w}(x)]=-\mathbb{E}_{z \sim p(z)}[f_{w}(g_{\theta}(z))]

    • g_\theta's goal: minimise the Wasserstein distance between the real data distribution and the generative distribution
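The two pieces above, the critic objective and weight clipping, can be sketched with a hypothetical one-layer linear critic f_w(x) = w·x (the paper of course uses a deep network). Clamping every parameter to [-c, c] bounds the critic's Lipschitz constant, here by c·sqrt(dim(x)) in the \ell_2 sense:

```python
import math

def clip_weights(w, c):
    """WGAN-style weight clipping: clamp each parameter into [-c, c]."""
    return [max(-c, min(c, wi)) for wi in w]

def critic(w, x):
    """Hypothetical linear critic f_w(x) = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def critic_objective(w, real_batch, fake_batch):
    """E[f_w(x_real)] - E[f_w(x_fake)], the quantity the critic maximises."""
    e_real = sum(critic(w, x) for x in real_batch) / len(real_batch)
    e_fake = sum(critic(w, x) for x in fake_batch) / len(fake_batch)
    return e_real - e_fake

c = 0.01
w = clip_weights([0.5, -2.0, 0.003], c)  # -> [0.01, -0.01, 0.003]
# For this linear critic, |f(x1) - f(x2)| <= ||w||_2 * ||x1 - x2||_2, and
# clipping gives ||w||_2 <= c * sqrt(len(w)), so f_w is K-Lipschitz
# with K = c * sqrt(len(w)) -- the constraint the dual form requires.
```

In an actual training loop the clipping step runs after every gradient update of the critic, before the next batch is drawn.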

  • Comparison to GAN

    • Remove the last sigmoid in the discriminator
    • No log is applied to the loss
    • Limit the parameters w within a certain range [-c,c] (weight clipping)