Update Log
2021-07-14: Initial Upload

Generative Adversarial Network

  • A generator G and a discriminator D

    • G’s goal: minimise the objective so that D(G(z)) is close to 1 (the discriminator is fooled into thinking the generated G(z) is real)
    • D’s goal: maximise the objective so that D(x) is close to 1 (real) and D(G(z)) is close to 0 (fake)
    • G and D are both neural networks, G is differentiable
    • Alternate between “Gradient Ascent on D” and “Gradient Descent on G”
    • Optimisation target:

    \min _{G} \max _{D} V(D, G)=\mathbb{E}_{\boldsymbol{x} \sim p_{\text {data }}(\boldsymbol{x})}[\log D(\boldsymbol{x})]+\mathbb{E}_{\boldsymbol{z} \sim p_{\boldsymbol{z}}(\boldsymbol{z})}[\log (1-D(G(\boldsymbol{z})))]

    • The following figure shows the process of optimisation. In effect, we are looking for a map between z and x. z could follow any distribution, but normally we use a normal distribution. The purpose of a GAN is to find a way to transform the distribution from which z is sampled into the target distribution (p_{data}). Ideally, at the last step, the distribution of the generated data (p_g) is exactly the same as p_{data}, so D is not able to discriminate the generated data from the real data and hence outputs 0.5.
      [Figure: evolution of p_g toward p_{data} during training]
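The value function above can be sketched numerically. Below is a minimal Monte-Carlo estimate of V(D, G) in plain Python; the functions passed in for D and G are hypothetical placeholders, not trained networks. With a maximally confused discriminator that outputs 0.5 everywhere, the estimate equals -log 4, the theoretical optimum derived further down in these notes.

```python
import math
import random

def value_fn(d, g, x_real, z_batch):
    """Monte-Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]."""
    term_real = sum(math.log(d(x)) for x in x_real) / len(x_real)
    term_fake = sum(math.log(1.0 - d(g(z))) for z in z_batch) / len(z_batch)
    return term_real + term_fake

# A maximally confused discriminator outputs 0.5 for every input, giving
# V = log(0.5) + log(0.5) = -log 4, the optimum at p_g = p_data.
d_confused = lambda x: 0.5
g_identity = lambda z: z  # placeholder generator
xs = [random.gauss(0, 1) for _ in range(100)]
zs = [random.gauss(0, 1) for _ in range(100)]
v = value_fn(d_confused, g_identity, xs, zs)  # -> -log 4 exactly
```

Since `d_confused` ignores its input, the estimate is exact regardless of the samples drawn.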
  • Background Knowledge

    • KL divergence - measures the similarity of two distributions; a smaller value means they are closer.
      • discrete

      D_{KL}(P \| Q)=\sum_{i} P(i) \log \frac{P(i)}{Q(i)}=\mathbb{E}\left[\log\frac{P(i)}{Q(i)}\right]

      • continuous

      D_{KL}(P \| Q)=\int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} d x=\mathbb{E}\left[\log\frac{p(x)}{q(x)}\right]

      • Note that they can be expressed as expectations.
    • JS divergence - another measure of distribution similarity, derived from the KL divergence.

      D_{JS}(P\|Q)=\frac{1}{2} \mathrm{KL}(p \| m)+\frac{1}{2} \mathrm{KL}(q \| m),\ \text{where}\ m(x)=\frac{1}{2}(p(x)+q(x))
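The two divergences above are easy to compute for discrete distributions. A small pure-Python sketch, assuming each distribution is given as a probability list over the same support:

```python
import math

def kl(p, q):
    """Discrete KL divergence D_KL(P || Q); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """JS divergence via the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# JS is zero for identical distributions and saturates at log 2
# for distributions with disjoint support.
js_same = js([0.5, 0.5], [0.5, 0.5])      # -> 0.0
js_disjoint = js([1.0, 0.0], [0.0, 1.0])  # -> log 2
```

Note that unlike KL, JS is symmetric and always finite, since the mixture m covers the support of both p and q.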

  • Detailed Analysis

    • Consider the aforementioned optimisation target

    • Train D while fixing G to maximise V

      • If a real sample is misclassified, i.e. D(x) outputs a value near 0, then log(D(x)) << 0, which drives the first term of the expectation toward -\infty.
      • Similarly, if a generated sample is misclassified, i.e. D(G(z)) outputs a value near 1, the second term goes toward -\infty.
    • Train G to minimise V

      • Ignore the first term since it does not contain G, and the target becomes

      \min\limits_G\mathbb{E}_{\boldsymbol{z} \sim p_{\boldsymbol{z}}(\boldsymbol{z})}[\log (1-D(G(\boldsymbol{z})))] \Leftrightarrow \max\limits_G\mathbb{E}_{\boldsymbol{z} \sim p_{\boldsymbol{z}}(\boldsymbol{z})}[\log (D(G(\boldsymbol{z})))]

      (both forms push D(G(z)) toward 1; the non-saturating form on the right gives stronger gradients early in training)

      • The optimal value of D is

      D_{G}^{*}(\boldsymbol{x})=\frac{p_{\text {data }}(\boldsymbol{x})}{p_{\text {data }}(\boldsymbol{x})+p_{g}(\boldsymbol{x})}

      • Under this condition, the optimal situation for G is that the distribution of the generated data is exactly the same as that of the real data, i.e. p_{data}(x)=p_g(x), which makes D_G^*(x)=1/2.
      • Now go back and check the optimal G: let

      \begin{aligned}C(G)&=\max _{D} V(G, D) \\&=\max _{D} \int_{x} p_{\text {data }}(x) \log (D(x))+p_{g}(x) \log (1-D(x)) d x\end{aligned}

      Here we change variables from G(z) to x (assuming G is invertible). Plugging in the optimal D_G^*(x), we have

      \begin{aligned}
      C(G)&=\max_{D}V(G,D)\\
      &=\max_{D}\int_{x}p_{\text{data}}(x)\log(D(x))+p_{g}(x)\log(1-D(x))dx\\
      &=\int_{x}p_{\text{data}}(x)\log\left(D_{G}^{*}(x)\right)+p_{g}(x)\log\left(1-D_{G}^{*}(x)\right)dx\\
      &=\int_{x}p_{\text{data}}(x)\log\left(\frac{p_{\text{data}}(x)}{p_{\text{data}}(x)+p_{g}(x)}\right)+p_{g}(x)\log\left(\frac{p_{g}(x)}{p_{\text{data}}(x)+p_{g}(x)}\right)dx\\
      &=\int_{x}p_{\text{data}}(x)\log\left(\frac{p_{\text{data}}(x)}{\frac{p_{\text{data}}(x)+p_{g}(x)}{2}}\right)+p_{g}(x)\log\left(\frac{p_{g}(x)}{\frac{p_{\text{data}}(x)+p_{g}(x)}{2}}\right)dx-\log(4)\\
      &=KL\left[p_{\text{data}}(x)\middle\|\frac{p_{\text{data}}(x)+p_{g}(x)}{2}\right]+KL\left[p_{g}(x)\middle\|\frac{p_{\text{data}}(x)+p_{g}(x)}{2}\right]-\log(4)\\
      &=2\,JS(p_{\text{data}}\|p_{g})-\log(4)
      \end{aligned}

      Once the optimal D_G^*(x) has been plugged in, we can drop the max from the formula. The result reduces to two KL divergences minus log(4). Since KL divergence is non-negative, the minimum value of C(G) is -log(4), attained if and only if

      p_{data}(x)=\frac{p_{data}(x)+p_{g}(x)}{2}\Rightarrow p_{data}(x)=p_{g}(x)

      which proves that C(G) reaches its minimum (optimum) exactly when the generator produces the same distribution as the real data.
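The derivation can be sanity-checked numerically. The sketch below (pure Python, using hypothetical two-bin distributions) evaluates C(G) term by term with the optimal discriminator D* = p_data / (p_data + p_g) plugged in, and confirms it agrees with 2 JS(p_data || p_g) - log 4 and reaches -log 4 when the distributions coincide. The divergence helpers are restated so the block is self-contained.

```python
import math

def kl(p, q):
    """Discrete KL divergence; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """JS divergence via the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def c_of_g(p_data, p_g):
    """C(G) with the optimal discriminator D* = p_data / (p_data + p_g)."""
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        if pd > 0:
            total += pd * math.log(pd / (pd + pg))
        if pg > 0:
            total += pg * math.log(pg / (pd + pg))
    return total

p_data = [0.7, 0.3]
p_g = [0.4, 0.6]
# c_of_g(p_data, p_g) == 2 * js(p_data, p_g) - log 4 for any pair,
# and equals -log 4 exactly when p_g == p_data.
```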

Deep Convolutional GANs (DCGANs)

DCGAN modifies the original GAN mainly in the network structure: it uses CNNs instead of fully connected layers in both the generator and the discriminator, see the figure below.
[Figure: DCGAN generator architecture]
The primary improvements are listed below:

  • Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator).
  • Use batch normalisation in both the generator and the discriminator.
  • Remove fully connected hidden layers for deeper architectures, and use 1×1 conv layers instead.
  • Use ReLU activation in generator for all layers except for the output, which uses Tanh.
  • Use LeakyReLU activation in the discriminator for all layers.
  • Transposed convolution is used in the generator so that it can progressively upsample its feature maps into full-size images.
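To illustrate the upsampling path: the spatial output size of a transposed convolution (with no output padding) is out = (in - 1)·stride - 2·padding + kernel. With the common DCGAN setting of kernel 4, stride 2, padding 1, each layer exactly doubles the resolution. A framework-agnostic sketch:

```python
def conv_transpose_out(size_in, kernel, stride, padding):
    """Spatial output size of a transposed convolution (no output_padding)."""
    return (size_in - 1) * stride - 2 * padding + kernel

# Following the DCGAN generator: 4 -> 8 -> 16 -> 32 -> 64 with k=4, s=2, p=1.
sizes = [4]
for _ in range(4):
    sizes.append(conv_transpose_out(sizes[-1], kernel=4, stride=2, padding=1))
# sizes == [4, 8, 16, 32, 64]
```

This is why the generator can start from a small spatial projection of z and finish at image resolution after a fixed number of layers.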

Wasserstein GAN

  • Difficulty of GANs

    • Vanishing gradients
      • At the very early phase of training, it is easy for D to confidently detect G's samples, so D outputs values close to 0 for almost all generated data
      • In a GAN, a better discriminator leads to worse gradient vanishing in its generator
  • JS divergence

    • JS(P_{data}\|P_g)=\log 2 when the two distributions have no overlap
    • When p_{data} and p_g are supported on low-dimensional manifolds (as is typical for high-dimensional data such as images), the probability that their supports have negligible overlap is 1.
  • Wasserstein (Earth-Mover) distance

    W(P_{data}, P_{g})=\inf_{\gamma\in\Pi(P_{data}, P_{g})} \mathbb{E}_{(x, y)\sim\gamma}[\|x-y\|]

    • \Pi(P_{data},P_g) is the set of all possible joint distributions whose marginals are P_{data} and P_g, i.e. each \gamma\in\Pi(P_{data},P_g) has marginal distributions P_{data} and P_g
    • To intuitively understand the Wasserstein distance, imagine the distributions P_{data} and P_g as two piles of earth; the W-distance between them is the minimum cost needed to move one pile so that it forms the other.
    • Compared to the KL & JS divergences, it has the advantage that the distance between two distributions is smoothly reflected even if they do not overlap.
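A toy comparison makes this concrete: represent two point masses on a grid, one fixed and one shifted. JS stays at log 2 for every nonzero shift, while the 1-D Wasserstein distance (computed from the CDF difference, here with a hypothetical unit grid spacing) shrinks smoothly as the shift decreases:

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def w1(p, q):
    """1-D Wasserstein-1 distance on a unit-spaced grid: sum of |CDF_p - CDF_q|."""
    cdf_p = cdf_q = 0.0
    total = 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q)
    return total

def point_mass(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

# Moving the generated mass from position 3 to position 1 (target at 0):
# JS is stuck at log 2 for both, while W1 drops from 3 to 1, so only the
# W-distance gives the generator a usable training signal.
p_data = point_mass(0, 5)
far, near = point_mass(3, 5), point_mass(1, 5)
```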
  • Loss Function in WGAN

    • Why not directly use the W-distance as the optimisation target - the infimum is highly intractable

    • Convert the W-distance to its dual form

      \begin{aligned}W(P_{data}, P_{g})&=\sup_{\|f\|_{L}\le 1}\mathbb{E}_{x\sim P_{data}}[f(x)]-\mathbb{E}_{x\sim P_{g}}[f(x)] \\&=\frac{1}{K}\sup_{\|f\|_{L}\le K}\mathbb{E}_{x\sim P_{data}}[f(x)]-\mathbb{E}_{x\sim P_{g}}[f(x)]\end{aligned}

    • Lipschitz continuity - a constraint on a continuous function f: there exists a constant K\ge0 such that any two elements x_1 and x_2 satisfy |f(x_{1})-f(x_{2})| \leq K|x_{1}-x_{2}|. K is the Lipschitz constant of f. For example, if the domain is \mathbb{R} and f is differentiable, Lipschitz continuity limits the absolute value of the derivative to at most K. In other words, Lipschitz continuity restricts the maximum local fluctuation of a continuous function.

    • This dual form requires taking the supremum under the condition that the Lipschitz constant of the function f is no greater than K.

    • Represent the function f as a neural network parameterised by w

      W(P_{data}, P_{g})=\max _{w \in W} \mathbb{E}_{x \sim P_{data}}[f_{w}(x)]-\mathbb{E}_{x \sim P_{g}}[f_{w}(x)]

    • Meanwhile, w needs to satisfy the constraint \|f_w\|_L\le K, but we don’t really care about the value of K, since it only scales the gradient and does not affect its direction. We can clamp each parameter in w to a range [-c,c]; consequently the derivative \frac{\partial f_w}{\partial x} is also limited within a certain range, so there must exist an unknown constant K bounding the fluctuation of f_w, satisfying Lipschitz continuity.

    • The loss of the discriminator (remove the last sigmoid layer)

      L_d=\mathbb{E}_{x \sim P_{data}}[f_{w}(x)]-\mathbb{E}_{x \sim P_{g}}[f_{w}(x)]

    • f_w's goal: maximise L_d, so that it tightly estimates the Wasserstein distance between the real data distribution and the generative distribution

    • The loss of the generator

      L_g=-\mathbb{E}_{x \sim P_{g}}[f_{w}(x)]=-\mathbb{E}_{z \sim p(z)}[f_{w}(g_{\theta}(z))]

    • g_\theta's goal: minimise the Wasserstein distance between the real data distribution and the generative distribution
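The two pieces above, the critic objective and weight clipping, can be sketched with a hypothetical one-layer linear critic f_w(x) = w·x (the paper of course uses a deep network). Clamping every parameter to [-c, c] bounds the critic's Lipschitz constant, here by c·sqrt(dim(x)) in the \ell_2 sense:

```python
import math

def clip_weights(w, c):
    """WGAN-style weight clipping: clamp each parameter into [-c, c]."""
    return [max(-c, min(c, wi)) for wi in w]

def critic(w, x):
    """Hypothetical linear critic f_w(x) = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def critic_objective(w, real_batch, fake_batch):
    """E[f_w(x_real)] - E[f_w(x_fake)], the quantity the critic maximises."""
    e_real = sum(critic(w, x) for x in real_batch) / len(real_batch)
    e_fake = sum(critic(w, x) for x in fake_batch) / len(fake_batch)
    return e_real - e_fake

c = 0.01
w = clip_weights([0.5, -2.0, 0.003], c)  # -> [0.01, -0.01, 0.003]
# For this linear critic, |f(x1) - f(x2)| <= ||w||_2 * ||x1 - x2||_2, and
# clipping gives ||w||_2 <= c * sqrt(len(w)), so f_w is K-Lipschitz
# with K = c * sqrt(len(w)) -- the constraint the dual form requires.
```

In an actual training loop the clipping step runs after every gradient update of the critic, before the next batch is drawn.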

  • Comparison to GAN

    • Remove the last sigmoid in the discriminator
    • No log is applied to the loss
    • Limit the parameters w within a certain range [-c,c] (weight clipping)