0-GP: Improving Generalization and Stability of Generative Adversarial Networks
- Gradient explosion in the discriminator* can lead to mode collapse in the generator (mathematical justification in the article); a gradient-penalty sketch addressing this follows the note block below.
- The number of modes in the distribution grows linearly with the size of the discriminator, so higher-capacity discriminators are needed to better approximate the target distribution.
- Generalization is guaranteed if the discriminator set is small enough.
- To smooth out the loss surface, one can build a discriminator that judges a mixed batch of fake and real samples, predicting the proportion between them (Lucas et al., 2018); see the mixed-batch sketch after the note block.
- VEEGAN (Srivastava et al., 2017) uses an inverse mapping of the generator to map the data back to the prior distribution. The mismatch between the inverse mapping and the prior is used to detect mode collapse. It cannot help if the generator can memorize the entire dataset.
- The generalization capability of the discriminator can be estimated by measuring the difference between its performance on the training dataset and on a held-out dataset (a sketch of this check follows the note block).
- When the generator starts to produce samples of the same quality as the real ones, the discriminator has to deal with mislabeled data: generated samples, regardless of how good they are, are still labeled as fake, so a discriminator trained on such a dataset will overfit and will no longer be able to teach the generator.
- Heuristically, overfitting can be alleviated by limiting the number of discriminator updates per generator update. Goodfellow et al. (2014) recommended updating the discriminator once per generator update.
- It is observed that the norm of the gradient w.r.t. the discriminator’s parameters decreases as fake samples approach real samples. If the discriminator’s learning rate is fixed, the number of gradient descent steps the discriminator has to take to reach an eps-optimal state should therefore increase. Alternating gradient descent with the same learning rate for the discriminator and the generator and a fixed number of discriminator updates per generator update (Fixed-Alt-GD) cannot maintain the (empirical) optimality of the discriminator. In GANs trained with the Two Timescale Update Rule (TTUR) (Heusel et al., 2017), the ratio between the learning rate of the discriminator and that of the generator goes to infinity as the iteration number goes to infinity. Therefore, the discriminator can learn much faster than the generator and might be able to maintain its optimality throughout the learning process (see the training-loop sketch after the note block).
____________________________________
* in the case of an empirically optimal D
NOTE: All references to the authors in the text block above
are direct copies of references that can be found in the article
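
A minimal PyTorch sketch of the zero-centered gradient penalty ("0-GP") the title refers to, computed on interpolations between real and fake samples to keep the discriminator's gradients from exploding. The penalty weight and the interpolation scheme below are assumptions for illustration, not values taken from the article.

```python
import torch

def zero_centered_gp(discriminator, real, fake, gp_weight=10.0):
    """Zero-centered gradient penalty on real/fake interpolations (sketch).

    Pushes ||grad_x D(x_hat)|| towards 0, unlike WGAN-GP, which pushes it
    towards 1. Assumes image-shaped (4-D) inputs; gp_weight is illustrative.
    """
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    grads = torch.autograd.grad(
        outputs=discriminator(x_hat).sum(), inputs=x_hat, create_graph=True
    )[0]
    return gp_weight * grads.flatten(1).pow(2).sum(dim=1).mean()
```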
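
A rough sketch of the mixed-batch idea (Lucas et al., 2018): the discriminator looks at a whole batch mixing real and fake samples and regresses the proportion of real samples in it. The mean-pooling over the batch and the MSE target below are my simplifying assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedBatchDiscriminator(nn.Module):
    """Judges a mixed batch as a whole and predicts its real-sample proportion."""

    def __init__(self, feature_extractor, feat_dim):
        super().__init__()
        self.features = feature_extractor        # per-sample feature network
        self.head = nn.Linear(feat_dim, 1)       # batch-level regressor

    def forward(self, batch):
        f = self.features(batch)                 # (B, feat_dim)
        pooled = f.mean(dim=0)                   # aggregate over the batch
        return torch.sigmoid(self.head(pooled))  # predicted real fraction

# Usage sketch: mix k real samples with B - k fakes and regress the target k / B.
# mixed = torch.cat([real[:k], fake[:B - k]])
# loss = F.mse_loss(disc(mixed), torch.tensor([k / B], device=mixed.device))
```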
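
A small sketch of the generalization check from the bullet above: compare the discriminator's loss on training reals and on held-out reals against the same fakes; a large gap indicates overfitting. All names here are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def discriminator_gap(disc, train_real, heldout_real, fake):
    """Difference between D's loss on held-out reals and on training reals."""
    def d_loss(real):
        real_logits, fake_logits = disc(real), disc(fake)
        return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
                + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    # A large positive gap means D fits the training reals much better than
    # unseen reals, i.e. it is overfitting rather than generalizing.
    return (d_loss(heldout_real) - d_loss(train_real)).item()
```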
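
A schematic training loop covering the last two bullets: Fixed-Alt-GD corresponds to lr_d == lr_g with a fixed n_critic, while a TTUR-style setup simply uses a larger learning rate for D than for G. The concrete hyperparameters (Adam betas, the 4e-4 / 1e-4 pair) are common illustrative choices, not prescriptions from the article.

```python
import torch

def train(gen, disc, loader, noise_dim, n_critic=1, lr_g=1e-4, lr_d=4e-4, device="cpu"):
    """Alternating GD: lr_d == lr_g gives Fixed-Alt-GD, lr_d > lr_g is TTUR-style.

    Assumes disc returns (B, 1) logits and gen maps (B, noise_dim) noise to samples.
    """
    opt_g = torch.optim.Adam(gen.parameters(), lr=lr_g, betas=(0.0, 0.9))
    opt_d = torch.optim.Adam(disc.parameters(), lr=lr_d, betas=(0.0, 0.9))
    bce = torch.nn.BCEWithLogitsLoss()

    for real in loader:
        real = real.to(device)
        ones = torch.ones(real.size(0), 1, device=device)
        zeros = torch.zeros(real.size(0), 1, device=device)

        # n_critic discriminator updates per generator update
        for _ in range(n_critic):
            fake = gen(torch.randn(real.size(0), noise_dim, device=device)).detach()
            d_loss = bce(disc(real), ones) + bce(disc(fake), zeros)
            opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # one generator update (non-saturating loss)
        fake = gen(torch.randn(real.size(0), noise_dim, device=device))
        g_loss = bce(disc(fake), ones)
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```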
CBN:
1. Modulating early visual processing by language
2. A Learned Representation For Artistic Style
BN: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
ResBlocks: Deep Residual Learning for Image Recognition
ProjDisc: cGANs with Projection Discriminator
ConvGRU: Convolutional Gated Recurrent Networks for Video Segmentation
Basic Ideas for Text Encoders: Realistic Image Generation using Region-phrase Attention
D/G Blocks' Structure: Large Scale GAN Training for High Fidelity Natural Image Synthesis
- Employing Spectral Normalization in G improves stability, allowing for fewer D steps per iteration (see the sketch below).
- A greater batch size can help with mode collapse and improve network performance, though it might lead to training collapse (NaNs).
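
A quick sketch of wrapping generator layers with torch.nn.utils.spectral_norm, in the spirit of the bullet above; the layer shapes and block layout are placeholders, not the paper's architecture.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Spectral normalization applied to G's weight layers (channel sizes are placeholders).
gen_block = nn.Sequential(
    spectral_norm(nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1)),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    spectral_norm(nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1)),
    nn.Tanh(),
)
```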
Joint Structured Embeddings:
1. Learning Deep Representations of Fine-Grained Visual Descriptions
2. also
Concatenate by Stacking: StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks
Self-Attention: A Structured Self-attentive Sentence Embedding