# References to Some of the Ideas Used in This Work

0-GP: Improving Generalization and Stability of Generative Adversarial Networks
- Gradient explosion in the discriminator* can lead to mode collapse in the generator (mathematical justification in the article)
- The number of modes in the distribution grows linearly with the size of the discriminator -> higher-capacity discriminators are needed to better approximate the target distribution.
- Generalization is guaranteed if the discriminator set is small enough.
- To smooth out the loss surface, one can build a discriminator that judges a mixed batch of fake and real samples, predicting the proportion between them (Lucas et al., 2018)
- VEEGAN (Srivastava et al., 2017) uses the inverse mapping of the generator to map the data to the prior distribution. The mismatch between the inverse mapping and the prior is used to detect mode collapse. It cannot help if the generator can memorize the entire dataset
- The generalization capability of the discriminator can be estimated by measuring the gap between its performance on the training dataset and on a held-out dataset
- When the generator starts to produce samples of the same quality as the real ones, the discriminator effectively has to deal with mislabeled data: generated samples, no matter how good they are, are still labeled as fake, so a discriminator trained on such a dataset overfits and can no longer teach the generator
- Heuristically, overfitting can be alleviated by limiting the number of discriminator updates per generator update. Goodfellow et al. (2014) recommended updating the discriminator once per generator update
- It is observed that the norm of the gradient w.r.t. the discriminator's parameters decreases as fake samples approach real samples. If the discriminator's learning rate is fixed, the number of gradient descent steps it needs to reach an eps-optimal state should therefore increase. Alternating gradient descent with the same learning rate for the discriminator and the generator, and a fixed number of discriminator updates per generator update (Fixed-Alt-GD), cannot maintain the (empirical) optimality of the discriminator. In GANs trained with the Two Timescale Update Rule (TTUR) (Heusel et al., 2017), the ratio between the learning rate of the discriminator and that of the generator goes to infinity as the iteration number goes to infinity. Therefore, the discriminator can learn much faster than the generator and might be able to maintain its optimality throughout the learning process (a minimal TTUR-style sketch follows below).
____________________________________
* in case of empirically optimal D
NOTE: All references to the authors in the text block above are direct copies of references that can be found in the article.
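
To make the TTUR point above concrete, here is a minimal sketch of a two-timescale training loop in PyTorch: the discriminator gets a larger learning rate than the generator while keeping one D update per G update. The toy networks, the 4e-4/1e-4 learning-rate pair, and the stand-in data are illustrative assumptions, not values from the paper or from this repository.

```python
import torch
import torch.nn as nn

# Toy 2-D data and toy networks; a real setup would use proper G/D architectures.
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))
D = nn.Sequential(nn.Linear(2, 32), nn.LeakyReLU(0.2), nn.Linear(32, 1))

# TTUR: the discriminator learns on a faster timescale than the generator.
opt_d = torch.optim.Adam(D.parameters(), lr=4e-4, betas=(0.0, 0.9))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.0, 0.9))
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, 2) + 3.0              # stand-in for real samples
    fake = G(torch.randn(64, 8))

    # One discriminator update per generator update (Fixed-Alt-GD schedule).
    opt_d.zero_grad()
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    d_loss.backward()
    opt_d.step()

    # Non-saturating generator loss.
    opt_g.zero_grad()
    g_loss = bce(D(fake), torch.ones(64, 1))
    g_loss.backward()
    opt_g.step()
```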

CBN (Conditional Batch Normalization, sketched below):
1. Modulating early visual processing by language
2. A Learned Representation For Artistic Style
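
A minimal sketch of what a conditional batch-norm layer typically looks like in PyTorch: the affine scale and shift are predicted from a conditioning vector (e.g. a text or class embedding) instead of being free per-channel parameters. Module and argument names are illustrative, not the ones used in this repository.

```python
import torch
import torch.nn as nn

class ConditionalBatchNorm2d(nn.Module):
    """BatchNorm2d whose gamma/beta are predicted from a conditioning vector."""
    def __init__(self, num_features, cond_dim):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features, affine=False)
        self.gamma = nn.Linear(cond_dim, num_features)  # predicted scale
        self.beta = nn.Linear(cond_dim, num_features)   # predicted shift

    def forward(self, x, cond):
        # x: (B, C, H, W), cond: (B, cond_dim)
        out = self.bn(x)
        g = self.gamma(cond).unsqueeze(-1).unsqueeze(-1)
        b = self.beta(cond).unsqueeze(-1).unsqueeze(-1)
        # The "1 + g" parameterization keeps the layer close to plain BN when g is small.
        return (1.0 + g) * out + b
```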
BN: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
ResBlocks: Deep Residual Learning for Image Recognition
ProjDisc: cGANs with Projection Discriminator
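
A minimal sketch of the projection-discriminator head, assuming the image features phi(x) have already been pooled to a vector: an unconditional logit plus an inner product between the features and an embedded condition. Names and dimensions are assumptions for illustration, not this repository's actual module.

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Projection discriminator output: psi(phi(x)) + <embed(y), phi(x)>."""
    def __init__(self, feat_dim, cond_dim):
        super().__init__()
        self.psi = nn.Linear(feat_dim, 1)           # unconditional logit
        self.embed = nn.Linear(cond_dim, feat_dim)  # maps the condition into feature space

    def forward(self, phi_x, cond):
        # phi_x: (B, feat_dim) pooled image features, cond: (B, cond_dim)
        uncond = self.psi(phi_x)
        proj = (self.embed(cond) * phi_x).sum(dim=1, keepdim=True)
        return uncond + proj
```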
ConvGRU: Convolutional Gated Recurrent Networks for Video Segmentation
Basic Ideas for Text Encoders: Realistic Image Generation using Region-phrase Attention
D/G Blocks' Structure: Large Scale GAN Training for High Fidelity Natural Image Synthesis
- Employing Spectral Normalization in G improves stability, allowing for fewer D steps per iteration (see the sketch after this list).
- A larger batch size can help with mode collapse and improve network performance, though it might lead to training collapse (NaNs)
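
As mentioned in the first bullet above, spectral normalization can be applied to the generator's weight layers as well as the discriminator's. A minimal sketch using PyTorch's built-in `torch.nn.utils.spectral_norm`; the block layout and channel sizes are illustrative, not this repository's actual generator block.

```python
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import spectral_norm

class UpGBlock(nn.Module):
    """Upsampling residual generator block with spectrally normalized convolutions."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = spectral_norm(nn.Conv2d(in_ch, out_ch, 3, padding=1))
        self.conv2 = spectral_norm(nn.Conv2d(out_ch, out_ch, 3, padding=1))
        self.skip = spectral_norm(nn.Conv2d(in_ch, out_ch, 1))
        self.act = nn.ReLU()

    def forward(self, x):
        h = F.interpolate(x, scale_factor=2, mode="nearest")   # upsample
        res = self.conv2(self.act(self.conv1(self.act(h))))    # residual branch
        return res + self.skip(h)                               # skip connection
```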
Joint Structured Embeddings:
1. Learning Deep Representations of Fine-Grained Visual Descriptions
2. also
Concatenate by Stacking: StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks
Self-Attention: A Structured Self-attentive Sentence Embedding
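
A minimal sketch of the structured self-attention from that paper, assuming the text encoder's hidden states H are given: attention weights A = softmax(W_s2 tanh(W_s1 H^T)) produce several weighted sums over time steps, which are concatenated into a fixed-size sentence embedding. Sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StructuredSelfAttention(nn.Module):
    """Sentence embedding via multiple attention hops over encoder hidden states."""
    def __init__(self, d_hidden=256, d_att=128, n_hops=4):
        super().__init__()
        self.w_s1 = nn.Linear(d_hidden, d_att, bias=False)
        self.w_s2 = nn.Linear(d_att, n_hops, bias=False)

    def forward(self, h):
        # h: (B, T, d_hidden) hidden states of the text encoder
        a = torch.softmax(self.w_s2(torch.tanh(self.w_s1(h))), dim=1)  # (B, T, n_hops)
        m = a.transpose(1, 2) @ h                                      # (B, n_hops, d_hidden)
        return m.flatten(1)                                            # (B, n_hops * d_hidden)
```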