Authors: Yu Xin, Samartha Ramkumar
Our presentation is available on YouTube: https://youtu.be/0Mv9eMjMYIQ
In this project, we explored the internals of the Denoising Diffusion Probabilistic Model (DDPM), a diffusion-based image generation model, and implemented it in PyTorch. Finally, we trained the model on TinyImageNet, which has 200 different classes, and evaluated the results using the FID score.
Deep Learning based image generation techniques have been under active research, as they are useful in many situations, ranging from helping artists develop creative ideas to generating photorealistic human faces from textual descriptions for law enforcement. These techniques come primarily in four flavors: Generative Adversarial Networks (GAN), Variational Autoencoders (VAE), flow-based methods, and diffusion-based methods.
The four major approaches to image generation (source)
Compared to other approaches, diffusion-based methods are powerful because they balance tractability (i.e., how interpretable the model is) and flexibility (i.e., how expressive the model is), which has been a major difficulty for other approaches (cite). Diffusion-based methods were also developed quite recently, making them one of the most actively researched areas in computer vision and deep learning.
To the best of our knowledge, the Denoising Diffusion Probabilistic Model (abbreviated as DDPM) is the first approach to apply the diffusion method to the image generation task. It proposes a Markov chain model for the distributions of images at different noise levels. It uses an analytical process to add noise to an image and a neural network to remove the noise. After training, new images can be generated by applying the denoising neural network starting from random noise. This model has since been improved in many ways, from speeding up the sampling process (DDIM) to improving the quality of generation (Improved Diffusion). Recently, researchers have also found ways to incorporate thematically useful information (such as English text) into the diffusion generation process to control the images it generates (DALLE, GLIDE, IMAGEN, Stable Diffusion). Despite these impressive improvements, the diffusion process itself is largely unchanged. Therefore, we explore the original DDPM to gain insight into diffusion-based models.
The DDPM generates images by starting from random noise and iteratively reducing the noise by applying a denoising neural network. More specifically, it defines a Markov chain that pairs a fixed noising (forward) process $q(x_t \mid x_{t-1})$ with a learned denoising (reverse) process $p_\theta(x_{t-1} \mid x_t)$, realized by a neural network with parameters $\theta$.
Illustration from the DDPM paper
The noisy image $x_T$ is assumed to be drawn from an isotropic Gaussian distribution $\mathcal{N}(0, I)$.
In a nutshell, we slowly and systematically corrupt the inherent structure in a data distribution with an iterative forward diffusion process, using noise sampling. This is followed by a neural network (a U-Net in this case) that learns to restore the lost structure during the reverse diffusion process. This yields a tractable generative model of the data.
We start with the original image and iteratively add noise at each step, sampling the noise from a Normal distribution. After sufficiently many iterations, the final image follows an isotropic Gaussian distribution. We do not use the same noise magnitude at every timestep of the forward process; a schedule scales the mean and variance to avoid variance explosion as the noise increases. In the reverse diffusion process, the neural network learns to remove the noise step by step. This way, after the model has completed training, when we feed it pure noise sampled from the Normal distribution, it gradually removes the noise over the specified timesteps and produces a clear output image.
The forward process is the noise-adding process. It has no learnable parameters in this implementation. We define

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right),$$

where $\{\beta_t\}_{t=1}^{T}$ is a fixed variance schedule (see the sketch below).
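As a concrete sketch (our assumption, following the linear schedule from the DDPM paper with $\beta_1 = 10^{-4}$ and $\beta_T = 0.02$), the schedule and its derived quantities can be precomputed in PyTorch:

```python
import torch

# Linear variance schedule, as in the DDPM paper: beta_1 = 1e-4, beta_T = 0.02.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # beta_t for t = 1..T (0-indexed here)
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)    # alpha_bar_t = prod_{s<=t} alpha_s
```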
Using a reparameterization trick, we can sample $x_t$ at any timestep $t$ directly from $x_0$ in closed form:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,$$

where $\alpha_t = 1-\beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, and $\epsilon \sim \mathcal{N}(0, I)$.
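A minimal PyTorch sketch of this closed-form forward sampling (the function name `q_sample` is our choice; it reuses the precomputed `alpha_bars` from the snippet above):

```python
def q_sample(x0, t, noise):
    """Sample x_t ~ q(x_t | x_0) directly via the reparameterization trick.

    x0:    clean images, shape (B, C, H, W), values in [-1, 1]
    t:     integer timesteps, shape (B,)
    noise: epsilon ~ N(0, I), same shape as x0
    """
    ab = alpha_bars.to(x0.device)[t]                     # alpha_bar_t per sample
    sqrt_ab = ab.sqrt().view(-1, 1, 1, 1)
    sqrt_one_minus_ab = (1.0 - ab).sqrt().view(-1, 1, 1, 1)
    return sqrt_ab * x0 + sqrt_one_minus_ab * noise
```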
The reverse process removes noise starting at $x_T \sim \mathcal{N}(0, I)$. Each step is modeled as a Gaussian

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right),$$

where the mean $\mu_\theta$ (and, in principle, the variance $\Sigma_\theta$) is produced by a neural network with parameters $\theta$.
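For context (standard DDPM algebra, not spelled out in the original post): when conditioned on $x_0$, the true posterior of the forward chain is tractable, and it is this Gaussian that the learned reverse step is trained to match:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right), \quad \tilde{\mu}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t, \quad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$$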
Using some algebraic tricks, we can ask the neural network to predict the noise instead of the image. Given a noise predictor $\epsilon_\theta(x_t, t)$, the reverse-step mean becomes

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right),$$

and the simplified training objective reduces to matching the injected noise, $\mathbb{E}_{t, x_0, \epsilon}\left[\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\right]$.
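A minimal sketch of this objective in PyTorch (we trained with L1 loss rather than the squared error above; `q_sample` and `T` come from the earlier snippets, and `model` is the noise predictor):

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0):
    """One evaluation of the noise-prediction objective."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)  # uniform random timestep
    noise = torch.randn_like(x0)                     # epsilon ~ N(0, I)
    x_t = q_sample(x0, t, noise)                     # closed-form forward sample
    pred = model(x_t, t)                             # epsilon_theta(x_t, t)
    return F.l1_loss(pred, noise)                    # L1 variant of the objective
```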
A U-Net architecture was chosen as the noise predictor $\epsilon_\theta$.
The original U-Net architecture. DDPM modifies each block but retains the same high-level architecture.
The variance $\Sigma_\theta(x_t, t)$ is not learned; following the DDPM paper, it is fixed to $\sigma_t^2 I$ with $\sigma_t^2 = \beta_t$.
With all of the above, the training and sampling algorithms can be defined as follows.
Training and Inference Algorithms suggested by the DDPM paper
During training, we first uniformly sample a random timestep $t \sim \mathrm{Uniform}(\{1, \dots, T\})$ and noise $\epsilon \sim \mathcal{N}(0, I)$, form the noisy image $x_t$ with the closed-form forward process, and take a gradient step on $\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert$.
During sampling, we follow the reverse process: starting from $x_T \sim \mathcal{N}(0, I)$, we iteratively apply the denoising neural network $\epsilon_\theta$ for $t = T, \dots, 1$, adding scaled Gaussian noise at every step except the last (see the sketch below).
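A hedged sketch of the sampling loop under the same assumptions as above (`model` is the trained noise predictor; the schedule tensors come from the earlier snippet):

```python
import math

@torch.no_grad()
def p_sample_loop(model, shape, device="cuda"):
    """Generate images by running the reverse process from t = T-1 down to 0."""
    x = torch.randn(shape, device=device)            # x_T ~ N(0, I)
    for t in reversed(range(T)):
        tb = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, tb)                           # predicted noise epsilon_theta
        beta, a, ab = betas[t].item(), alphas[t].item(), alpha_bars[t].item()
        mean = (x - beta / math.sqrt(1.0 - ab) * eps) / math.sqrt(a)
        if t > 0:
            x = mean + math.sqrt(beta) * torch.randn_like(x)  # sigma_t^2 = beta_t
        else:
            x = mean                                 # no noise on the final step
    return x
```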
For training we used the TinyImageNet (Download) dataset. This dataset consists of 100,000 images across 200 object classes (500 images per class). Each image has 3 channels (RGB) and a width and height of 64, so it can be represented as a tensor of shape (3, 64, 64) in CHW notation (a loading sketch follows below).
Samples from TinyImageNet dataset
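As a rough sketch of how the data can be loaded (the local path is a placeholder; TinyImageNet's train split stores one folder per class, which torchvision's `ImageFolder` can read):

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Lambda(lambda im: im.convert("RGB")), # a few images are grayscale
    transforms.ToTensor(),                           # CHW float tensor in [0, 1]
    transforms.Lambda(lambda x: x * 2.0 - 1.0),      # rescale to [-1, 1] for diffusion
])
# "tiny-imagenet-200/train" is a placeholder path for the extracted dataset.
train_set = datasets.ImageFolder("tiny-imagenet-200/train", transform=transform)
loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)
```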
The evaluations below are partially based on this DDPM implementation.
We trained the U-Net with 4 resolution levels (on each side of the "U") using L1 loss, a batch size of 64, and the Adam optimizer with a learning rate of 0.0002. We used a diffusion horizon of $T$ timesteps (the original DDPM uses $T = 1000$).
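Putting the pieces together, a hedged sketch of the training loop (`UNet` is a placeholder name for our noise-predictor class, whose internals we omit here):

```python
from torch.optim import Adam

model = UNet().to("cuda")              # 4-level U-Net noise predictor (placeholder)
opt = Adam(model.parameters(), lr=2e-4)

for epoch in range(100):
    for x0, _ in loader:               # class labels are unused (unconditional model)
        x0 = x0.to("cuda")
        loss = diffusion_loss(model, x0)
        opt.zero_grad()
        loss.backward()
        opt.step()
```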
To evaluate the model, we computed the FID between the generated images and the training data. Here is what we found (a sketch of the computation follows the table).
| Condition | FID score |
|---|---|
| Random noise | 410.870 |
| Epoch 20 | 117.015 |
| Epoch 50 | 195.967 |
| Epoch 100 | 139.16 |
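One way to compute the FID in PyTorch, assuming the `torchmetrics` package (our choice of tooling for this sketch; the variable names are placeholders):

```python
from torchmetrics.image.fid import FrechetInceptionDistance

# torchmetrics expects uint8 images in [0, 255] with shape (N, 3, H, W) by default.
fid = FrechetInceptionDistance(feature=2048).to("cuda")
fid.update(real_images, real=True)    # real_images: uint8 training images
fid.update(fake_images, real=False)   # fake_images: uint8 generated samples
score = fid.compute()
print(f"FID: {score.item():.3f}")
```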
The figure below shows how the loss plateaus after approximately 500 iterations. With a batch size of 64, each epoch is around 1563 iterations. The loss eventually stabilizes at around 0.12 to 0.18.
The FID suggests that the model is quite unstable: the quality of the generated results varies noticeably between epochs. We suspect this is due to the strong dependence of the generation on the initial noise. Recent research addresses this issue by injecting thematically meaningful information into the diffusion process (known as conditional diffusion) to constrain the generated results.
Here are generated images at different epochs (the generated images in each group are independent):
| Epoch 1 | Epoch 20 |
|---|---|

| Epoch 50 | Epoch 100 |
|---|---|
Finally, here is an animation of the reverse process generating an image: