This repository gives an overview of diffusion models together with an implementation of a state-of-the-art DDPM, as measured by FID score on the ImageNet 256x256 dataset and others. The presentation follows most of the details in the original paper "Diffusion Vision Transformers for Image Generation". Because most of the ideas in the paper build on previous work, there will be a broader discussion of how they are all combined into one single model. I should mention that many of the arguments are empirical (they have been tested, and some ideas give a better score than others), even though we can come up with high-level arguments for why one architecture should be expected to outperform another.
I am writing this introduction to establish a stronger connection with the later arguments about diffusion.
In these models we start by sampling images from an unknown probability distribution (the goal of generative probabilistic models is to approximate this distribution), then gradually add noise to them until they are completely unrelated to the starting examples. This noising procedure (the forward process) can be modeled with Gaussian distributions. In fact, you could choose from a whole family of distributions, but the Gaussian is preferred due to its "nice" properties. In addition, given the complexity of the task, we have no reason to expect our neural network to predict an initial image directly when given complete noise as input. We can alleviate the network's work by introducing a mechanism that indeed adds noise gradually, resulting in a series of intermediate, increasingly noisy images that the network learns to reverse one step at a time.
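As a minimal sketch of the closed-form forward sample (this assumes a precomputed cumulative schedule `alphas_bar`; the names are illustrative, not code from this repo):

```python
import torch

def q_sample(x0, t, alphas_bar):
    """Sample x_t ~ q(x_t | x_0) in closed form.

    x0         : (B, C, H, W) batch of clean images
    t          : (B,) integer timesteps
    alphas_bar : (T,) cumulative products of (1 - beta_t)
    """
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)  # the Gaussian noise the network will learn to predict
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps
    return x_t, eps
```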
The full description is in "Diffusion_model_basic". I would recommend reading those notes and the details below side by side.
There are a few (if not many) questions raised by the discussion above. Firstly, what are the mean and variance of the Gaussian forward process? Do we have to learn both the mean and the variance in the Gaussian reverse process? Secondly, how many steps T should we have? How do we design the noising process such that at step T the sample is essentially pure noise?
Indeed, there are lots of possibilities for designing the forward process, and the choice matters: the image might become very noisy within a few steps, which makes training even harder, or the added noise might be small enough that even we can recognise the initial image after many steps. One example is a linear increase in noise; another is a cosine. Because the cosine is smoother than a line (its signal level decays more slowly), especially at the beginning and at the end of the process, it destroys information more gradually; a sketch of both schedules is given below.
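Here is a minimal sketch of the two schedules; the constants (`beta_start`, `beta_end`, the offset `s`) follow the commonly used DDPM/IDDPM defaults and are illustrative, not values specific to this repo:

```python
import math
import torch

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    # Linear schedule from the original DDPM paper.
    return torch.linspace(beta_start, beta_end, T)

def cosine_alphas_bar(T, s=0.008):
    # Cosine schedule (IDDPM): alpha_bar decays slowly near t = 0 and t = T.
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((steps / T) + s) / (1 + s) * math.pi / 2) ** 2
    return f / f[0]

# The betas can be recovered as 1 - alpha_bar[t] / alpha_bar[t-1],
# clipped (e.g. at 0.999) for numerical stability.
```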
The loss function is the usual one in variational problems: we minimize a variational (upper) bound on the negative log-likelihood, taken over the joint of the reverse and forward processes, which after factorization decouples into a sum of KL divergences.
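For reference, the standard decomposition (in DDPM's notation) reads:

$$L_{\text{vlb}} = \mathbb{E}_q\left[ D_{KL}\!\left(q(x_T \mid x_0) \,\|\, p(x_T)\right) + \sum_{t=2}^{T} D_{KL}\!\left(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\right) - \log p_\theta(x_0 \mid x_1) \right]$$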
In principle, if the model is to be optimized, the changes can be made in the training process, in the sampling process, or in both. In this section we will focus on sampling, while the Vision Transformer adopts a new neural network architecture, which can be thought of as changing both processes. Furthermore, a different family of forward distributions is a possibility whenever the KL divergence admits an exact computation; otherwise the Gaussian remains a great option simply because it scales directly to large-scale training, where time management and efficiency are key. Moreover, when training for very long times the differences might be insignificant.
One of the first papers to suggest improvements to DDPM is IDDPM. These modifications are very natural, in the sense that they represent a first-glance approach to what you would consider testing to check whether the improvements are notable. First of all, they found that changing the noise schedule from linear to cosine, and learning the reverse-process variance instead of fixing it, both improve log-likelihood.
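As a sketch of the learned-variance idea (names are mine, not this repo's): the network outputs a per-pixel coefficient $v$, and the reverse-process variance interpolates in log space between the two analytic extremes.

```python
import torch

def iddpm_variance(v, beta_t, beta_tilde_t):
    # IDDPM's learned variance: v in [0, 1] is predicted by the network,
    # beta_t is the upper bound and beta_tilde_t the lower bound of the
    # reverse-process variance; interpolation happens in log space.
    return torch.exp(v * torch.log(beta_t) + (1.0 - v) * torch.log(beta_tilde_t))
```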
I don't want to dwell on the cosine schedule because we are going to generalize and extend the idea to a family of specific schedules. In particular, the one that achieves the best scores is the Laplace schedule. Everything in this sub-section follows the results in the original paper. In "Scheduler_notes" we see that with an arbitrary noise schedule a new term appears in the ELBO (or variational lower bound), multiplying the previous one inside the expectation. This new term acts as a relative weight for training at step t, and at the same time as a probability distribution over the steps, so that some steps are prioritized over others (for example, noisy samples at the first steps might have a higher probability). In the figure below there is a comparison between Laplace and other noise schedules, showing roughly a 40% improvement over the cosine. It is worth noting that only a few noise schedules were considered in the paper, and others remain an open question. Even so, a 40% boost purely from modifying the SNR is substantial. A sketch of the sampling idea follows below.
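A hedged sketch of how such a schedule could be sampled during training: log-SNR values are drawn from a Laplace distribution via its inverse CDF, concentrating training on mid-range noise levels. The parameters `mu` and `b` below are illustrative placeholders, not the values tuned in the paper.

```python
import torch

def laplace_logsnr(batch_size, mu=0.0, b=0.5):
    # Sample log-SNR values lam ~ Laplace(mu, b) via the inverse CDF.
    # mu and b are illustrative defaults, not the paper's tuned values.
    u = torch.rand(batch_size) - 0.5
    lam = mu - b * torch.sign(u) * torch.log(1 - 2 * torch.abs(u))
    # alpha_bar = SNR / (1 + SNR) with SNR = exp(lam), i.e. sigmoid(lam).
    alpha_bar = torch.sigmoid(lam)
    return lam, alpha_bar
```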
It was remarked in the DDIM paper that the ELBO depends on the forward process only through the marginals ($q(x_{t}|x_{0})$) for each step t, not through the full joint. A non-Markovian forward process with the same marginals therefore yields the same training objective, which is what allows much faster, even deterministic, sampling.
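A minimal sketch of one deterministic DDIM update (eta = 0), assuming a model that predicts the added noise; the names are illustrative:

```python
import torch

@torch.no_grad()
def ddim_step(model, x_t, t, t_prev, alphas_bar):
    # One deterministic DDIM update (eta = 0). `model(x_t, t)` is assumed
    # to predict the noise eps. Because only the marginals matter, t_prev
    # can skip steps, which is what makes DDIM sampling fast.
    a_t, a_prev = alphas_bar[t], alphas_bar[t_prev]
    eps = model(x_t, t)
    x0_pred = (x_t - torch.sqrt(1 - a_t) * eps) / torch.sqrt(a_t)  # predicted clean image
    return torch.sqrt(a_prev) * x0_pred + torch.sqrt(1 - a_prev) * eps
```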
DiffiT exploits transformers in the encoder and decoder of the U-Net. Thus, feature maps "talk to each other" as tokens do in LLMs, and the neural network is forced to learn the underlying relations between the feature maps. Admittedly, transformers are a natural thing to try to incorporate into a new architecture. The novel idea presented is the projection of both time and spatial embeddings into a shared space. As you saw in the first U-Net models, only the time embedding was projected onto the spatial part; now both time and space are projected. The authors named this "Time-dependent Self-Attention", and it is the reason the model is called a "vision transformer". There are 6 linear projections in total, defining time-dependent queries, keys, and values (one spatial and one temporal projection for each).
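A single-head sketch of the idea (my simplification; the paper uses multi-head attention and also adds a relative position bias, both omitted here for brevity):

```python
import torch
import torch.nn as nn

class TimeDependentSelfAttention(nn.Module):
    # Sketch of DiffiT's time-dependent self-attention: queries, keys and
    # values each combine a spatial projection of the tokens with a temporal
    # projection of the time embedding, giving 6 linear maps in total.
    def __init__(self, dim):
        super().__init__()
        self.q_s, self.q_t = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.k_s, self.k_t = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.v_s, self.v_t = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x, t_emb):
        # x: (B, N, dim) spatial tokens; t_emb: (B, dim) time embedding
        t = t_emb.unsqueeze(1)  # broadcast the time embedding over all tokens
        q = self.q_s(x) + self.q_t(t)
        k = self.k_s(x) + self.k_t(t)
        v = self.v_s(x) + self.v_t(t)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v
```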
There is a table with the architecture in the paper, although tables S.2 and S.3 contain some possible typos. The resblock should keep the dimensions of the feature maps unchanged. In S.3 this is the case, but in S.2 the resblock appears to change them, which I suspect is a typo.
If you find this repository useful, please cite the following:
```bibtex
@misc{Bodnar2024DiffiT_Implementation,
  author       = {Bodnar, Andrei},
  title        = {Diffusion_Vision_Transformer-Implementation},
  year         = {2024},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/AndreiB137/Diffusion_Vision_Transformer-Implementation}},
}
```