This repository houses my personal summaries and notes on a variety of academic papers/blogs I have read. These summaries are intended to provide a brief overview of the papers' main points, methodologies, findings, and implications, thereby serving as quick references for myself and anyone interested.
- Introduces a generative modeling approach based on a continuous-time diffusion process, offering an alternative to adversarial and maximum-likelihood methods (the standard SDE formulation is sketched after the links below)
- Produces image samples of quality comparable or superior to leading GANs and VAEs
- Provides a theoretical foundation for diffusion models, linking them to other generative techniques
Summary notes
Paper explanation video: Yannic Kilcher
arXiv link
Basic annotated implementation
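As a quick reference for the continuous-time view described above, the usual SDE formulation of diffusion models (my notation, following the standard score-based convention rather than anything specific to this entry) pairs a forward noising process with a reverse-time generative process:

$$
\mathrm{d}\mathbf{x} = f(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}
\qquad \text{(forward noising)}
$$

$$
\mathrm{d}\mathbf{x} = \big[f(\mathbf{x}, t) - g(t)^2\,\nabla_{\mathbf{x}}\log p_t(\mathbf{x})\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}
\qquad \text{(reverse-time generation)}
$$

where the score $\nabla_{\mathbf{x}}\log p_t(\mathbf{x})$ is approximated by a learned network $s_\theta(\mathbf{x}, t)$.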
- Presents DDIMs, which are implicit probabilistic models that can produce high-quality samples 10X to 50X faster (in about 50 steps) than DDPMs
- Generalizes DDPMs via a class of non-Markovian diffusion processes that lead to "short" generative Markov chains, which can simulate image generation in a small number of steps
- The training objective in DDIM is the same as in DDPM, so any pretrained DDPM model can be used with DDIM or other generative processes that generate images in fewer steps (a minimal sampling sketch follows the links below)
Summary notes
arXiv link
Github repo
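Because the DDIM objective matches DDPM's, a pretrained DDPM checkpoint can simply be sampled with a DDIM sampler in far fewer steps. A minimal sketch assuming the Hugging Face diffusers API (the checkpoint name is only an example):

```python
from diffusers import DDPMPipeline, DDIMScheduler

# Load a pretrained DDPM checkpoint (example model id; any DDPM checkpoint should do).
pipe = DDPMPipeline.from_pretrained("google/ddpm-cifar10-32")

# Swap the ~1000-step ancestral DDPM sampler for a deterministic DDIM sampler.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

# Sample in ~50 steps instead of ~1000.
image = pipe(num_inference_steps=50).images[0]
image.save("ddim_sample.png")
```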
- Introduces a textual editing method to semantically edit images in pre-trained text-conditioned diffusion models via Prompt-to-Prompt manipulations
- The approach edits the image while preserving its original composition and addressing the content of the new prompt.
- The key idea is that images can be edited by injecting cross-attention maps during the diffusion process, controlling which pixels attend to which tokens of the prompt text at which diffusion steps (a toy sketch of the injection follows the links below).
Summary notes
arXiv link
Github repo
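The cross-attention injection above can be illustrated with a toy, self-contained PyTorch sketch (not the authors' code): run attention for the source prompt, cache the attention maps, and reuse them when attending to the edited prompt's values so the spatial layout is preserved.

```python
import torch
import torch.nn.functional as F

def cross_attention(q, k, v, attn_override=None):
    """Standard cross-attention; optionally reuse a cached attention map."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    attn = F.softmax(scores, dim=-1) if attn_override is None else attn_override
    return attn @ v, attn

# Toy shapes: 64 image tokens attending to 8 text tokens, feature dim 32.
q = torch.randn(64, 32)                                   # image queries (shared by both runs)
k_src, v_src = torch.randn(8, 32), torch.randn(8, 32)     # source-prompt keys/values
k_edit, v_edit = torch.randn(8, 32), torch.randn(8, 32)   # edited-prompt keys/values

# Pass 1: source prompt, cache the attention map.
_, attn_src = cross_attention(q, k_src, v_src)

# Pass 2: edited prompt, but inject the cached map so "which pixels attend
# to which tokens" stays the same while the token values change.
out_edit, _ = cross_attention(q, k_edit, v_edit, attn_override=attn_src)
```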
- Introduces an accurate inversion scheme for real input images, enabling intuitive and versatile text-based image modification without tuning model weights.
- Achieves near-perfect reconstruction while retaining the rich text-guided editing capabilities of the original model
- The approach consists of two novel ideas: pivotal inversion (using the DDIM inversion trajectory as the anchor noise vector) and null-text optimization (optimizing only the null-text embeddings); the objective is sketched after the links below
Summary notes
arXiv link
Paper walkthrough video: Original author
Github repo
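In symbols, my reading of the two ideas: with $\{z_t^*\}$ the DDIM-inversion trajectory of the input image (the "pivot"), $C$ the prompt embedding, $\bar{z}_t$ the current latent, and $\varnothing_t$ the per-step null-text embedding, only $\varnothing_t$ is optimized so that one guided DDIM step lands back on the pivot trajectory:

$$
\min_{\varnothing_t}\; \big\lVert\, z_{t-1}^{*} - z_{t-1}\!\left(\bar{z}_{t},\, \varnothing_t,\, C\right) \,\big\rVert_2^{2}
$$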
5. Adding Conditional Control to Text-to-Image Diffusion Models, Lvmin Zhang and Maneesh Agrawala et al.
- Adds control to large pre-trained diffusion models, such as Stable Diffusion, by accepting visual input conditions such as edge maps, segmentation masks, depth maps, etc. (a usage sketch follows the links below)
- Learns task-specific conditions in an end-to-end way
- Training is as fast as fine-tuning a diffusion model, and for small datasets (<50k images) it can be trained to produce robust results even on desktop-grade personal GPUs.
- Multiple ControlNets can be combined at inference time to apply multiple visual control conditions
Summary notes
arXiv link
Github repo
HF usage example
ControlNet 1.0 and 1.1 ckpts for SD1.5
Controlnet SDXL ckpts
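A minimal diffusers sketch of the workflow with canny-edge conditioning (the model ids are the commonly used public checkpoints; adjust as needed):

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Build a canny edge map from any input image to use as the visual condition.
image = np.array(Image.open("input.png").convert("RGB"))
edges = cv2.Canny(image, 100, 200)
cond = Image.fromarray(np.stack([edges] * 3, axis=-1))

# Attach a pretrained ControlNet to a frozen Stable Diffusion backbone.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# The edge map steers the layout; the prompt steers content and style.
result = pipe("a futuristic city at night", image=cond, num_inference_steps=30).images[0]
result.save("controlnet_out.png")
```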
- An image-and-pose conditioned diffusion method based upon Stable Diffusion to turn fashion photographs into realistic, animated videos
- Introduces a pose conditioning approach that greatly improves temporal consistency across frames
- Uses CLIP image and VAE encoders instead of a text encoder, which increases output fidelity to the conditioning image
Summary notes
arXiv link
Github repo
- Introduces an enhanced Stable Diffusion model that surpasses the generative capabilities of previous versions
- Uses a larger UNet backbone and introduces novel conditioning schemes in the training stage
- Probably the best open-source text-to-image model at the moment (Aug 2023); a minimal usage sketch follows the links below
Summary notes
arXiv link
Paper walkthrough video: Two Minute Papers
HF usage example
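For quick reference, generating with the SDXL base model through diffusers looks roughly like this (minimal sketch; the optional refiner stage is omitted):

```python
import torch
from diffusers import StableDiffusionXLPipeline

# SDXL base model; fp16 keeps memory manageable on consumer GPUs.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
).to("cuda")

image = pipe("an astronaut riding a horse, detailed oil painting").images[0]
image.save("sdxl_out.png")
```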
- Directly sampling an image at a resolution beyond the training image sizes of pre-trained diffusion models usually results in severe object repetition issues and unreasonable object structures.
- The paper explores the use of pre-trained diffusion models to generate images at resolutions higher than the models were trained on, specifically targeting arbitrary aspect ratios and higher resolutions.
Summary notes
arXiv link
Project page
Github repo
- Allows for precise control of concepts in diffusion models
Summary notes
arXiv link
Project page
Github repo
XL sliders (LoRA)
- ZipLoRA seamlessly merges independently trained style and subject LoRAs, generating any subject in any style with sufficiently powerful diffusion models such as SDXL (a toy sketch of the merge follows the links below)
- It offers a streamlined, cheap, and hyperparameter-free solution for simultaneous subject and style personalization, unlocking a new level of creative controllability for diffusion models.
Summary notes
arXiv link
Project page
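As I understand it, the merge boils down to learning per-column coefficients for the two LoRA weight deltas so the merged columns do not interfere. A toy single-layer sketch of that idea (not the authors' implementation; shapes, loss terms, and weights are made up):

```python
import torch

d_out, d_in = 320, 320
dW_subject = torch.randn(d_out, d_in) * 0.01   # weight delta from the subject LoRA
dW_style = torch.randn(d_out, d_in) * 0.01     # weight delta from the style LoRA

# Learnable per-column merger coefficients (one scalar per input column).
m1 = torch.nn.Parameter(torch.ones(d_in))
m2 = torch.nn.Parameter(torch.ones(d_in))
opt = torch.optim.Adam([m1, m2], lr=1e-2)

for _ in range(200):
    merged = dW_subject * m1 + dW_style * m2
    # Crude preservation term: the merged delta should still behave like each
    # original delta, plus a penalty discouraging column-wise overlap between
    # the two scaled deltas (cosine similarity per column).
    preserve = (merged - dW_subject).pow(2).mean() + (merged - dW_style).pow(2).mean()
    overlap = torch.cosine_similarity(dW_subject * m1, dW_style * m2, dim=0).abs().mean()
    loss = preserve + 0.1 * overlap
    opt.zero_grad()
    loss.backward()
    opt.step()
```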
- DemoFusion focuses on producing high-resolution images, generating at 4X, 16X, and even higher resolutions without any fine-tuning or prohibitive memory demands.
Summary notes
arXiv link
Project page
Github repo
- Introduces the Transformer model, which relies solely on attention mechanisms for sequence modelling and transduction tasks, dispensing with recurrence and convolutions entirely (a compact attention sketch follows the links below)
- A breakthrough paper that has led to major advances in NLP, CV, and multi-modal machine learning
Summary notes
arXiv link
Paper explanation video: Yannic Kilcher
Annotated Implementation
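The core operation the bullets refer to is scaled dot-product attention; a compact PyTorch sketch:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Toy example: batch of 2 sequences, 5 tokens each, model dim 16 (self-attention).
q = k = v = torch.randn(2, 5, 16)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 5, 16])
```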
- Proposes a novel solution to image blending, particularly for the problem of hairstyle transfer, based on GAN-inversion
- Introduces a latent space for image blending that is better at preserving detail and encoding spatial information (a toy blending sketch follows the links below)
- Explains a new GAN-embedding algorithm that can slightly modify images to conform to a common segmentation mask
Summary notes
arXiv link
Github repo
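A toy illustration of the masked blending idea (not the paper's actual FS latent space; just a mask-weighted combination of two aligned feature maps with made-up shapes):

```python
import torch

# Hypothetical aligned GAN feature maps for the face image and the hair reference.
feat_face = torch.randn(1, 512, 32, 32)
feat_hair = torch.randn(1, 512, 32, 32)

# Common segmentation mask: 1 where hair should come from the reference image.
hair_mask = (torch.rand(1, 1, 32, 32) > 0.5).float()

# Blend spatially: take hair features inside the mask, face features elsewhere.
blended = hair_mask * feat_hair + (1.0 - hair_mask) * feat_face
```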