StableCascade image generation (#1979)

CVS-139009
openvinotoolkit · May 1, 2024 · b3ef8e9
1 parent 863dc2d
commit b3ef8e9

Showing 7 changed files with 859 additions and 0 deletions.
diff --git a/.ci/ignore_treon_docker.txt b/.ci/ignore_treon_docker.txt
@@ -58,3 +58,4 @@ llm-rag-langchain
 stable-video-diffusion
 llm-agent-langchain
 hello-npu
+stable-cascade-image-generation
diff --git a/.ci/ignore_treon_linux.txt b/.ci/ignore_treon_linux.txt
@@ -58,3 +58,4 @@ llm-rag-langchain
 stable-video-diffusion
 llm-agent-langchain
 hello-npu
+stable-cascade-image-generation
diff --git a/.ci/ignore_treon_mac.txt b/.ci/ignore_treon_mac.txt
@@ -59,3 +59,4 @@ llm-rag-langchain
 stable-video-diffusion
 llm-agent-langchain
 hello-npu
+stable-cascade-image-generation
diff --git a/.ci/ignore_treon_win.txt b/.ci/ignore_treon_win.txt
@@ -55,3 +55,4 @@ llm-rag-langchain
 stable-video-diffusion
 llm-agent-langchain
 hello-npu
+stable-cascade-image-generation
diff --git a/.ci/spellcheck/.pyspelling.wordlist.txt b/.ci/spellcheck/.pyspelling.wordlist.txt
@@ -705,6 +705,7 @@ SRT
 SSD
 SSDLite
 sst
+StableCascade
 StableDiffusionInpaintPipeline
 StableDiffusionPipeline
 StableDiffusionImg

diff --git a/notebooks/stable-cascade-image-generation/README.md b/notebooks/stable-cascade-image-generation/README.md
@@ -0,0 +1,27 @@
+# Image generation with StableCascade and OpenVINO
+
+<img src="https://huggingface.co/stabilityai/stable-cascade/resolve/main/figures/collage_1.jpg" />
+
+[Stable Cascade](https://huggingface.co/stabilityai/stable-cascade) is built upon the [Würstchen](https://openreview.net/forum?id=gU58d5QeGv) architecture. Its main difference from other models such as Stable Diffusion is that it works in a much smaller latent space. Why is this important? The smaller the latent space, the faster inference runs and the cheaper training becomes. How small is the latent space? Stable Diffusion uses a compression factor of 8, so a 1024x1024 image is encoded to 128x128. Stable Cascade achieves a compression factor of 42, meaning that a 1024x1024 image can be encoded to 24x24 while maintaining crisp reconstructions. The text-conditional model is then trained in this highly compressed latent space.
+
+The notebook provides a simple interface for interacting with the model through text instructions. In this demonstration, the user provides a text prompt and the model generates an image. An additional section demonstrates how to use weight compression with [NNCF](https://docs.openvino.ai/2024/openvino-workflow/model-optimization-guide/weight-compression.html#compress-model-weights) to speed up the pipeline and reduce memory consumption.
+
+>**Note**: This demonstration may require about 50 GB of RAM for model conversion and inference.
+
+## Notebook contents
+This tutorial consists of the following steps:
+- Prerequisites
+- Load the original model
+    - Infer the original model
+- Convert the model to OpenVINO IR
+    - Prior pipeline
+    - Decoder pipeline
+- Compiling models
+- Building the pipeline
+- Inference
+- Interactive inference
+
+## Installation instructions
+This is a self-contained example that relies solely on its own code.<br>
+We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
+For details, please refer to the [Installation Guide](../../README.md).
diff --git a/notebooks/stable-cascade-image-generation/stable-cascade-image-generation.ipynb b/notebooks/stable-cascade-image-generation/stable-cascade-image-generation.ipynb