Update the readme of the diffusers SDXL & SD.
liuchuting committed Sep 10, 2024
1 parent a4df8a0 commit 01c2f01
Showing 2 changed files with 225 additions and 34 deletions.
105 changes: 90 additions & 15 deletions examples/diffusers/text_to_image/README.md
Before running the scripts, make sure to install the library's training dependencies:

**Important**

The training script is compute-intensive and only runs on Ascend 910* hardware. Please run the scripts with CANN version ([CANN 8.0.RC2.beta1](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.RC2.beta1)) and MindSpore version ([MS 2.3.0](https://www.mindspore.cn/versions#2.3.0)). You can run `cat /usr/local/Ascend/ascend-toolkit/latest/version.cfg` to check the CANN version; the output should contain the version number [7.3.0.1.231:8.0.RC2]. If you have installed CANN in a custom path, check the `version.cfg` under that installation path instead.
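
For a quick sanity check of both requirements (the CANN path below is the default install location; adjust it if yours differs):

```bash
# CANN: the output should contain the version tag 7.3.0.1.231:8.0.RC2.
cat /usr/local/Ascend/ascend-toolkit/latest/version.cfg

# MindSpore: this should print 2.3.0.
python -c "import mindspore; print(mindspore.__version__)"
```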

To make sure you can successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the install up to date as we update the example scripts frequently and install some example-specific requirements. To do this, execute the following steps in a new virtual environment:
```bash
git clone https://github.com/mindspore-lab/mindone
```

If you have already cloned the repo, then you won't need to go through these steps.


#### Hardware

With `gradient_checkpointing` and `mixed_precision`, it should be possible to fine-tune the model on a single 24GB NPU. For a higher `batch_size` and faster training, it's better to use NPUs with more than 30GB of memory.
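
As a rough sketch of such a low-memory run (the `--gradient_checkpointing` and `--mixed_precision` flag names follow the upstream diffusers trainer this script is ported from; the dataset path and output name are placeholders):

```bash
python train_text_to_image.py \
--pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" \
--train_data_dir="path_to_your_dataset" \
--resolution=512 \
--train_batch_size=1 \
--gradient_checkpointing \
--mixed_precision="fp16" \
--max_train_steps=100 \
--output_dir="sd-low-memory-test"
```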

To train on a dataset stored locally, point `TRAIN_DIR` at it and run:

```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export TRAIN_DIR="path_to_your_dataset"

python train_text_to_image.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--train_data_dir=$TRAIN_DIR \
--resolution=512 --center_crop --random_flip \
--train_batch_size=1 \
--max_train_steps=15000 \
--learning_rate=1e-05 \
--max_grad_norm=1 \
--mixed_precision="fp16" \
--lr_scheduler="constant" --lr_warmup_steps=0 \
--output_dir="sd-your-dataset-model-$(date +%Y%m%d%H%M%S)"
```

For parallel training, launch the script with `msrun` and pass the `--distributed` flag:

```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export TRAIN_DIR="path_to_your_dataset"

msrun --worker_num=8 --local_worker_num=8 --log_dir=$output_dir \
train_text_to_image.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--train_data_dir=$TRAIN_DIR \
--resolution=512 --center_crop --random_flip \
--train_batch_size=1 \
--max_train_steps=15000 \
--learning_rate=1e-05 \
--max_grad_norm=1 \
--mixed_precision="fp16" \
--distributed \
--lr_scheduler="constant" --lr_warmup_steps=0 \
--output_dir="sd-your-dataset-model-$(date +%Y%m%d%H%M%S)"
```
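
Note that `$output_dir` in the `msrun` command is not defined by the exports above; point it at any writable directory for the per-worker logs before launching, for example:

```bash
# Illustrative name and path: msrun writes its per-worker log files here.
export output_dir="./msrun_logs"
```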

### Performance

For the training example above, we trained on the OnePiece dataset and recorded the training speed as follows.

| Method | NPUs | Global <br/>Batch size | Resolution | Precision | Graph Compile | Speed <br/>(ms/step) | FPS <br/>(img/s) |
|---------|------|------------------------|--------------|-----------|---------------|---------------------|------------------|
| vanilla | 1 | 1*1 | 512x512 | FP16 | 1~5 mins | 260 | 3.85 |
| vanilla | 8 | 1*8 | 512x512 | FP16 | 1~5 mins | 404 | 19.8 |
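
For reference, the FPS column is simply the global batch size divided by the step time (converted from milliseconds to seconds):

```python
# Sanity check of the FPS column: global batch size / step time in seconds.
print(1 / 0.260)  # ~3.85 img/s for 1 NPU at 260 ms/step
print(8 / 0.404)  # ~19.8 img/s for 8 NPUs at 404 ms/step
```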

Once the training is finished, the model will be saved in the `output_dir` specified in the command; in this example it is `sd-onepiece-model`. To load the fine-tuned model for inference, just pass that path to `StableDiffusionPipeline`:

```python
import mindspore as ms
from mindone.diffusers import StableDiffusionPipeline

model_path = "path_to_saved_model"
model_path = "sd-onepiece-model"
pipe = StableDiffusionPipeline.from_pretrained(model_path, mindspore_dtype=ms.float16)

image = pipe(prompt="a man with a beard and a shirt")[0][0]
image.save("onepiece.png")
```
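
Continuing from the snippet above, several prompts can be generated in one pass over a list (the prompts and file names are just illustrative):

```python
# Reuses `pipe` from the previous snippet.
prompts = ["a man in a straw hat", "a woman in a white dress"]
for idx, prompt in enumerate(prompts):
    image = pipe(prompt=prompt)[0][0]
    image.save(f"onepiece_{idx}.png")
```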

Checkpoints only save the `unet`, so to run inference from a checkpoint, just load the `unet`:

```python
import mindspore as ms
from mindone.diffusers import StableDiffusionPipeline, UNet2DConditionModel

model_path = "path_to_saved_model"
model_path = "sd-onepiece-model"
unet = UNet2DConditionModel.from_pretrained(model_path + "/checkpoint-<N>/unet", mindspore_dtype=ms.float16)

pipe = StableDiffusionPipeline.from_pretrained("<initial model>", unet=unet, mindspore_dtype=ms.float16)

image = pipe(prompt="a man with a beard and a shirt")[0][0]
image.save("onepiece.png")
```
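
Here `<N>` is the global step at which the checkpoint was saved. A quick way to see which checkpoints exist (a minimal sketch, assuming the default `checkpoint-<step>` folder naming):

```python
import os

model_path = "sd-onepiece-model"
checkpoints = [d for d in os.listdir(model_path) if d.startswith("checkpoint-")]
print(sorted(checkpoints, key=lambda name: int(name.split("-")[-1])))
```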

We trained for 6k steps on the OnePiece dataset. Here are some of the results of the fine-tuning.

| a girl with a mask on her face | a man holding a book | a man holding a sword | a man sitting on top of a flower |
|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------:|:------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| <img src="https://github.com/liuchuting/mindone/blob/image/examples/diffusers/text_to_image/images/sd_base_infer/a_girl_with_a_mask_on_her_face.png?raw=true" width=224> | <img src="https://github.com/liuchuting/mindone/blob/image/examples/diffusers/text_to_image/images/sd_base_infer/a_man_holding_a_book.png" width=224> | <img src="https://github.com/liuchuting/mindone/blob/image/examples/diffusers/text_to_image/images/sd_base_infer/a_man_holding_a_sword.png" width=224> | <img src="https://github.com/liuchuting/mindone/blob/image/examples/diffusers/text_to_image/images/sd_base_infer/a_man_sitting_on_top_of_a_flower.png" width=224> |

| a man with a beard and a shirt | a man with a knife in his hand | a smiling woman in a helmet | a woman in a white dress |
|:---------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------------:|:------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------:|
| <img src="https://github.com/liuchuting/mindone/blob/image/examples/diffusers/text_to_image/images/sd_base_infer/a_man_with_a_beard_and_a_shirt.png" width=224> | <img src="https://github.com/liuchuting/mindone/blob/image/examples/diffusers/text_to_image/images/sd_base_infer/a_man_with_a_knife_in_his_hand.png" width=224> | <img src="https://github.com/liuchuting/mindone/blob/image/examples/diffusers/text_to_image/images/sd_base_infer/a_smiling_woman_in_a_helmet.png" width=224> | <img src="https://github.com/liuchuting/mindone/blob/image/examples/diffusers/text_to_image/images/sd_base_infer/a_woman_in_a_white_dress.png" width=224> |


#### Training with Min-SNR weighting

We support training with the Min-SNR weighting strategy proposed in [Efficient Diffusion Training via Min-SNR Weighting Strategy](https://arxiv.org/abs/2303.09556), which helps to achieve faster convergence by rebalancing the loss.
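
Conceptually, Min-SNR weighting clips each timestep's loss weight at a threshold γ so that easy, high-SNR timesteps no longer dominate the objective. A minimal sketch of the per-timestep weight used for ε-prediction (γ = 5.0 is the value recommended in the paper):

```python
import numpy as np

def min_snr_weight(snr, gamma=5.0):
    # Weight applied to the per-timestep MSE loss: min(SNR, gamma) / SNR.
    # Low-SNR timesteps keep weight 1.0; high-SNR (easy) ones are down-weighted.
    return np.minimum(snr, gamma) / snr

print(min_snr_weight(np.array([0.1, 1.0, 5.0, 50.0])))  # -> [1.  1.  1.  0.1]
```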

## Training with LoRA

In a nutshell, LoRA allows adapting pretrained models by adding pairs of rank-decomposition matrices to existing weights and only training those newly added weights.

[cloneofsimo](https://github.com/cloneofsimo) was the first to try out LoRA training for Stable Diffusion in the popular [lora](https://github.com/cloneofsimo/lora) GitHub repository.

### Training

First, you need to set up your development environment as is explained in the [installation section](#installing-the-dependencies). Make sure to set the `MODEL_NAME` and `DATASET_NAME` environment variables. Here, we will use [Stable Diffusion v1-4](https://hf.co/CompVis/stable-diffusion-v1-4) and the [OnePiece dataset](https://huggingface.co/datasets/YaYaB/onepiece-blip-captions).

Then launch the LoRA fine-tuning with:

```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATASET_NAME="YaYaB/onepiece-blip-captions"

python train_text_to_image_lora.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$DATASET_NAME \
--resolution=512 --center_crop --random_flip \
--train_batch_size=1 \
--num_train_epochs=100 --checkpointing_steps=5000 \
--learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
--mixed_precision="fp16" \
--seed=42 \
--validation_prompt="a man in a straw hat" \
--output_dir="sd-onepiece-model-lora-$(date +%Y%m%d%H%M%S)"
```

For parallel training, launch the script with `msrun` and pass the `--distributed` flag:

```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATASET_NAME="YaYaB/onepiece-blip-captions"

msrun --worker_num=8 --local_worker_num=8 --log_dir=$output_dir \
train_text_to_image_lora.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$DATASET_NAME \
--resolution=512 --center_crop --random_flip \
--train_batch_size=1 \
--num_train_epochs=100 --checkpointing_steps=5000 \
--learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
--mixed_precision="fp16" \
--seed=42 \
--distributed \
--validation_prompt="a man in a straw hat" \
--output_dir="sd-onepiece-model-lora-$(date +%Y%m%d%H%M%S)"
```

The above command will also run inference as fine-tuning progresses and log the results to local files.

**___Note: When using LoRA we can use a much higher learning rate compared to non-LoRA fine-tuning. Here we use *1e-4* instead of the usual *1e-5*. Also, by using LoRA, it's possible to run `train_text_to_image_lora.py` on consumer GPUs like T4 or V100.___**
The final LoRA embedding weights have been uploaded to [sayakpaul/sd-model-finetuned-lora-t4](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4).

You can check some inference samples that were logged during the course of the fine-tuning process [here](https://wandb.ai/sayakpaul/text2image-fine-tune/runs/q4lc0xsw).

### Performance

For the training example above, we trained on the OnePiece dataset and recorded the training speed as follows.

| Method | NPUs | Global <br/>Batch size | Resolution | Precision | Graph Compile | Speed <br/>(ms/step) | FPS <br/>(img/s) |
|--------|------|------------------------|--------------|-----------|---------------|---------------------|------------------|
| lora | 1 | 1*1 | 512x512 | FP16 | 1~5 mins | 200 | 5.00 |
| lora | 8 | 1*8 | 512x512 | FP16 | 1~5 mins | 231 | 34.63 |

### Inference

Once you have trained a model using the above command, inference can be done simply with the `StableDiffusionPipeline` after loading the trained LoRA weights. You need to pass the `output_dir` used when training the LoRA weights, which in this case is `sd-onepiece-model-lora`. If the LoRA weights you want to use are hosted on the Hugging Face Hub, set `model_path` to that repo id instead, for example `model_path = "sayakpaul/sd-model-finetuned-lora-t4"`.

```python
import mindspore as ms
from mindone.diffusers import StableDiffusionPipeline

model_path = "sayakpaul/sd-model-finetuned-lora-t4"
model_path = "sd-onepiece-model-lora"
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", mindspore_dtype=ms.float16)
pipe.load_lora_weights(model_path)

prompt = "A pokemon with green eyes and red legs."
prompt = "a man in a hat and jacket"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5)[0][0]
image.save("pokemon.png")
image.save(f"onepiece.png")
```
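
If the adapter effect is too strong or too weak, its contribution can be scaled at inference time. The snippet below assumes the mindone port mirrors the upstream diffusers LoRA-scale keyword, so treat it as a sketch rather than a guaranteed API:

```python
# Continuing from the snippet above: apply the LoRA at half strength.
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5,
             cross_attention_kwargs={"scale": 0.5})[0][0]
image.save("onepiece_half_lora.png")
```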

We trained for 15k steps on the OnePiece dataset. Here are some of the results of the LoRA fine-tuning.

| a man in a hat and jacket | a man in a yellow coat | a man with a big smile on his face | a man with a hat and mustache |
|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| <img src="https://github.com/liuchuting/mindone/blob/image/examples/diffusers/text_to_image/images/sd_lora_infer/a_man_in_a_hat_and_jacket.png?raw=true" width=224> | <img src="https://github.com/liuchuting/mindone/blob/image/examples/diffusers/text_to_image/images/sd_lora_infer/a_man_in_a_yellow_coat.png" width=224> | <img src="https://github.com/liuchuting/mindone/blob/image/examples/diffusers/text_to_image/images/sd_lora_infer/a_man_with_a_big_smile_on_his_face.png" width=224> | <img src="https://github.com/liuchuting/mindone/blob/image/examples/diffusers/text_to_image/images/sd_lora_infer/a_man_with_a_hat_and_mustache.png" width=224> |

If you are loading the LoRA parameters from the Hub and if the Hub repository has
a `base_model` tag (such as [this](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4/blob/main/README.md?code=true#L4)), then
you can do:
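
(The snippet below is a sketch adapted from the upstream diffusers example; it assumes the repo card exposes a `base_model` field and that `huggingface_hub` is installed.)

```python
import mindspore as ms
from huggingface_hub.repocard import RepoCard
from mindone.diffusers import StableDiffusionPipeline

lora_model_id = "sayakpaul/sd-model-finetuned-lora-t4"
base_model_id = RepoCard.load(lora_model_id).data.to_dict()["base_model"]

pipe = StableDiffusionPipeline.from_pretrained(base_model_id, mindspore_dtype=ms.float16)
pipe.load_lora_weights(lora_model_id)

image = pipe("a man in a straw hat", num_inference_steps=30, guidance_scale=7.5)[0][0]
image.save("onepiece_from_hub.png")
```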