From f7aaa33523329d53a8ba881fa5ef2a588eb6a8d2 Mon Sep 17 00:00:00 2001
From: liuchuting
Date: Thu, 5 Sep 2024 16:44:01 +0800
Subject: [PATCH 1/2] Update the README of the diffusers SDXL & SD examples.

---
 examples/diffusers/text_to_image/README.md      | 107 +++++++++--
 .../diffusers/text_to_image/README_sdxl.md      | 175 ++++++++++++++++--
 2 files changed, 248 insertions(+), 34 deletions(-)

diff --git a/examples/diffusers/text_to_image/README.md b/examples/diffusers/text_to_image/README.md
index ecdbc6db38..56f07d7d09 100644
--- a/examples/diffusers/text_to_image/README.md
+++ b/examples/diffusers/text_to_image/README.md
@@ -15,6 +15,9 @@ Before running the scripts, make sure to install the library's training dependen
 
 **Important**
 
+The training script is compute-intensive and only runs on an Ascend 910*. Please run the scripts with CANN version [8.0.RC2.beta1](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.RC2.beta1) and MindSpore version [2.3.0](https://www.mindspore.cn/versions#2.3.0). You can run
+`cat /usr/local/Ascend/ascend-toolkit/latest/version.cfg` to check the CANN version; the output should contain the version number [7.3.0.1.231:8.0.RC2]. If CANN is installed in a custom location, look for `version.cfg` under that installation path instead.
+
 To make sure you can successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the install up to date as we update the example scripts frequently and install some example-specific requirements. To do this, execute the following steps in a new virtual environment:
 ```bash
 git clone https://github.com/mindspore-lab/mindone
@@ -41,8 +44,6 @@ huggingface-cli login
 
 If you have already cloned the repo, then you won't need to go through these steps.
 
-<br>
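The Hub setup above relies on `huggingface-cli login`. If you prefer to authenticate from Python instead of the CLI, a minimal sketch using `huggingface_hub` is shown below; the token value is a placeholder for your own access token.

```python
# Programmatic alternative to `huggingface-cli login`.
# "hf_xxx" is a placeholder; create a real access token at https://huggingface.co/settings/tokens.
from huggingface_hub import login

login(token="hf_xxx")  # saves the token locally so the example scripts can download gated models
```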
 #### Hardware
 
 With `gradient_checkpointing` and `mixed_precision` it should be possible to fine tune the model on a single 24GB NPU. For higher `batch_size` and faster training it's better to use NPUs with >30GB memory.
@@ -86,17 +87,47 @@ python train_text_to_image.py \
   --output_dir="sd-your-dataset-model-$(date +%Y%m%d%H%M%S)"
 ```
 
+For parallel training, use `msrun` together with the `--distributed` flag:
+
+```bash
+export MODEL_NAME="CompVis/stable-diffusion-v1-4"
+export TRAIN_DIR="path_to_your_dataset"
+
+msrun --worker_num=8 --local_worker_num=8 --log_dir=$output_dir \
+  train_text_to_image.py \
+  --pretrained_model_name_or_path=$MODEL_NAME \
+  --train_data_dir=$TRAIN_DIR \
+  --resolution=512 --center_crop --random_flip \
+  --train_batch_size=1 \
+  --max_train_steps=15000 \
+  --learning_rate=1e-05 \
+  --max_grad_norm=1 \
+  --mixed_precision="fp16" \
+  --distributed \
+  --lr_scheduler="constant" --lr_warmup_steps=0 \
+  --output_dir="sd-your-dataset-model-$(date +%Y%m%d%H%M%S)"
+```
+
+### Performance
+
+For the training example above, we trained on the OnePiece dataset and recorded the training speed as follows.
+
+| Method | NPUs | Global<br>
Batch size | Resolution | Precision | Graph Compile | Speed
(ms/step) | FPS
(img/s) | +|---------|------|------------------------|--------------|-----------|---------------|----------------------|------------------| +| vanilla | 1 | 1*1 | 512x512 | FP16 | 1~5 mins | 260 | 3.85 | +| vanilla | 8 | 1*8 | 512x512 | FP16 | 1~5 mins | 404 | 19.8 | + Once the training is finished the model will be saved in the `output_dir` specified in the command. In this example it's `sd-onepiece-model`. To load the fine-tuned model for inference just pass that path to `StableDiffusionPipeline` ```python import mindspore as ms from mindone.diffusers import StableDiffusionPipeline -model_path = "path_to_saved_model" +model_path = "sd-onepiece-model" pipe = StableDiffusionPipeline.from_pretrained(model_path, mindspore_dtype=ms.float16) -image = pipe(prompt="a man in a straw hat")[0][0] -image.save("a-man-in-a-straw-hat.png") +image = pipe(prompt="a man with a beard and a shirt")[0][0] +image.save("onepiece.png") ``` Checkpoints only save the unet, so to run inference from a checkpoint, just load the unet @@ -105,15 +136,26 @@ Checkpoints only save the unet, so to run inference from a checkpoint, just load import mindspore as ms from mindone.diffusers import StableDiffusionPipeline, UNet2DConditionModel -model_path = "path_to_saved_model" +model_path = "sd-onepiece-model" unet = UNet2DConditionModel.from_pretrained(model_path + "/checkpoint-/unet", mindspore_dtype=ms.float16) pipe = StableDiffusionPipeline.from_pretrained("", unet=unet, mindspore_dtype=ms.float16) -image = pipe(prompt="a man in a straw hat")[0][0] -image.save("a-man-in-a-straw-hat.png") +image = pipe(prompt="a man with a beard and a shirt")[0][0] +image.save("onepiece.png") ``` +We trained 6k steps based on the OnePiece dataset. Here are some of the results of the fine-tuning. + +| a girl with a mask on her face | a man holding a book | a man holding a sword | a man sitting on top of a flower | +|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------:|:------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------:| +| | | | | + +| a man with a beard and a shirt | a man with a knife in his hand | a smiling woman in a helmet | a woman in a white dress | +|:---------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------------:|:------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------:| +| | | | | + + #### Training with Min-SNR weighting We support training with the Min-SNR weighting strategy proposed in [Efficient Diffusion Training via Min-SNR Weighting Strategy](https://arxiv.org/abs/2303.09556) which helps to 
achieve faster convergence
@@ -142,9 +184,6 @@ In a nutshell, LoRA allows adapting pretrained models by adding pairs of rank-de
 
 [cloneofsimo](https://github.com/cloneofsimo) was the first to try out LoRA training for Stable Diffusion in the popular [lora](https://github.com/cloneofsimo/lora) GitHub repository.
 
-With LoRA, it's possible to fine-tune Stable Diffusion on a custom image-caption pair dataset
-on consumer GPUs like Tesla T4, Tesla V100.
-
 ### Training
 
 First, you need to set up your development environment as is explained in the [installation section](#installing-the-dependencies). Make sure to set the `MODEL_NAME` and `DATASET_NAME` environment variables. Here, we will use [Stable Diffusion v1-4](https://hf.co/CompVis/stable-diffusion-v1-4) and the [OnePiece dataset](https://huggingface.co/datasets/YaYaB/onepiece-blip-captions).
 
@@ -170,32 +209,68 @@ python train_text_to_image_lora.py \
   --output_dir="sd-onepiece-model-lora-$(date +%Y%m%d%H%M%S)"
 ```
 
+For parallel training, use `msrun` together with the `--distributed` flag:
+
+```bash
+export MODEL_NAME="CompVis/stable-diffusion-v1-4"
+export DATASET_NAME="YaYaB/onepiece-blip-captions"
+
+msrun --worker_num=8 --local_worker_num=8 --log_dir=$output_dir \
+  train_text_to_image_lora.py \
+  --pretrained_model_name_or_path=$MODEL_NAME \
+  --dataset_name=$DATASET_NAME \
+  --resolution=512 --center_crop --random_flip \
+  --train_batch_size=1 \
+  --num_train_epochs=100 --checkpointing_steps=5000 \
+  --learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
+  --mixed_precision="fp16" \
+  --seed=42 \
+  --distributed \
+  --validation_prompt="a man in a straw hat" \
+  --output_dir="sd-onepiece-model-lora-$(date +%Y%m%d%H%M%S)"
+```
+
 The above command will also run inference as fine-tuning progresses and log the results to local files.
 
-**___Note: When using LoRA we can use a much higher learning rate compared to non-LoRA fine-tuning. Here we use *1e-4* instead of the usual *1e-5*. Also, by using LoRA, it's possible to run `train_text_to_image_lora.py` in consumer GPUs like T4 or V100.___**
+**___Note: When using LoRA we can use a much higher learning rate compared to non-LoRA fine-tuning. Here we use *1e-4* instead of the usual *1e-5*.___**
 
 The final LoRA embedding weights have been uploaded to [sayakpaul/sd-model-finetuned-lora-t4](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4). **___Note: [The final weights](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4/blob/main/pytorch_lora_weights.bin) are only 3 MB in size, which is orders of magnitudes smaller than the original model.___**
 
 You can check some inference samples that were logged during the course of the fine-tuning process [here](https://wandb.ai/sayakpaul/text2image-fine-tune/runs/q4lc0xsw).
 
+### Performance
+
+For the training example above, we trained on the OnePiece dataset and recorded the training speed as follows.
+
+| Method | NPUs | Global<br>
Batch size | Resolution | Precision | Graph Compile | Speed
(ms/step) | FPS
(img/s) |
+|--------|------|------------------------|--------------|-----------|---------------|----------------------|------------------|
+| lora | 1 | 1*1 | 512x512 | FP16 | 1~5 mins | 200 | 5.00 |
+| lora | 8 | 1*8 | 512x512 | FP16 | 1~5 mins | 231 | 34.63 |
+
 ### Inference
 
-Once you have trained a model using above command, the inference can be done simply using the `StableDiffusionPipeline` after loading the trained LoRA weights. You
+If the LoRA weights you want to use are hosted on the Hugging Face Hub, you can point `model_path` at that repository instead, e.g. `model_path = "sayakpaul/sd-model-finetuned-lora-t4"`. Once you have trained a model using the above command, inference can be done simply with the `StableDiffusionPipeline` after loading the trained LoRA weights. You
 need to pass the `output_dir` for loading the LoRA weights which, in this case, is `sd-onepiece-model-lora`.
 
 ```python
 import mindspore as ms
 from mindone.diffusers import StableDiffusionPipeline
 
-model_path = "sayakpaul/sd-model-finetuned-lora-t4"
+model_path = "sd-onepiece-model-lora"
 pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", mindspore_dtype=ms.float16)
 pipe.load_lora_weights(model_path)
 
-prompt = "A pokemon with green eyes and red legs."
+prompt = "a man in a hat and jacket"
 image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5)[0][0]
-image.save("pokemon.png")
+image.save("onepiece.png")
 ```
 
+We trained for 15k steps on the OnePiece dataset. Here are some results of the LoRA fine-tuning.
+
+| a man in a hat and jacket | a man in a yellow coat | a man with a big smile on his face | a man with a hat and mustache |
+|:---:|:---:|:---:|:---:|
+| | | | |
+
 If you are loading the LoRA parameters from the Hub and if the Hub repository has a `base_model` tag (such as [this](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4/blob/main/README.md?code=true#L4)), then you can do:
 
diff --git a/examples/diffusers/text_to_image/README_sdxl.md b/examples/diffusers/text_to_image/README_sdxl.md
index 0e4c38fb84..c6f3488512 100644
--- a/examples/diffusers/text_to_image/README_sdxl.md
+++ b/examples/diffusers/text_to_image/README_sdxl.md
@@ -12,6 +12,9 @@ Before running the scripts, make sure to install the library's training dependen
 
 **Important**
 
+The training script is compute-intensive and only runs on an Ascend 910*. Please run the scripts with CANN version [8.0.RC2.beta1](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.RC2.beta1) and MindSpore version [2.3.0](https://www.mindspore.cn/versions#2.3.0). You can run
+`cat /usr/local/Ascend/ascend-toolkit/latest/version.cfg` to check the CANN version; the output should contain the version number [7.3.0.1.231:8.0.RC2]. If CANN is installed in a custom location, look for `version.cfg` under that installation path instead.
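As a quick sanity check of the environment described above, the commands below print the CANN and MindSpore versions; the CANN path assumes the default install location, so adjust it if yours differs.

```bash
# Print the CANN toolkit version (default install path; adjust if CANN lives elsewhere)
cat /usr/local/Ascend/ascend-toolkit/latest/version.cfg

# Verify that the MindSpore installation works on this machine
python -c "import mindspore; mindspore.run_check()"
```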
+ To make sure you can successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the install up to date as we update the example scripts frequently and install some example-specific requirements. To do this, execute the following steps in a new virtual environment: ```bash @@ -36,6 +39,7 @@ python train_text_to_image_sdxl.py \ --pretrained_model_name_or_path=$MODEL_NAME \ --pretrained_vae_model_name_or_path=$VAE_NAME \ --dataset_name=$DATASET_NAME \ + --enable_xformers_memory_efficient_attention \ --resolution=512 --center_crop --random_flip \ --proportion_empty_prompts=0.2 \ --train_batch_size=1 \ @@ -47,27 +51,97 @@ python train_text_to_image_sdxl.py \ --output_dir="sdxl-onepiece-model-$(date +%Y%m%d%H%M%S)" ``` +For parallel training, use `msrun` and along with `--distributed`: + +```shell +export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0" +export VAE_NAME="madebyollin/sdxl-vae-fp16-fix" +export DATASET_NAME="YaYaB/onepiece-blip-captions" + +msrun --worker_num=8 --local_worker_num=8 --log_dir=$output_dir \ + train_text_to_image_sdxl.py \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --pretrained_vae_model_name_or_path=$VAE_NAME \ + --dataset_name=$DATASET_NAME \ + --enable_xformers_memory_efficient_attention \ + --resolution=512 --center_crop --random_flip \ + --proportion_empty_prompts=0.2 \ + --train_batch_size=1 \ + --max_train_steps=10000 \ + --learning_rate=1e-06 --lr_scheduler="constant" --lr_warmup_steps=0 \ + --mixed_precision="fp16" \ + --validation_prompt="a man in a green coat holding two swords" --validation_epochs 5 \ + --checkpointing_steps=5000 \ + --distributed \ + --output_dir="sdxl-onepiece-model-$(date +%Y%m%d%H%M%S)" +``` + **Notes**: * The `train_text_to_image_sdxl.py` script pre-computes text embeddings and the VAE encodings and keeps them in memory. While for smaller datasets like [`lambdalabs/pokemon-blip-captions`](https://hf.co/datasets/lambdalabs/pokemon-blip-captions), it might not be a problem, it can definitely lead to memory problems when the script is used on a larger dataset. For those purposes, you would want to serialize these pre-computed representations to disk separately and load them during the fine-tuning process. Refer to [this PR](https://github.com/huggingface/diffusers/pull/4505) for a more in-depth discussion. -* The training script is compute-intensive and only runs on an Ascend 910*. * The training command shown above performs intermediate quality validation in between the training epochs. `--report_to`, `--validation_prompt`, and `--validation_epochs` are the relevant CLI arguments here. * SDXL's VAE is known to suffer from numerical instability issues. This is why we also expose a CLI argument namely `--pretrained_vae_model_name_or_path` that lets you specify the location of a better VAE (such as [this one](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix)). +### Performance + +For the above training example, we record the training speed as follows. + +| Method | NPUs | Global
Batch size | Resolution | Precision | Graph Compile | Speed
(s/step) | FPS
(img/s) |
+|---------|------|------------------------|--------------|-----------|---------------|---------------------|------------------|
+| vanilla | 1 | 1*1 | 512x512 | FP16 | 1~5 mins | 0.720 | 1.39 |
+| vanilla | 8 | 1*8 | 512x512 | FP16 | 1~5 mins | 1.148 | 6.97 |
+
 ### Inference
 
 ```python
 from mindone.diffusers import DiffusionPipeline
 import mindspore
 
-model_path = "you-model-id-goes-here" # <-- change this
+model_path = "stabilityai/stable-diffusion-xl-base-1.0" # <-- replace this with the output_dir of your training run to use your fine-tuned model
 pipe = DiffusionPipeline.from_pretrained(model_path, mindspore_dtype=mindspore.float16)
 
-prompt = "a man in a green coat holding two swords"
+prompt = "The boy rides a horse in space"
+image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5)[0][0]
+image.save("The-boy-rides-a-horse-in-space.png")
+```
+
+To change the pipeline's scheduler, use the `from_config()` method to build a different scheduler from the current `pipe.scheduler.config`:
+
+```python
+from mindone.diffusers import EulerAncestralDiscreteScheduler
+
+pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
 image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5)[0][0]
+image.save("The-boy-rides-a-horse-in-space.png")
+```
+
+Here are some images generated with different schedulers.
+
+| DDIMParallelScheduler<br>
(0.86s/step) | DDIMScheduler
(0.8s/step) | LMSDiscreteScheduler
(0.93s/step) | DPMSolverSinglestepScheduler
(0.83s/step) | +|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:| +| | | | | + +Checkpoints only save the unet, so to run inference from a checkpoint, just load the unet. + +```python +import mindspore as ms +from mindone.diffusers import StableDiffusionXLPipeline, UNet2DConditionModel + +model_path = "sdxl-onepiece-model" +unet = UNet2DConditionModel.from_pretrained(model_path + "/checkpoint-/unet", mindspore_dtype=ms.float16) + +pipe = StableDiffusionXLPipeline.from_pretrained("", unet=unet, mindspore_dtype=ms.float16) + +image = pipe(prompt="a man with a beard and a shirt")[0][0] image.save("onepiece.png") ``` +We trained 10k steps based on the OnePiece dataset. Here are some of the results of the fine-tuning. + +| a man in a blue suit and a green hat | a man with a big mouth | a man with glasses on his face | a man with red hair and a cape | +|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------:| +| | | | | + ## LoRA training example for Stable Diffusion XL (SDXL) Low-Rank Adaption of Large Language Models was first introduced by Microsoft in [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) by *Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen*. @@ -80,16 +154,6 @@ In a nutshell, LoRA allows adapting pretrained models by adding pairs of rank-de [cloneofsimo](https://github.com/cloneofsimo) was the first to try out LoRA training for Stable Diffusion in the popular [lora](https://github.com/cloneofsimo/lora) GitHub repository. -With LoRA, it's possible to fine-tune Stable Diffusion on a custom image-caption pair dataset -on consumer GPUs like Tesla T4, Tesla V100. - -> [!WARNING] -> If you're using mindspore 2.2.x, you have to set the `MS_DEV_TRAVERSE_SUBSTITUTIONS_MODE` environment variables to `1` before running the training commands, -> otherwise you'll get a segmentation fault (core dumped). -> ```bash -> export MS_DEV_TRAVERSE_SUBSTITUTIONS_MODE=1 -> ``` - ### Training First, you need to set up your development environment as is explained in the [installation section](#installing-the-dependencies). Make sure to set the `MODEL_NAME` and `DATASET_NAME` environment variables and, optionally, the `VAE_NAME` variable. 
Here, we will use [Stable Diffusion XL 1.0-base](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) and the [OnePiece dataset](https://huggingface.co/datasets/YaYaB/onepiece-blip-captions). @@ -115,12 +179,45 @@ python train_text_to_image_lora_sdxl.py \ --output_dir="sdxl-onepiece-model-lora-$(date +%Y%m%d%H%M%S)" ``` +For parallel training, use `msrun` and along with `--distributed`: + +```shell +export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0" +export VAE_NAME="madebyollin/sdxl-vae-fp16-fix" +export DATASET_NAME="YaYaB/onepiece-blip-captions" + +msrun --worker_num=8 --local_worker_num=8 --log_dir=$output_dir \ + train_text_to_image_lora_sdxl.py \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --pretrained_vae_model_name_or_path=$VAE_NAME \ + --dataset_name=$DATASET_NAME \ + --resolution=1024 --center_crop --random_flip \ + --train_batch_size=1 \ + --num_train_epochs=2 --checkpointing_steps=500 \ + --learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \ + --mixed_precision="fp16" \ + --seed=42 \ + --validation_prompt="a man in a green coat holding two swords" \ + --distributed \ + --output_dir="sdxl-onepiece-model-lora-$(date +%Y%m%d%H%M%S)" +``` + The above command will also run inference as fine-tuning progresses and log the results to local files. **Notes**: * SDXL's VAE is known to suffer from numerical instability issues. This is why we also expose a CLI argument namely `--pretrained_vae_model_name_or_path` that lets you specify the location of a better VAE (such as [this one](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix)). +### Performance + +For the above training example, we record the training speed as follows. + +| Method | NPUs | Global
Batch size | Resolution | Precision | Graph Compile | Speed
(s/step) | FPS
(img/s) | +|--------|------|------------------------|--------------|-----------|---------------|---------------------|-----------------| +| lora | 1 | 1*1 | 1024x1024 | FP16 | 15~20 min | 0.828 | 1.21 | +| lora | 8 | 1*8 | 1024x1024 | FP16 | 15~20 min | 0.907 | 8.82 | + + ### Finetuning the text encoder and UNet The script also allows you to finetune the `text_encoder` along with the `unet`. @@ -140,23 +237,65 @@ python train_text_to_image_lora_sdxl.py \ --seed=42 \ --validation_prompt="a man in a green coat holding two swords" \ --train_text_encoder \ - --output_dir="sdxl-onepiece-model-lora-txt-$(date +%Y%m%d%H%M%S)" + --output_dir="sdxl-onepiece-model-lora-$(date +%Y%m%d%H%M%S)" +``` + +For parallel training, use `msrun` and along with `--distributed`: + +```shell +msrun --worker_num=8 --local_worker_num=8 --log_dir=$output_dir \ + train_text_to_image_lora_sdxl.py \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --dataset_name=$DATASET_NAME \ + --resolution=1024 --center_crop --random_flip \ + --train_batch_size=1 \ + --num_train_epochs=2 --checkpointing_steps=500 \ + --learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \ + --seed=42 \ + --validation_prompt="a man in a green coat holding two swords" \ + --train_text_encoder \ + --distributed \ + --output_dir="sdxl-onepiece-model-lora-$(date +%Y%m%d%H%M%S)" ``` +### Performance + +For the above training example, we record the training speed as follows. + +| Method | NPUs | Global
Batch size | Resolution | Precision | Graph Compile | Speed
(s/step) | FPS
(img/s) |
+|--------|------|------------------------|--------------|-----------|---------------|---------------------|------------------|
+| lora | 1 | 1*1 | 1024x1024 | FP16 | 15~20 mins | 0.951 | 1.05 |
+| lora | 1 | 1*1 | 1024x1024 | BF16 | 15~20 mins | 0.994 | 1.01 |
+| lora | 1 | 1*1 | 1024x1024 | FP32 | 15~20 mins | 1.89 | 0.53 |
+
 ### Inference
 
-Once you have trained a model using above command, the inference can be done simply using the `DiffusionPipeline` after loading the trained LoRA weights. You
+If the LoRA weights you want to use are hosted on the Hugging Face Hub, you can point `model_path` at that repository instead, e.g. `model_path = "takuoko/sd-pokemon-model-lora-sdxl"`. Once you have trained a model using the above command, inference can be done simply with the `DiffusionPipeline` after loading the trained LoRA weights. You
 need to pass the `output_dir` for loading the LoRA weights which, in this case, is `sdxl-onepiece-model-lora`.
 
 ```python
 import mindspore as ms
 from mindone.diffusers import DiffusionPipeline
 
-model_path = "takuoko/sd-pokemon-model-lora-sdxl"
+model_path = "sdxl-onepiece-model-lora"
 pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16)
 pipe.load_lora_weights(model_path)
 
-prompt = "A pokemon with green eyes and red legs."
+prompt = "a guy with green hair"
 image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5)[0][0]
-image.save("pokemon.png")
+image.save("onepiece.png")
 ```
+
+We trained for 8.5k steps on the OnePiece dataset. Here are some results of the LoRA fine-tuning.
+
+| a cartoon character with a sword | a girl with a mask on her face | a guy with green hair | a lion sitting on the ground |
+|:---:|:---:|:---:|:---:|
+| | | | |
+
+| a man holding a book | a man in a cowboy hat | a man in a hat and jacket | a man in a yellow coat |
+|:---:|:---:|:---:|:---:|
+| | | | |
+
+| a man sitting in a chair | a man with a big beard | a man with green hair and a white shirt | a smiling woman in a helmet |
+|:-----------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------:| +| | | | | From 9e0e7bffa5b0a6fd8ea3a22cd413737952a48630 Mon Sep 17 00:00:00 2001 From: liuchuting Date: Mon, 23 Sep 2024 14:19:10 +0800 Subject: [PATCH 2/2] Complete the readme of the diffusers sdxl. --- examples/diffusers/text_to_image/README.md | 58 ++++---------- .../diffusers/text_to_image/README_sdxl.md | 76 ++----------------- 2 files changed, 21 insertions(+), 113 deletions(-) diff --git a/examples/diffusers/text_to_image/README.md b/examples/diffusers/text_to_image/README.md index 56f07d7d09..ff9c0307e9 100644 --- a/examples/diffusers/text_to_image/README.md +++ b/examples/diffusers/text_to_image/README.md @@ -15,9 +15,6 @@ Before running the scripts, make sure to install the library's training dependen **Important** -The training script is compute-intensive and only runs on an Ascend 910*. Please run the scripts with CANN version ([CANN 8.0.RC2.beta1](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.RC2.beta1)) and MindSpore version ([MS 2.3.0](https://www.mindspore.cn/versions#2.3.0)). You can use -`cat /usr/local/Ascend/ascend-toolkit/latest/version.cfg` check the CANN version and you can see the specific version number [7.3.0.1.231:8.0.RC2]. If you have a custom installation path for CANN, find the `version.cfg` in your own CANN installation path to verify the version. - To make sure you can successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the install up to date as we update the example scripts frequently and install some example-specific requirements. To do this, execute the following steps in a new virtual environment: ```bash git clone https://github.com/mindspore-lab/mindone @@ -30,6 +27,8 @@ Then cd in the example folder `examples/diffusers/text_to_image` and run pip install -r requirements.txt ``` +The training script is compute-intensive and only runs on an Ascend 910*. Please run the scripts with MindSpore version(MS2.3.0). + ### OnePiece example You need to accept the model license before downloading or using the weights. In this example we'll use model version `v1-4`, so you'll need to visit [its card](https://huggingface.co/CompVis/stable-diffusion-v1-4), read the license and tick the checkbox if you agree. @@ -44,6 +43,8 @@ huggingface-cli login If you have already cloned the repo, then you won't need to go through these steps. +
+ #### Hardware With `gradient_checkpointing` and `mixed_precision` it should be possible to fine tune the model on a single 24GB NPU. For higher `batch_size` and faster training it's better to use NPUs with >30GB memory. @@ -108,15 +109,6 @@ msrun --worker_num=8 --local_worker_num=8 --log_dir=$output_dir \ --output_dir="sd-your-dataset-model-$(date +%Y%m%d%H%M%S)" ``` -### Performance - -For the training example above, we trained on the OnePiece dataset and recorded the training speed as follows. - -| Method | NPUs | Global
Batch size | Resolution | Precision | Graph Compile | Speed
(ms/step) | FPS
(img/s) | -|---------|------|------------------------|--------------|-----------|---------------|----------------------|------------------| -| vanilla | 1 | 1*1 | 512x512 | FP16 | 1~5 mins | 260 | 3.85 | -| vanilla | 8 | 1*8 | 512x512 | FP16 | 1~5 mins | 404 | 19.8 | - Once the training is finished the model will be saved in the `output_dir` specified in the command. In this example it's `sd-onepiece-model`. To load the fine-tuned model for inference just pass that path to `StableDiffusionPipeline` ```python @@ -126,8 +118,8 @@ from mindone.diffusers import StableDiffusionPipeline model_path = "sd-onepiece-model" pipe = StableDiffusionPipeline.from_pretrained(model_path, mindspore_dtype=ms.float16) -image = pipe(prompt="a man with a beard and a shirt")[0][0] -image.save("onepiece.png") +image = pipe(prompt="a man in a straw hat")[0][0] +image.save("a-man-in-a-straw-hat.png") ``` Checkpoints only save the unet, so to run inference from a checkpoint, just load the unet @@ -141,21 +133,10 @@ unet = UNet2DConditionModel.from_pretrained(model_path + "/checkpoint-/unet", pipe = StableDiffusionPipeline.from_pretrained("", unet=unet, mindspore_dtype=ms.float16) -image = pipe(prompt="a man with a beard and a shirt")[0][0] -image.save("onepiece.png") +image = pipe(prompt="a man in a straw hat")[0][0] +image.save("a-man-in-a-straw-hat.png") ``` -We trained 6k steps based on the OnePiece dataset. Here are some of the results of the fine-tuning. - -| a girl with a mask on her face | a man holding a book | a man holding a sword | a man sitting on top of a flower | -|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------:|:------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------:| -| | | | | - -| a man with a beard and a shirt | a man with a knife in his hand | a smiling woman in a helmet | a woman in a white dress | -|:---------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------------:|:------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------:| -| | | | | - - #### Training with Min-SNR weighting We support training with the Min-SNR weighting strategy proposed in [Efficient Diffusion Training via Min-SNR Weighting Strategy](https://arxiv.org/abs/2303.09556) which helps to achieve faster convergence @@ -238,15 +219,6 @@ The final LoRA embedding weights have been uploaded to [sayakpaul/sd-model-finet You can check some inference samples that were logged during the course of the fine-tuning process [here](https://wandb.ai/sayakpaul/text2image-fine-tune/runs/q4lc0xsw). 
-### Performance - -For the training example above, we trained on the OnePiece dataset and recorded the training speed as follows. - -| Method | NPUs | Global
Batch size | Resolution | Precision | Graph Compile | Speed
(ms/step) | FPS
(img/s) | -|--------|------|------------------------|--------------|-----------|---------------|----------------------|------------------| -| lora | 1 | 1*1 | 512x512 | FP16 | 1~5 mins | 200 | 5.00 | -| lora | 8 | 1*8 | 512x512 | FP16 | 1~5 mins | 231 | 34.63 | - ### Inference If the LoRA weights you want to use is from huggingface, you can replace the following model_path like `model_path = "sayakpaul/sd-model-finetuned-lora-t4"`. Once you have trained a model using above command, the inference can be done simply using the `StableDiffusionPipeline` after loading the trained LoRA weights. You @@ -260,17 +232,11 @@ model_path = "sd-onepiece-model-lora" pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", mindspore_dtype=ms.float16) pipe.load_lora_weights(model_path) -prompt = "a man in a hat and jacket" +prompt = "a man in a straw hat" image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5)[0][0] -image.save(f"onepiece.png") +image.save(f"a-man-in-a-straw-hat.png") ``` -We trained 15k steps based on the OnePiece dataset. Here are some of the results of the lora fine-tuning. - -| a man in a hat and jacket | a man in a yellow coat | a man with a big smile on his face | a man with a hat and mustache | -|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------:| -| | | | | - If you are loading the LoRA parameters from the Hub and if the Hub repository has a `base_model` tag (such as [this](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4/blob/main/README.md?code=true#L4)), then you can do: @@ -289,3 +255,7 @@ pipe = StableDiffusionPipeline.from_pretrained(base_model_id, mindspore_dtype=ms ## Stable Diffusion XL * We support fine-tuning the UNet shipped in [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) via the `train_text_to_image_sdxl.py` script. Please refer to the docs [here](./README_sdxl.md). + +## Tutorials + +The above training performance and inference results are recorded [here](https://github.com/liuchuting/tutorials/blob/sd_doc/aigc/diffusers/text_to_image/sd_performance_and_inference_results.md). diff --git a/examples/diffusers/text_to_image/README_sdxl.md b/examples/diffusers/text_to_image/README_sdxl.md index c6f3488512..41cfe57150 100644 --- a/examples/diffusers/text_to_image/README_sdxl.md +++ b/examples/diffusers/text_to_image/README_sdxl.md @@ -12,9 +12,6 @@ Before running the scripts, make sure to install the library's training dependen **Important** -The training script is compute-intensive and only runs on an Ascend 910*. Please run the scripts with CANN version ([CANN 8.0.RC2.beta1](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.RC2.beta1)) and MindSpore version ([MS 2.3.0](https://www.mindspore.cn/versions#2.3.0)). You can use -`cat /usr/local/Ascend/ascend-toolkit/latest/version.cfg` check the CANN version and you can see the specific version number [7.3.0.1.231:8.0.RC2]. 
If you have a custom installation path for CANN, find the `version.cfg` in your own CANN installation path to verify the version. - To make sure you can successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the install up to date as we update the example scripts frequently and install some example-specific requirements. To do this, execute the following steps in a new virtual environment: ```bash @@ -28,6 +25,8 @@ Then cd in the `examples/diffusers/text_to_image` folder and run pip install -r requirements_sdxl.txt ``` +The training script is compute-intensive and only runs on an Ascend 910*. Please run the scripts with MindSpore version(MS2.3.0). + ### Training ```bash @@ -82,45 +81,20 @@ msrun --worker_num=8 --local_worker_num=8 --log_dir=$output_dir \ * The training command shown above performs intermediate quality validation in between the training epochs. `--report_to`, `--validation_prompt`, and `--validation_epochs` are the relevant CLI arguments here. * SDXL's VAE is known to suffer from numerical instability issues. This is why we also expose a CLI argument namely `--pretrained_vae_model_name_or_path` that lets you specify the location of a better VAE (such as [this one](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix)). -### Performance - -For the above training example, we record the training speed as follows. - -| Method | NPUs | Global
Batch size | Resolution | Precision | Graph Compile | Speed
(s/step) | FPS
(img/s) | -|---------|------|------------------------|--------------|-----------|---------------|---------------------|------------------| -| vanilla | 1 | 1*1 | 512x512 | FP16 | 1~5 mins | 0.720 | 1.39 | -| vanilla | 8 | 1*8 | 512x512 | FP16 | 1~5 mins | 1.148 | 6.97 | - ### Inference ```python from mindone.diffusers import DiffusionPipeline import mindspore -model_path = "stabilityai/stable-diffusion-xl-base-1.0" # <-- You can modify the model path of your training here. +model_path = "sdxl-onepiece-model" # <-- change this pipe = DiffusionPipeline.from_pretrained(model_path, mindspore_dtype=mindspore.float16) -prompt = "The boy rides a horse in space" +prompt = "a man with a beard and a shirt" image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5)[0][0] -image.save("The-boy-rides-a-horse-in-space.png") -``` - -To change the pipelines scheduler, use the from_config() method to load a different scheduler's pipeline.scheduler.config into the pipeline. - -```python -from mindone.diffusers import EulerAncestralDiscreteScheduler - -pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipeline.scheduler.config) -image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5)[0][0] -image.save("The-boy-rides-a-horse-in-space.png") +image.save("onepiece.png") ``` -Here are some images generated by inference under different Schedulers. - -| DDIMParallelScheduler
(0.86s/step) | DDIMScheduler
(0.8s/step) | LMSDiscreteScheduler
(0.93s/step) | DPMSolverSinglestepScheduler
(0.83s/step) | -|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:| -| | | | | - Checkpoints only save the unet, so to run inference from a checkpoint, just load the unet. ```python @@ -136,12 +110,6 @@ image = pipe(prompt="a man with a beard and a shirt")[0][0] image.save("onepiece.png") ``` -We trained 10k steps based on the OnePiece dataset. Here are some of the results of the fine-tuning. - -| a man in a blue suit and a green hat | a man with a big mouth | a man with glasses on his face | a man with red hair and a cape | -|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------:| -| | | | | - ## LoRA training example for Stable Diffusion XL (SDXL) Low-Rank Adaption of Large Language Models was first introduced by Microsoft in [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) by *Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen*. @@ -208,16 +176,6 @@ The above command will also run inference as fine-tuning progresses and log the * SDXL's VAE is known to suffer from numerical instability issues. This is why we also expose a CLI argument namely `--pretrained_vae_model_name_or_path` that lets you specify the location of a better VAE (such as [this one](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix)). -### Performance - -For the above training example, we record the training speed as follows. - -| Method | NPUs | Global
Batch size | Resolution | Precision | Graph Compile | Speed
(s/step) | FPS
(img/s) | -|--------|------|------------------------|--------------|-----------|---------------|---------------------|-----------------| -| lora | 1 | 1*1 | 1024x1024 | FP16 | 15~20 min | 0.828 | 1.21 | -| lora | 8 | 1*8 | 1024x1024 | FP16 | 15~20 min | 0.907 | 8.82 | - - ### Finetuning the text encoder and UNet The script also allows you to finetune the `text_encoder` along with the `unet`. @@ -258,16 +216,6 @@ msrun --worker_num=8 --local_worker_num=8 --log_dir=$output_dir \ --output_dir="sdxl-onepiece-model-lora-$(date +%Y%m%d%H%M%S)" ``` -### Performance - -For the above training example, we record the training speed as follows. - -| Method | NPUs | Global
Batch size | Resolution | Precision | Graph Compile | Speed
(s/step) | FPS
(img/s) | -|--------|------|------------------------|--------------|-----------|---------------|---------------------|------------------| -| lora | 1 | 1*1 | 1024x1024 | FP16 | 15~20 mins | 0.951 | 1.05 | -| lora | 1 | 1*1 | 1024x1024 | BF16 | 15~20 mins | 0.994 | 1.01 | -| lora | 1 | 1*1 | 1024x1024 | FP32 | 15~20 mins | 1.89 | 0.53 | - ### Inference If the LoRA weights you want to use is from huggingface, you can replace the following model_path like `model_path = "takuoko/sd-pokemon-model-lora-sdxl"`. Once you have trained a model using above command, the inference can be done simply using the `DiffusionPipeline` after loading the trained LoRA weights. You @@ -286,16 +234,6 @@ image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5)[0][0] image.save("onepiece.png") ``` -We trained 8.5k steps based on the OnePiece dataset. Here are some of the results of the lora fine-tuning. - -| a cartoon character with a sword | a girl with a mask on her face | a guy with green hair | a lion sitting on the ground | -|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------------:| -| | | | | - -| a man holding a book | a man in a cowboy hat | a man in a hat and jacket | a man in a yellow coat | -|:-------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------:|:------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------:| -| | | | | +## Tutorials -| a man sitting in a chair | a man with a big beard | a man with green hair and a white shirt | a smiling woman in a helmet | -|:-----------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------:| -| | | | | +The above training performance and inference results are recorded [here](https://github.com/liuchuting/tutorials/blob/sd_doc/aigc/diffusers/text_to_image/sdxl_performance_and_inference_results.md).
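The SDXL LoRA inference example above loads weights from a local `output_dir`. As with the Stable Diffusion example earlier, the same weights can also be pulled from a Hub repository once uploaded; the sketch below assumes a hypothetical repo id `your-username/sdxl-onepiece-model-lora`.

```python
import mindspore as ms
from mindone.diffusers import DiffusionPipeline

# Hypothetical Hub repo id; replace it with the repository holding your uploaded LoRA weights.
lora_model_id = "your-username/sdxl-onepiece-model-lora"

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16)
pipe.load_lora_weights(lora_model_id)

image = pipe("a man in a green coat holding two swords", num_inference_steps=30, guidance_scale=7.5)[0][0]
image.save("sdxl-lora-from-hub.png")
```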