From f7aaa33523329d53a8ba881fa5ef2a588eb6a8d2 Mon Sep 17 00:00:00 2001
From: liuchuting
Date: Thu, 5 Sep 2024 16:44:01 +0800
Subject: [PATCH 1/2] Update the README of the diffusers SDXL & SD examples.

---
 examples/diffusers/text_to_image/README.md      | 107 +++++++++--
 .../diffusers/text_to_image/README_sdxl.md      | 175 ++++++++++++++++--
 2 files changed, 248 insertions(+), 34 deletions(-)

diff --git a/examples/diffusers/text_to_image/README.md b/examples/diffusers/text_to_image/README.md
index ecdbc6db38..56f07d7d09 100644
--- a/examples/diffusers/text_to_image/README.md
+++ b/examples/diffusers/text_to_image/README.md
@@ -15,6 +15,9 @@ Before running the scripts, make sure to install the library's training dependen
 
 **Important**
 
+The training script is compute-intensive and only runs on an Ascend 910*. Please run the scripts with CANN version [8.0.RC2.beta1](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.RC2.beta1) and MindSpore version [2.3.0](https://www.mindspore.cn/versions#2.3.0). You can run
+`cat /usr/local/Ascend/ascend-toolkit/latest/version.cfg` to check the CANN version; the output should contain the version number [7.3.0.1.231:8.0.RC2]. If CANN is installed in a custom location, look for `version.cfg` under that installation path instead.
+
 To make sure you can successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the install up to date as we update the example scripts frequently and install some example-specific requirements. To do this, execute the following steps in a new virtual environment:
 ```bash
 git clone https://github.com/mindspore-lab/mindone
@@ -41,8 +44,6 @@ huggingface-cli login
 
 If you have already cloned the repo, then you won't need to go through these steps.
 
-<br>
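The Hub setup above relies on `huggingface-cli login`. If you prefer to authenticate from Python instead of the CLI, a minimal sketch using `huggingface_hub` is shown below; the token value is a placeholder for your own access token.

```python
# Programmatic alternative to `huggingface-cli login`.
# "hf_xxx" is a placeholder; create a real access token at https://huggingface.co/settings/tokens.
from huggingface_hub import login

login(token="hf_xxx")  # saves the token locally so the example scripts can download gated models
```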
 #### Hardware
 
 With `gradient_checkpointing` and `mixed_precision` it should be possible to fine tune the model on a single 24GB NPU. For higher `batch_size` and faster training it's better to use NPUs with >30GB memory.
@@ -86,17 +87,47 @@ python train_text_to_image.py \
   --output_dir="sd-your-dataset-model-$(date +%Y%m%d%H%M%S)"
 ```
 
+For parallel training, use `msrun` together with the `--distributed` flag:
+
+```bash
+export MODEL_NAME="CompVis/stable-diffusion-v1-4"
+export TRAIN_DIR="path_to_your_dataset"
+
+msrun --worker_num=8 --local_worker_num=8 --log_dir=$output_dir \
+  train_text_to_image.py \
+  --pretrained_model_name_or_path=$MODEL_NAME \
+  --train_data_dir=$TRAIN_DIR \
+  --resolution=512 --center_crop --random_flip \
+  --train_batch_size=1 \
+  --max_train_steps=15000 \
+  --learning_rate=1e-05 \
+  --max_grad_norm=1 \
+  --mixed_precision="fp16" \
+  --distributed \
+  --lr_scheduler="constant" --lr_warmup_steps=0 \
+  --output_dir="sd-your-dataset-model-$(date +%Y%m%d%H%M%S)"
+```
+
+### Performance
+
+For the training example above, we trained on the OnePiece dataset and recorded the training speed as follows.
+
+| Method | NPUs | Global<br>
Batch size | Resolution | Precision | Graph Compile | Speed
(ms/step) | FPS
(img/s) | +|---------|------|------------------------|--------------|-----------|---------------|----------------------|------------------| +| vanilla | 1 | 1*1 | 512x512 | FP16 | 1~5 mins | 260 | 3.85 | +| vanilla | 8 | 1*8 | 512x512 | FP16 | 1~5 mins | 404 | 19.8 | + Once the training is finished the model will be saved in the `output_dir` specified in the command. In this example it's `sd-onepiece-model`. To load the fine-tuned model for inference just pass that path to `StableDiffusionPipeline` ```python import mindspore as ms from mindone.diffusers import StableDiffusionPipeline -model_path = "path_to_saved_model" +model_path = "sd-onepiece-model" pipe = StableDiffusionPipeline.from_pretrained(model_path, mindspore_dtype=ms.float16) -image = pipe(prompt="a man in a straw hat")[0][0] -image.save("a-man-in-a-straw-hat.png") +image = pipe(prompt="a man with a beard and a shirt")[0][0] +image.save("onepiece.png") ``` Checkpoints only save the unet, so to run inference from a checkpoint, just load the unet @@ -105,15 +136,26 @@ Checkpoints only save the unet, so to run inference from a checkpoint, just load import mindspore as ms from mindone.diffusers import StableDiffusionPipeline, UNet2DConditionModel -model_path = "path_to_saved_model" +model_path = "sd-onepiece-model" unet = UNet2DConditionModel.from_pretrained(model_path + "/checkpoint-/unet", mindspore_dtype=ms.float16) pipe = StableDiffusionPipeline.from_pretrained("", unet=unet, mindspore_dtype=ms.float16) -image = pipe(prompt="a man in a straw hat")[0][0] -image.save("a-man-in-a-straw-hat.png") +image = pipe(prompt="a man with a beard and a shirt")[0][0] +image.save("onepiece.png") ``` +We trained 6k steps based on the OnePiece dataset. Here are some of the results of the fine-tuning. + +| a girl with a mask on her face | a man holding a book | a man holding a sword | a man sitting on top of a flower | +|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------:|:------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------:| +| | | | | + +| a man with a beard and a shirt | a man with a knife in his hand | a smiling woman in a helmet | a woman in a white dress | +|:---------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------------:|:------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------:| +| | | | | + + #### Training with Min-SNR weighting We support training with the Min-SNR weighting strategy proposed in [Efficient Diffusion Training via Min-SNR Weighting Strategy](https://arxiv.org/abs/2303.09556) which helps to 
achieve faster convergence
@@ -142,9 +184,6 @@ In a nutshell, LoRA allows adapting pretrained models by adding pairs of rank-de
 
 [cloneofsimo](https://github.com/cloneofsimo) was the first to try out LoRA training for Stable Diffusion in the popular [lora](https://github.com/cloneofsimo/lora) GitHub repository.
 
-With LoRA, it's possible to fine-tune Stable Diffusion on a custom image-caption pair dataset
-on consumer GPUs like Tesla T4, Tesla V100.
-
 ### Training
 
 First, you need to set up your development environment as is explained in the [installation section](#installing-the-dependencies). Make sure to set the `MODEL_NAME` and `DATASET_NAME` environment variables. Here, we will use [Stable Diffusion v1-4](https://hf.co/CompVis/stable-diffusion-v1-4) and the [OnePiece dataset](https://huggingface.co/datasets/YaYaB/onepiece-blip-captions).
 
@@ -170,32 +209,68 @@ python train_text_to_image_lora.py \
   --output_dir="sd-onepiece-model-lora-$(date +%Y%m%d%H%M%S)"
 ```
 
+For parallel training, use `msrun` together with the `--distributed` flag:
+
+```bash
+export MODEL_NAME="CompVis/stable-diffusion-v1-4"
+export DATASET_NAME="YaYaB/onepiece-blip-captions"
+
+msrun --worker_num=8 --local_worker_num=8 --log_dir=$output_dir \
+  train_text_to_image_lora.py \
+  --pretrained_model_name_or_path=$MODEL_NAME \
+  --dataset_name=$DATASET_NAME \
+  --resolution=512 --center_crop --random_flip \
+  --train_batch_size=1 \
+  --num_train_epochs=100 --checkpointing_steps=5000 \
+  --learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
+  --mixed_precision="fp16" \
+  --seed=42 \
+  --distributed \
+  --validation_prompt="a man in a straw hat" \
+  --output_dir="sd-onepiece-model-lora-$(date +%Y%m%d%H%M%S)"
+```
+
 The above command will also run inference as fine-tuning progresses and log the results to local files.
 
-**___Note: When using LoRA we can use a much higher learning rate compared to non-LoRA fine-tuning. Here we use *1e-4* instead of the usual *1e-5*. Also, by using LoRA, it's possible to run `train_text_to_image_lora.py` in consumer GPUs like T4 or V100.___**
+**___Note: When using LoRA we can use a much higher learning rate compared to non-LoRA fine-tuning. Here we use *1e-4* instead of the usual *1e-5*.___**
 
 The final LoRA embedding weights have been uploaded to [sayakpaul/sd-model-finetuned-lora-t4](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4). **___Note: [The final weights](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4/blob/main/pytorch_lora_weights.bin) are only 3 MB in size, which is orders of magnitudes smaller than the original model.___**
 
 You can check some inference samples that were logged during the course of the fine-tuning process [here](https://wandb.ai/sayakpaul/text2image-fine-tune/runs/q4lc0xsw).
 
+### Performance
+
+For the training example above, we trained on the OnePiece dataset and recorded the training speed as follows.
+
+| Method | NPUs | Global<br>
Batch size | Resolution | Precision | Graph Compile | Speed
(ms/step) | FPS
(img/s) |
+|--------|------|------------------------|--------------|-----------|---------------|----------------------|------------------|
+| lora | 1 | 1*1 | 512x512 | FP16 | 1~5 mins | 200 | 5.00 |
+| lora | 8 | 1*8 | 512x512 | FP16 | 1~5 mins | 231 | 34.63 |
+
 ### Inference
 
-Once you have trained a model using above command, the inference can be done simply using the `StableDiffusionPipeline` after loading the trained LoRA weights. You
+If the LoRA weights you want to use are hosted on the Hugging Face Hub, you can point `model_path` at that repository instead, e.g. `model_path = "sayakpaul/sd-model-finetuned-lora-t4"`. Once you have trained a model using the above command, inference can be done simply with the `StableDiffusionPipeline` after loading the trained LoRA weights. You
 need to pass the `output_dir` for loading the LoRA weights which, in this case, is `sd-onepiece-model-lora`.
 
 ```python
 import mindspore as ms
 from mindone.diffusers import StableDiffusionPipeline
 
-model_path = "sayakpaul/sd-model-finetuned-lora-t4"
+model_path = "sd-onepiece-model-lora"
 pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", mindspore_dtype=ms.float16)
 pipe.load_lora_weights(model_path)
 
-prompt = "A pokemon with green eyes and red legs."
+prompt = "a man in a hat and jacket"
 image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5)[0][0]
-image.save("pokemon.png")
+image.save("onepiece.png")
 ```
 
+We trained for 15k steps on the OnePiece dataset. Here are some results of the LoRA fine-tuning.
+
+| a man in a hat and jacket | a man in a yellow coat | a man with a big smile on his face | a man with a hat and mustache |
+|:---:|:---:|:---:|:---:|
+| | | | |
+
 If you are loading the LoRA parameters from the Hub and if the Hub repository has a `base_model` tag (such as [this](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4/blob/main/README.md?code=true#L4)), then you can do:
 
diff --git a/examples/diffusers/text_to_image/README_sdxl.md b/examples/diffusers/text_to_image/README_sdxl.md
index 0e4c38fb84..c6f3488512 100644
--- a/examples/diffusers/text_to_image/README_sdxl.md
+++ b/examples/diffusers/text_to_image/README_sdxl.md
@@ -12,6 +12,9 @@ Before running the scripts, make sure to install the library's training dependen
 
 **Important**
 
+The training script is compute-intensive and only runs on an Ascend 910*. Please run the scripts with CANN version [8.0.RC2.beta1](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.RC2.beta1) and MindSpore version [2.3.0](https://www.mindspore.cn/versions#2.3.0). You can run
+`cat /usr/local/Ascend/ascend-toolkit/latest/version.cfg` to check the CANN version; the output should contain the version number [7.3.0.1.231:8.0.RC2]. If CANN is installed in a custom location, look for `version.cfg` under that installation path instead.
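As a quick sanity check of the environment described above, the commands below print the CANN and MindSpore versions; the CANN path assumes the default install location, so adjust it if yours differs.

```bash
# Print the CANN toolkit version (default install path; adjust if CANN lives elsewhere)
cat /usr/local/Ascend/ascend-toolkit/latest/version.cfg

# Verify that the MindSpore installation works on this machine
python -c "import mindspore; mindspore.run_check()"
```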
+ To make sure you can successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the install up to date as we update the example scripts frequently and install some example-specific requirements. To do this, execute the following steps in a new virtual environment: ```bash @@ -36,6 +39,7 @@ python train_text_to_image_sdxl.py \ --pretrained_model_name_or_path=$MODEL_NAME \ --pretrained_vae_model_name_or_path=$VAE_NAME \ --dataset_name=$DATASET_NAME \ + --enable_xformers_memory_efficient_attention \ --resolution=512 --center_crop --random_flip \ --proportion_empty_prompts=0.2 \ --train_batch_size=1 \ @@ -47,27 +51,97 @@ python train_text_to_image_sdxl.py \ --output_dir="sdxl-onepiece-model-$(date +%Y%m%d%H%M%S)" ``` +For parallel training, use `msrun` and along with `--distributed`: + +```shell +export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0" +export VAE_NAME="madebyollin/sdxl-vae-fp16-fix" +export DATASET_NAME="YaYaB/onepiece-blip-captions" + +msrun --worker_num=8 --local_worker_num=8 --log_dir=$output_dir \ + train_text_to_image_sdxl.py \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --pretrained_vae_model_name_or_path=$VAE_NAME \ + --dataset_name=$DATASET_NAME \ + --enable_xformers_memory_efficient_attention \ + --resolution=512 --center_crop --random_flip \ + --proportion_empty_prompts=0.2 \ + --train_batch_size=1 \ + --max_train_steps=10000 \ + --learning_rate=1e-06 --lr_scheduler="constant" --lr_warmup_steps=0 \ + --mixed_precision="fp16" \ + --validation_prompt="a man in a green coat holding two swords" --validation_epochs 5 \ + --checkpointing_steps=5000 \ + --distributed \ + --output_dir="sdxl-onepiece-model-$(date +%Y%m%d%H%M%S)" +``` + **Notes**: * The `train_text_to_image_sdxl.py` script pre-computes text embeddings and the VAE encodings and keeps them in memory. While for smaller datasets like [`lambdalabs/pokemon-blip-captions`](https://hf.co/datasets/lambdalabs/pokemon-blip-captions), it might not be a problem, it can definitely lead to memory problems when the script is used on a larger dataset. For those purposes, you would want to serialize these pre-computed representations to disk separately and load them during the fine-tuning process. Refer to [this PR](https://github.com/huggingface/diffusers/pull/4505) for a more in-depth discussion. -* The training script is compute-intensive and only runs on an Ascend 910*. * The training command shown above performs intermediate quality validation in between the training epochs. `--report_to`, `--validation_prompt`, and `--validation_epochs` are the relevant CLI arguments here. * SDXL's VAE is known to suffer from numerical instability issues. This is why we also expose a CLI argument namely `--pretrained_vae_model_name_or_path` that lets you specify the location of a better VAE (such as [this one](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix)). +### Performance + +For the above training example, we record the training speed as follows. + +| Method | NPUs | Global
Batch size | Resolution | Precision | Graph Compile | Speed
(s/step) | FPS
(img/s) |
+|---------|------|------------------------|--------------|-----------|---------------|---------------------|------------------|
+| vanilla | 1 | 1*1 | 512x512 | FP16 | 1~5 mins | 0.720 | 1.39 |
+| vanilla | 8 | 1*8 | 512x512 | FP16 | 1~5 mins | 1.148 | 6.97 |
+
 ### Inference
 
 ```python
 from mindone.diffusers import DiffusionPipeline
 import mindspore
 
-model_path = "you-model-id-goes-here" # <-- change this
+model_path = "stabilityai/stable-diffusion-xl-base-1.0" # <-- replace this with the output_dir of your training run to use your fine-tuned model
 pipe = DiffusionPipeline.from_pretrained(model_path, mindspore_dtype=mindspore.float16)
 
-prompt = "a man in a green coat holding two swords"
+prompt = "The boy rides a horse in space"
+image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5)[0][0]
+image.save("The-boy-rides-a-horse-in-space.png")
+```
+
+To change the pipeline's scheduler, use the `from_config()` method to build a different scheduler from the current `pipe.scheduler.config`:
+
+```python
+from mindone.diffusers import EulerAncestralDiscreteScheduler
+
+pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
 image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5)[0][0]
+image.save("The-boy-rides-a-horse-in-space.png")
+```
+
+Here are some images generated with different schedulers.
+
+| DDIMParallelScheduler<br>
(0.86s/step) | DDIMScheduler
(0.8s/step) | LMSDiscreteScheduler
(0.93s/step) | DPMSolverSinglestepScheduler
(0.83s/step) | +|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:| +| | | | | + +Checkpoints only save the unet, so to run inference from a checkpoint, just load the unet. + +```python +import mindspore as ms +from mindone.diffusers import StableDiffusionXLPipeline, UNet2DConditionModel + +model_path = "sdxl-onepiece-model" +unet = UNet2DConditionModel.from_pretrained(model_path + "/checkpoint-/unet", mindspore_dtype=ms.float16) + +pipe = StableDiffusionXLPipeline.from_pretrained("", unet=unet, mindspore_dtype=ms.float16) + +image = pipe(prompt="a man with a beard and a shirt")[0][0] image.save("onepiece.png") ``` +We trained 10k steps based on the OnePiece dataset. Here are some of the results of the fine-tuning. + +| a man in a blue suit and a green hat | a man with a big mouth | a man with glasses on his face | a man with red hair and a cape | +|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------:| +| | | | | + ## LoRA training example for Stable Diffusion XL (SDXL) Low-Rank Adaption of Large Language Models was first introduced by Microsoft in [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) by *Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen*. @@ -80,16 +154,6 @@ In a nutshell, LoRA allows adapting pretrained models by adding pairs of rank-de [cloneofsimo](https://github.com/cloneofsimo) was the first to try out LoRA training for Stable Diffusion in the popular [lora](https://github.com/cloneofsimo/lora) GitHub repository. -With LoRA, it's possible to fine-tune Stable Diffusion on a custom image-caption pair dataset -on consumer GPUs like Tesla T4, Tesla V100. - -> [!WARNING] -> If you're using mindspore 2.2.x, you have to set the `MS_DEV_TRAVERSE_SUBSTITUTIONS_MODE` environment variables to `1` before running the training commands, -> otherwise you'll get a segmentation fault (core dumped). -> ```bash -> export MS_DEV_TRAVERSE_SUBSTITUTIONS_MODE=1 -> ``` - ### Training First, you need to set up your development environment as is explained in the [installation section](#installing-the-dependencies). Make sure to set the `MODEL_NAME` and `DATASET_NAME` environment variables and, optionally, the `VAE_NAME` variable. 
Here, we will use [Stable Diffusion XL 1.0-base](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) and the [OnePiece dataset](https://huggingface.co/datasets/YaYaB/onepiece-blip-captions). @@ -115,12 +179,45 @@ python train_text_to_image_lora_sdxl.py \ --output_dir="sdxl-onepiece-model-lora-$(date +%Y%m%d%H%M%S)" ``` +For parallel training, use `msrun` and along with `--distributed`: + +```shell +export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0" +export VAE_NAME="madebyollin/sdxl-vae-fp16-fix" +export DATASET_NAME="YaYaB/onepiece-blip-captions" + +msrun --worker_num=8 --local_worker_num=8 --log_dir=$output_dir \ + train_text_to_image_lora_sdxl.py \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --pretrained_vae_model_name_or_path=$VAE_NAME \ + --dataset_name=$DATASET_NAME \ + --resolution=1024 --center_crop --random_flip \ + --train_batch_size=1 \ + --num_train_epochs=2 --checkpointing_steps=500 \ + --learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \ + --mixed_precision="fp16" \ + --seed=42 \ + --validation_prompt="a man in a green coat holding two swords" \ + --distributed \ + --output_dir="sdxl-onepiece-model-lora-$(date +%Y%m%d%H%M%S)" +``` + The above command will also run inference as fine-tuning progresses and log the results to local files. **Notes**: * SDXL's VAE is known to suffer from numerical instability issues. This is why we also expose a CLI argument namely `--pretrained_vae_model_name_or_path` that lets you specify the location of a better VAE (such as [this one](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix)). +### Performance + +For the above training example, we record the training speed as follows. + +| Method | NPUs | Global
Batch size | Resolution | Precision | Graph Compile | Speed
(s/step) | FPS
(img/s) | +|--------|------|------------------------|--------------|-----------|---------------|---------------------|-----------------| +| lora | 1 | 1*1 | 1024x1024 | FP16 | 15~20 min | 0.828 | 1.21 | +| lora | 8 | 1*8 | 1024x1024 | FP16 | 15~20 min | 0.907 | 8.82 | + + ### Finetuning the text encoder and UNet The script also allows you to finetune the `text_encoder` along with the `unet`. @@ -140,23 +237,65 @@ python train_text_to_image_lora_sdxl.py \ --seed=42 \ --validation_prompt="a man in a green coat holding two swords" \ --train_text_encoder \ - --output_dir="sdxl-onepiece-model-lora-txt-$(date +%Y%m%d%H%M%S)" + --output_dir="sdxl-onepiece-model-lora-$(date +%Y%m%d%H%M%S)" +``` + +For parallel training, use `msrun` and along with `--distributed`: + +```shell +msrun --worker_num=8 --local_worker_num=8 --log_dir=$output_dir \ + train_text_to_image_lora_sdxl.py \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --dataset_name=$DATASET_NAME \ + --resolution=1024 --center_crop --random_flip \ + --train_batch_size=1 \ + --num_train_epochs=2 --checkpointing_steps=500 \ + --learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \ + --seed=42 \ + --validation_prompt="a man in a green coat holding two swords" \ + --train_text_encoder \ + --distributed \ + --output_dir="sdxl-onepiece-model-lora-$(date +%Y%m%d%H%M%S)" ``` +### Performance + +For the above training example, we record the training speed as follows. + +| Method | NPUs | Global
Batch size | Resolution | Precision | Graph Compile | Speed
(s/step) | FPS
(img/s) |
+|--------|------|------------------------|--------------|-----------|---------------|---------------------|------------------|
+| lora | 1 | 1*1 | 1024x1024 | FP16 | 15~20 mins | 0.951 | 1.05 |
+| lora | 1 | 1*1 | 1024x1024 | BF16 | 15~20 mins | 0.994 | 1.01 |
+| lora | 1 | 1*1 | 1024x1024 | FP32 | 15~20 mins | 1.89 | 0.53 |
+
 ### Inference
 
-Once you have trained a model using above command, the inference can be done simply using the `DiffusionPipeline` after loading the trained LoRA weights. You
+If the LoRA weights you want to use are hosted on the Hugging Face Hub, you can point `model_path` at that repository instead, e.g. `model_path = "takuoko/sd-pokemon-model-lora-sdxl"`. Once you have trained a model using the above command, inference can be done simply with the `DiffusionPipeline` after loading the trained LoRA weights. You
 need to pass the `output_dir` for loading the LoRA weights which, in this case, is `sdxl-onepiece-model-lora`.
 
 ```python
 import mindspore as ms
 from mindone.diffusers import DiffusionPipeline
 
-model_path = "takuoko/sd-pokemon-model-lora-sdxl"
+model_path = "sdxl-onepiece-model-lora"
 pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16)
 pipe.load_lora_weights(model_path)
 
-prompt = "A pokemon with green eyes and red legs."
+prompt = "a guy with green hair"
 image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5)[0][0]
-image.save("pokemon.png")
+image.save("onepiece.png")
 ```
+
+We trained for 8.5k steps on the OnePiece dataset. Here are some results of the LoRA fine-tuning.
+
+| a cartoon character with a sword | a girl with a mask on her face | a guy with green hair | a lion sitting on the ground |
+|:---:|:---:|:---:|:---:|
+| | | | |
+
+| a man holding a book | a man in a cowboy hat | a man in a hat and jacket | a man in a yellow coat |
+|:---:|:---:|:---:|:---:|
+| | | | |
+
+| a man sitting in a chair | a man with a big beard | a man with green hair and a white shirt | a smiling woman in a helmet |
+|:-----------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------:| +| | | | | From 9e0e7bffa5b0a6fd8ea3a22cd413737952a48630 Mon Sep 17 00:00:00 2001 From: liuchuting Date: Mon, 23 Sep 2024 14:19:10 +0800 Subject: [PATCH 2/2] Complete the readme of the diffusers sdxl. --- examples/diffusers/text_to_image/README.md | 58 ++++---------- .../diffusers/text_to_image/README_sdxl.md | 76 ++----------------- 2 files changed, 21 insertions(+), 113 deletions(-) diff --git a/examples/diffusers/text_to_image/README.md b/examples/diffusers/text_to_image/README.md index 56f07d7d09..ff9c0307e9 100644 --- a/examples/diffusers/text_to_image/README.md +++ b/examples/diffusers/text_to_image/README.md @@ -15,9 +15,6 @@ Before running the scripts, make sure to install the library's training dependen **Important** -The training script is compute-intensive and only runs on an Ascend 910*. Please run the scripts with CANN version ([CANN 8.0.RC2.beta1](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.RC2.beta1)) and MindSpore version ([MS 2.3.0](https://www.mindspore.cn/versions#2.3.0)). You can use -`cat /usr/local/Ascend/ascend-toolkit/latest/version.cfg` check the CANN version and you can see the specific version number [7.3.0.1.231:8.0.RC2]. If you have a custom installation path for CANN, find the `version.cfg` in your own CANN installation path to verify the version. - To make sure you can successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the install up to date as we update the example scripts frequently and install some example-specific requirements. To do this, execute the following steps in a new virtual environment: ```bash git clone https://github.com/mindspore-lab/mindone @@ -30,6 +27,8 @@ Then cd in the example folder `examples/diffusers/text_to_image` and run pip install -r requirements.txt ``` +The training script is compute-intensive and only runs on an Ascend 910*. Please run the scripts with MindSpore version(MS2.3.0). + ### OnePiece example You need to accept the model license before downloading or using the weights. In this example we'll use model version `v1-4`, so you'll need to visit [its card](https://huggingface.co/CompVis/stable-diffusion-v1-4), read the license and tick the checkbox if you agree. @@ -44,6 +43,8 @@ huggingface-cli login If you have already cloned the repo, then you won't need to go through these steps. +
+ #### Hardware With `gradient_checkpointing` and `mixed_precision` it should be possible to fine tune the model on a single 24GB NPU. For higher `batch_size` and faster training it's better to use NPUs with >30GB memory. @@ -108,15 +109,6 @@ msrun --worker_num=8 --local_worker_num=8 --log_dir=$output_dir \ --output_dir="sd-your-dataset-model-$(date +%Y%m%d%H%M%S)" ``` -### Performance - -For the training example above, we trained on the OnePiece dataset and recorded the training speed as follows. - -| Method | NPUs | Global
Batch size | Resolution | Precision | Graph Compile | Speed
(ms/step) | FPS
(img/s) | -|---------|------|------------------------|--------------|-----------|---------------|----------------------|------------------| -| vanilla | 1 | 1*1 | 512x512 | FP16 | 1~5 mins | 260 | 3.85 | -| vanilla | 8 | 1*8 | 512x512 | FP16 | 1~5 mins | 404 | 19.8 | - Once the training is finished the model will be saved in the `output_dir` specified in the command. In this example it's `sd-onepiece-model`. To load the fine-tuned model for inference just pass that path to `StableDiffusionPipeline` ```python @@ -126,8 +118,8 @@ from mindone.diffusers import StableDiffusionPipeline model_path = "sd-onepiece-model" pipe = StableDiffusionPipeline.from_pretrained(model_path, mindspore_dtype=ms.float16) -image = pipe(prompt="a man with a beard and a shirt")[0][0] -image.save("onepiece.png") +image = pipe(prompt="a man in a straw hat")[0][0] +image.save("a-man-in-a-straw-hat.png") ``` Checkpoints only save the unet, so to run inference from a checkpoint, just load the unet @@ -141,21 +133,10 @@ unet = UNet2DConditionModel.from_pretrained(model_path + "/checkpoint-/unet", pipe = StableDiffusionPipeline.from_pretrained("", unet=unet, mindspore_dtype=ms.float16) -image = pipe(prompt="a man with a beard and a shirt")[0][0] -image.save("onepiece.png") +image = pipe(prompt="a man in a straw hat")[0][0] +image.save("a-man-in-a-straw-hat.png") ``` -We trained 6k steps based on the OnePiece dataset. Here are some of the results of the fine-tuning. - -| a girl with a mask on her face | a man holding a book | a man holding a sword | a man sitting on top of a flower | -|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------:|:------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------:| -| | | | | - -| a man with a beard and a shirt | a man with a knife in his hand | a smiling woman in a helmet | a woman in a white dress | -|:---------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------------:|:------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------:| -| | | | | - - #### Training with Min-SNR weighting We support training with the Min-SNR weighting strategy proposed in [Efficient Diffusion Training via Min-SNR Weighting Strategy](https://arxiv.org/abs/2303.09556) which helps to achieve faster convergence @@ -238,15 +219,6 @@ The final LoRA embedding weights have been uploaded to [sayakpaul/sd-model-finet You can check some inference samples that were logged during the course of the fine-tuning process [here](https://wandb.ai/sayakpaul/text2image-fine-tune/runs/q4lc0xsw). 
-### Performance - -For the training example above, we trained on the OnePiece dataset and recorded the training speed as follows. - -| Method | NPUs | Global
Batch size | Resolution | Precision | Graph Compile | Speed
(ms/step) | FPS
(img/s) | -|--------|------|------------------------|--------------|-----------|---------------|----------------------|------------------| -| lora | 1 | 1*1 | 512x512 | FP16 | 1~5 mins | 200 | 5.00 | -| lora | 8 | 1*8 | 512x512 | FP16 | 1~5 mins | 231 | 34.63 | - ### Inference If the LoRA weights you want to use is from huggingface, you can replace the following model_path like `model_path = "sayakpaul/sd-model-finetuned-lora-t4"`. Once you have trained a model using above command, the inference can be done simply using the `StableDiffusionPipeline` after loading the trained LoRA weights. You @@ -260,17 +232,11 @@ model_path = "sd-onepiece-model-lora" pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", mindspore_dtype=ms.float16) pipe.load_lora_weights(model_path) -prompt = "a man in a hat and jacket" +prompt = "a man in a straw hat" image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5)[0][0] -image.save(f"onepiece.png") +image.save(f"a-man-in-a-straw-hat.png") ``` -We trained 15k steps based on the OnePiece dataset. Here are some of the results of the lora fine-tuning. - -| a man in a hat and jacket | a man in a yellow coat | a man with a big smile on his face | a man with a hat and mustache | -|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------:| -| | | | | - If you are loading the LoRA parameters from the Hub and if the Hub repository has a `base_model` tag (such as [this](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4/blob/main/README.md?code=true#L4)), then you can do: @@ -289,3 +255,7 @@ pipe = StableDiffusionPipeline.from_pretrained(base_model_id, mindspore_dtype=ms ## Stable Diffusion XL * We support fine-tuning the UNet shipped in [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) via the `train_text_to_image_sdxl.py` script. Please refer to the docs [here](./README_sdxl.md). + +## Tutorials + +The above training performance and inference results are recorded [here](https://github.com/liuchuting/tutorials/blob/sd_doc/aigc/diffusers/text_to_image/sd_performance_and_inference_results.md). diff --git a/examples/diffusers/text_to_image/README_sdxl.md b/examples/diffusers/text_to_image/README_sdxl.md index c6f3488512..41cfe57150 100644 --- a/examples/diffusers/text_to_image/README_sdxl.md +++ b/examples/diffusers/text_to_image/README_sdxl.md @@ -12,9 +12,6 @@ Before running the scripts, make sure to install the library's training dependen **Important** -The training script is compute-intensive and only runs on an Ascend 910*. Please run the scripts with CANN version ([CANN 8.0.RC2.beta1](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.RC2.beta1)) and MindSpore version ([MS 2.3.0](https://www.mindspore.cn/versions#2.3.0)). You can use -`cat /usr/local/Ascend/ascend-toolkit/latest/version.cfg` check the CANN version and you can see the specific version number [7.3.0.1.231:8.0.RC2]. 
If you have a custom installation path for CANN, find the `version.cfg` in your own CANN installation path to verify the version. - To make sure you can successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the install up to date as we update the example scripts frequently and install some example-specific requirements. To do this, execute the following steps in a new virtual environment: ```bash @@ -28,6 +25,8 @@ Then cd in the `examples/diffusers/text_to_image` folder and run pip install -r requirements_sdxl.txt ``` +The training script is compute-intensive and only runs on an Ascend 910*. Please run the scripts with MindSpore version(MS2.3.0). + ### Training ```bash @@ -82,45 +81,20 @@ msrun --worker_num=8 --local_worker_num=8 --log_dir=$output_dir \ * The training command shown above performs intermediate quality validation in between the training epochs. `--report_to`, `--validation_prompt`, and `--validation_epochs` are the relevant CLI arguments here. * SDXL's VAE is known to suffer from numerical instability issues. This is why we also expose a CLI argument namely `--pretrained_vae_model_name_or_path` that lets you specify the location of a better VAE (such as [this one](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix)). -### Performance - -For the above training example, we record the training speed as follows. - -| Method | NPUs | Global
Batch size | Resolution | Precision | Graph Compile | Speed
(s/step) | FPS
(img/s) | -|---------|------|------------------------|--------------|-----------|---------------|---------------------|------------------| -| vanilla | 1 | 1*1 | 512x512 | FP16 | 1~5 mins | 0.720 | 1.39 | -| vanilla | 8 | 1*8 | 512x512 | FP16 | 1~5 mins | 1.148 | 6.97 | - ### Inference ```python from mindone.diffusers import DiffusionPipeline import mindspore -model_path = "stabilityai/stable-diffusion-xl-base-1.0" # <-- You can modify the model path of your training here. +model_path = "sdxl-onepiece-model" # <-- change this pipe = DiffusionPipeline.from_pretrained(model_path, mindspore_dtype=mindspore.float16) -prompt = "The boy rides a horse in space" +prompt = "a man with a beard and a shirt" image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5)[0][0] -image.save("The-boy-rides-a-horse-in-space.png") -``` - -To change the pipelines scheduler, use the from_config() method to load a different scheduler's pipeline.scheduler.config into the pipeline. - -```python -from mindone.diffusers import EulerAncestralDiscreteScheduler - -pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipeline.scheduler.config) -image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5)[0][0] -image.save("The-boy-rides-a-horse-in-space.png") +image.save("onepiece.png") ``` -Here are some images generated by inference under different Schedulers. - -| DDIMParallelScheduler
(0.86s/step) | DDIMScheduler
(0.8s/step) | LMSDiscreteScheduler
(0.93s/step) | DPMSolverSinglestepScheduler
(0.83s/step) | -|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:| -| | | | | - Checkpoints only save the unet, so to run inference from a checkpoint, just load the unet. ```python @@ -136,12 +110,6 @@ image = pipe(prompt="a man with a beard and a shirt")[0][0] image.save("onepiece.png") ``` -We trained 10k steps based on the OnePiece dataset. Here are some of the results of the fine-tuning. - -| a man in a blue suit and a green hat | a man with a big mouth | a man with glasses on his face | a man with red hair and a cape | -|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------:| -| | | | | - ## LoRA training example for Stable Diffusion XL (SDXL) Low-Rank Adaption of Large Language Models was first introduced by Microsoft in [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) by *Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen*. @@ -208,16 +176,6 @@ The above command will also run inference as fine-tuning progresses and log the * SDXL's VAE is known to suffer from numerical instability issues. This is why we also expose a CLI argument namely `--pretrained_vae_model_name_or_path` that lets you specify the location of a better VAE (such as [this one](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix)). -### Performance - -For the above training example, we record the training speed as follows. - -| Method | NPUs | Global
Batch size | Resolution | Precision | Graph Compile | Speed
(s/step) | FPS
(img/s) | -|--------|------|------------------------|--------------|-----------|---------------|---------------------|-----------------| -| lora | 1 | 1*1 | 1024x1024 | FP16 | 15~20 min | 0.828 | 1.21 | -| lora | 8 | 1*8 | 1024x1024 | FP16 | 15~20 min | 0.907 | 8.82 | - - ### Finetuning the text encoder and UNet The script also allows you to finetune the `text_encoder` along with the `unet`. @@ -258,16 +216,6 @@ msrun --worker_num=8 --local_worker_num=8 --log_dir=$output_dir \ --output_dir="sdxl-onepiece-model-lora-$(date +%Y%m%d%H%M%S)" ``` -### Performance - -For the above training example, we record the training speed as follows. - -| Method | NPUs | Global
Batch size | Resolution | Precision | Graph Compile | Speed
(s/step) | FPS
(img/s) | -|--------|------|------------------------|--------------|-----------|---------------|---------------------|------------------| -| lora | 1 | 1*1 | 1024x1024 | FP16 | 15~20 mins | 0.951 | 1.05 | -| lora | 1 | 1*1 | 1024x1024 | BF16 | 15~20 mins | 0.994 | 1.01 | -| lora | 1 | 1*1 | 1024x1024 | FP32 | 15~20 mins | 1.89 | 0.53 | - ### Inference If the LoRA weights you want to use is from huggingface, you can replace the following model_path like `model_path = "takuoko/sd-pokemon-model-lora-sdxl"`. Once you have trained a model using above command, the inference can be done simply using the `DiffusionPipeline` after loading the trained LoRA weights. You @@ -286,16 +234,6 @@ image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5)[0][0] image.save("onepiece.png") ``` -We trained 8.5k steps based on the OnePiece dataset. Here are some of the results of the lora fine-tuning. - -| a cartoon character with a sword | a girl with a mask on her face | a guy with green hair | a lion sitting on the ground | -|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------------:| -| | | | | - -| a man holding a book | a man in a cowboy hat | a man in a hat and jacket | a man in a yellow coat | -|:-------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------:|:------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------:| -| | | | | +## Tutorials -| a man sitting in a chair | a man with a big beard | a man with green hair and a white shirt | a smiling woman in a helmet | -|:-----------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------:| -| | | | | +The above training performance and inference results are recorded [here](https://github.com/liuchuting/tutorials/blob/sd_doc/aigc/diffusers/text_to_image/sdxl_performance_and_inference_results.md).
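The SDXL LoRA inference example above loads weights from a local `output_dir`. As with the Stable Diffusion example earlier, the same weights can also be pulled from a Hub repository once uploaded; the sketch below assumes a hypothetical repo id `your-username/sdxl-onepiece-model-lora`.

```python
import mindspore as ms
from mindone.diffusers import DiffusionPipeline

# Hypothetical Hub repo id; replace it with the repository holding your uploaded LoRA weights.
lora_model_id = "your-username/sdxl-onepiece-model-lora"

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16)
pipe.load_lora_weights(lora_model_id)

image = pipe("a man in a green coat holding two swords", num_inference_steps=30, guidance_scale=7.5)[0][0]
image.save("sdxl-lora-from-hub.png")
```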