Update the readme of the diffusers SDXL & SD.
liuchuting committed Sep 10, 2024
1 parent a4df8a0 commit 01c2f01
Showing 2 changed files with 225 additions and 34 deletions.
105 changes: 90 additions & 15 deletions examples/diffusers/text_to_image/README.md
Before running the scripts, make sure to install the library's training dependencies:

**Important**

The training script is compute-intensive and only runs on Ascend 910* hardware. Please run the scripts with CANN version ([CANN 8.0.RC2.beta1](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.RC2.beta1)) and MindSpore version ([MS 2.3.0](https://www.mindspore.cn/versions#2.3.0)). You can run `cat /usr/local/Ascend/ascend-toolkit/latest/version.cfg` to check the CANN version; the output should contain the version number [7.3.0.1.231:8.0.RC2]. If you have installed CANN in a custom path, check the `version.cfg` under that installation path instead.
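
For a quick sanity check of both requirements (the CANN path below is the default install location; adjust it if yours differs):

```bash
# CANN: the output should contain the version tag 7.3.0.1.231:8.0.RC2.
cat /usr/local/Ascend/ascend-toolkit/latest/version.cfg

# MindSpore: this should print 2.3.0.
python -c "import mindspore; print(mindspore.__version__)"
```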

To make sure you can successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the install up to date as we update the example scripts frequently and install some example-specific requirements. To do this, execute the following steps in a new virtual environment:
```bash
git clone https://github.com/mindspore-lab/mindone
```

If you have already cloned the repo, then you won't need to go through these steps.


#### Hardware

With `gradient_checkpointing` and `mixed_precision`, it should be possible to fine-tune the model on a single 24GB NPU. For a higher `batch_size` and faster training, it's better to use NPUs with more than 30GB of memory.
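
As a rough sketch of such a low-memory run (the `--gradient_checkpointing` and `--mixed_precision` flag names follow the upstream diffusers trainer this script is ported from; the dataset path and output name are placeholders):

```bash
python train_text_to_image.py \
--pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" \
--train_data_dir="path_to_your_dataset" \
--resolution=512 \
--train_batch_size=1 \
--gradient_checkpointing \
--mixed_precision="fp16" \
--max_train_steps=100 \
--output_dir="sd-low-memory-test"
```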

To train on a dataset stored locally, point `TRAIN_DIR` at it and run:

```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export TRAIN_DIR="path_to_your_dataset"

python train_text_to_image.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--train_data_dir=$TRAIN_DIR \
--resolution=512 --center_crop --random_flip \
--train_batch_size=1 \
--max_train_steps=15000 \
--learning_rate=1e-05 \
--max_grad_norm=1 \
--mixed_precision="fp16" \
--lr_scheduler="constant" --lr_warmup_steps=0 \
--output_dir="sd-your-dataset-model-$(date +%Y%m%d%H%M%S)"
```

For parallel training, launch the script with `msrun` and pass the `--distributed` flag:

```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export TRAIN_DIR="path_to_your_dataset"

msrun --worker_num=8 --local_worker_num=8 --log_dir=$output_dir \
train_text_to_image.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--train_data_dir=$TRAIN_DIR \
--resolution=512 --center_crop --random_flip \
--train_batch_size=1 \
--max_train_steps=15000 \
--learning_rate=1e-05 \
--max_grad_norm=1 \
--mixed_precision="fp16" \
--distributed \
--lr_scheduler="constant" --lr_warmup_steps=0 \
--output_dir="sd-your-dataset-model-$(date +%Y%m%d%H%M%S)"
```
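
Note that `$output_dir` in the `msrun` command is not defined by the exports above; point it at any writable directory for the per-worker logs before launching, for example:

```bash
# Illustrative name and path: msrun writes its per-worker log files here.
export output_dir="./msrun_logs"
```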

### Performance

For the training example above, we trained on the OnePiece dataset and recorded the training speed as follows.

| Method | NPUs | Global <br/>Batch size | Resolution | Precision | Graph Compile | Speed <br/>(ms/step) | FPS <br/>(img/s) |
|---------|------|------------------------|--------------|-----------|---------------|---------------------|------------------|
| vanilla | 1 | 1*1 | 512x512 | FP16 | 1~5 mins | 260 | 3.85 |
| vanilla | 8 | 1*8 | 512x512 | FP16 | 1~5 mins | 404 | 19.8 |
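
For reference, the FPS column is simply the global batch size divided by the step time (converted from milliseconds to seconds):

```python
# Sanity check of the FPS column: global batch size / step time in seconds.
print(1 / 0.260)  # ~3.85 img/s for 1 NPU at 260 ms/step
print(8 / 0.404)  # ~19.8 img/s for 8 NPUs at 404 ms/step
```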

Once the training is finished, the model will be saved in the `output_dir` specified in the command; in this example it is `sd-onepiece-model`. To load the fine-tuned model for inference, just pass that path to `StableDiffusionPipeline`:

```python
import mindspore as ms
from mindone.diffusers import StableDiffusionPipeline

model_path = "path_to_saved_model"
model_path = "sd-onepiece-model"
pipe = StableDiffusionPipeline.from_pretrained(model_path, mindspore_dtype=ms.float16)

image = pipe(prompt="a man with a beard and a shirt")[0][0]
image.save("onepiece.png")
```
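
Continuing from the snippet above, several prompts can be generated in one pass over a list (the prompts and file names are just illustrative):

```python
# Reuses `pipe` from the previous snippet.
prompts = ["a man in a straw hat", "a woman in a white dress"]
for idx, prompt in enumerate(prompts):
    image = pipe(prompt=prompt)[0][0]
    image.save(f"onepiece_{idx}.png")
```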

Checkpoints only save the `unet`, so to run inference from a checkpoint, just load the `unet`:

```python
import mindspore as ms
from mindone.diffusers import StableDiffusionPipeline, UNet2DConditionModel

model_path = "path_to_saved_model"
model_path = "sd-onepiece-model"
unet = UNet2DConditionModel.from_pretrained(model_path + "/checkpoint-<N>/unet", mindspore_dtype=ms.float16)

pipe = StableDiffusionPipeline.from_pretrained("<initial model>", unet=unet, mindspore_dtype=ms.float16)

image = pipe(prompt="a man with a beard and a shirt")[0][0]
image.save("onepiece.png")
```
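
Here `<N>` is the global step at which the checkpoint was saved. A quick way to see which checkpoints exist (a minimal sketch, assuming the default `checkpoint-<step>` folder naming):

```python
import os

model_path = "sd-onepiece-model"
checkpoints = [d for d in os.listdir(model_path) if d.startswith("checkpoint-")]
print(sorted(checkpoints, key=lambda name: int(name.split("-")[-1])))
```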

We trained for 6k steps on the OnePiece dataset. Here are some of the results of the fine-tuning.

| a girl with a mask on her face | a man holding a book | a man holding a sword | a man sitting on top of a flower |
|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------:|:------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| <img src="https://github.com/liuchuting/mindone/blob/image/examples/diffusers/text_to_image/images/sd_base_infer/a_girl_with_a_mask_on_her_face.png?raw=true" width=224> | <img src="https://github.com/liuchuting/mindone/blob/image/examples/diffusers/text_to_image/images/sd_base_infer/a_man_holding_a_book.png" width=224> | <img src="https://github.com/liuchuting/mindone/blob/image/examples/diffusers/text_to_image/images/sd_base_infer/a_man_holding_a_sword.png" width=224> | <img src="https://github.com/liuchuting/mindone/blob/image/examples/diffusers/text_to_image/images/sd_base_infer/a_man_sitting_on_top_of_a_flower.png" width=224> |

| a man with a beard and a shirt | a man with a knife in his hand | a smiling woman in a helmet | a woman in a white dress |
|:---------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------------:|:------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------:|
| <img src="https://github.com/liuchuting/mindone/blob/image/examples/diffusers/text_to_image/images/sd_base_infer/a_man_with_a_beard_and_a_shirt.png" width=224> | <img src="https://github.com/liuchuting/mindone/blob/image/examples/diffusers/text_to_image/images/sd_base_infer/a_man_with_a_knife_in_his_hand.png" width=224> | <img src="https://github.com/liuchuting/mindone/blob/image/examples/diffusers/text_to_image/images/sd_base_infer/a_smiling_woman_in_a_helmet.png" width=224> | <img src="https://github.com/liuchuting/mindone/blob/image/examples/diffusers/text_to_image/images/sd_base_infer/a_woman_in_a_white_dress.png" width=224> |


#### Training with Min-SNR weighting

We support training with the Min-SNR weighting strategy proposed in [Efficient Diffusion Training via Min-SNR Weighting Strategy](https://arxiv.org/abs/2303.09556), which helps to achieve faster convergence by rebalancing the loss.
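
Conceptually, Min-SNR weighting clips each timestep's loss weight at a threshold γ so that easy, high-SNR timesteps no longer dominate the objective. A minimal sketch of the per-timestep weight used for ε-prediction (γ = 5.0 is the value recommended in the paper):

```python
import numpy as np

def min_snr_weight(snr, gamma=5.0):
    # Weight applied to the per-timestep MSE loss: min(SNR, gamma) / SNR.
    # Low-SNR timesteps keep weight 1.0; high-SNR (easy) ones are down-weighted.
    return np.minimum(snr, gamma) / snr

print(min_snr_weight(np.array([0.1, 1.0, 5.0, 50.0])))  # -> [1.  1.  1.  0.1]
```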

## Training with LoRA

In a nutshell, LoRA allows adapting pretrained models by adding pairs of rank-decomposition matrices to existing weights and only training those newly added weights.

[cloneofsimo](https://github.com/cloneofsimo) was the first to try out LoRA training for Stable Diffusion in the popular [lora](https://github.com/cloneofsimo/lora) GitHub repository.

### Training

First, you need to set up your development environment as is explained in the [installation section](#installing-the-dependencies). Make sure to set the `MODEL_NAME` and `DATASET_NAME` environment variables. Here, we will use [Stable Diffusion v1-4](https://hf.co/CompVis/stable-diffusion-v1-4) and the [OnePiece dataset](https://huggingface.co/datasets/YaYaB/onepiece-blip-captions).

Then launch the LoRA fine-tuning with:

```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATASET_NAME="YaYaB/onepiece-blip-captions"

python train_text_to_image_lora.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$DATASET_NAME \
--resolution=512 --center_crop --random_flip \
--train_batch_size=1 \
--num_train_epochs=100 --checkpointing_steps=5000 \
--learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
--mixed_precision="fp16" \
--seed=42 \
--validation_prompt="a man in a straw hat" \
--output_dir="sd-onepiece-model-lora-$(date +%Y%m%d%H%M%S)"
```

For parallel training, launch the script with `msrun` and pass the `--distributed` flag:

```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATASET_NAME="YaYaB/onepiece-blip-captions"

msrun --worker_num=8 --local_worker_num=8 --log_dir=$output_dir \
train_text_to_image_lora.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$DATASET_NAME \
--resolution=512 --center_crop --random_flip \
--train_batch_size=1 \
--num_train_epochs=100 --checkpointing_steps=5000 \
--learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
--mixed_precision="fp16" \
--seed=42 \
--distributed \
--validation_prompt="a man in a straw hat" \
--output_dir="sd-onepiece-model-lora-$(date +%Y%m%d%H%M%S)"
```

The above command will also run inference as fine-tuning progresses and log the results to local files.

**___Note: When using LoRA we can use a much higher learning rate compared to non-LoRA fine-tuning. Here we use *1e-4* instead of the usual *1e-5*. Also, by using LoRA, it's possible to run `train_text_to_image_lora.py` on consumer GPUs like T4 or V100.___**
The final LoRA embedding weights have been uploaded to [sayakpaul/sd-model-finetuned-lora-t4](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4).

You can check some inference samples that were logged during the course of the fine-tuning process [here](https://wandb.ai/sayakpaul/text2image-fine-tune/runs/q4lc0xsw).

### Performance

For the training example above, we trained on the OnePiece dataset and recorded the training speed as follows.

| Method | NPUs | Global <br/>Batch size | Resolution | Precision | Graph Compile | Speed <br/>(ms/step) | FPS <br/>(img/s) |
|--------|------|------------------------|--------------|-----------|---------------|---------------------|------------------|
| lora | 1 | 1*1 | 512x512 | FP16 | 1~5 mins | 200 | 5.00 |
| lora | 8 | 1*8 | 512x512 | FP16 | 1~5 mins | 231 | 34.63 |

### Inference

Once you have trained a model using the above command, inference can be done simply with the `StableDiffusionPipeline` after loading the trained LoRA weights. You need to pass the `output_dir` used when training the LoRA weights, which in this case is `sd-onepiece-model-lora`. If the LoRA weights you want to use are hosted on the Hugging Face Hub, set `model_path` to that repo id instead, for example `model_path = "sayakpaul/sd-model-finetuned-lora-t4"`.

```python
import mindspore as ms
from mindone.diffusers import StableDiffusionPipeline

model_path = "sayakpaul/sd-model-finetuned-lora-t4"
model_path = "sd-onepiece-model-lora"
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", mindspore_dtype=ms.float16)
pipe.load_lora_weights(model_path)

prompt = "A pokemon with green eyes and red legs."
prompt = "a man in a hat and jacket"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5)[0][0]
image.save("pokemon.png")
image.save(f"onepiece.png")
```
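
If the adapter effect is too strong or too weak, its contribution can be scaled at inference time. The snippet below assumes the mindone port mirrors the upstream diffusers LoRA-scale keyword, so treat it as a sketch rather than a guaranteed API:

```python
# Continuing from the snippet above: apply the LoRA at half strength.
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5,
             cross_attention_kwargs={"scale": 0.5})[0][0]
image.save("onepiece_half_lora.png")
```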

We trained for 15k steps on the OnePiece dataset. Here are some of the results of the LoRA fine-tuning.

| a man in a hat and jacket | a man in a yellow coat | a man with a big smile on his face | a man with a hat and mustache |
|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| <img src="https://github.com/liuchuting/mindone/blob/image/examples/diffusers/text_to_image/images/sd_lora_infer/a_man_in_a_hat_and_jacket.png?raw=true" width=224> | <img src="https://github.com/liuchuting/mindone/blob/image/examples/diffusers/text_to_image/images/sd_lora_infer/a_man_in_a_yellow_coat.png" width=224> | <img src="https://github.com/liuchuting/mindone/blob/image/examples/diffusers/text_to_image/images/sd_lora_infer/a_man_with_a_big_smile_on_his_face.png" width=224> | <img src="https://github.com/liuchuting/mindone/blob/image/examples/diffusers/text_to_image/images/sd_lora_infer/a_man_with_a_hat_and_mustache.png" width=224> |

If you are loading the LoRA parameters from the Hub and if the Hub repository has
a `base_model` tag (such as [this](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4/blob/main/README.md?code=true#L4)), then
you can do:
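
(The snippet below is a sketch adapted from the upstream diffusers example; it assumes the repo card exposes a `base_model` field and that `huggingface_hub` is installed.)

```python
import mindspore as ms
from huggingface_hub.repocard import RepoCard
from mindone.diffusers import StableDiffusionPipeline

lora_model_id = "sayakpaul/sd-model-finetuned-lora-t4"
base_model_id = RepoCard.load(lora_model_id).data.to_dict()["base_model"]

pipe = StableDiffusionPipeline.from_pretrained(base_model_id, mindspore_dtype=ms.float16)
pipe.load_lora_weights(lora_model_id)

image = pipe("a man in a straw hat", num_inference_steps=30, guidance_scale=7.5)[0][0]
image.save("onepiece_from_hub.png")
```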