Update the README of the diffusers SDXL & SD. #655

Open · wants to merge 2 commits into `master`
65 changes: 55 additions & 10 deletions examples/diffusers/text_to_image/README.md
@@ -27,6 +27,8 @@ Then cd in the example folder `examples/diffusers/text_to_image` and run
pip install -r requirements.txt
```

The training script is compute-intensive and only runs on Ascend 910* devices. Please run the scripts with MindSpore 2.3.0.
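
As a quick environment sanity check before launching training, something like the following can be used (a minimal sketch, not part of the training scripts):

```python
import mindspore as ms

# Confirm the installed MindSpore version (2.3.0 is expected for these scripts).
print(ms.__version__)

# Select the Ascend backend; the scripts assume Ascend 910* hardware.
ms.set_context(device_target="Ascend")
```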

### OnePiece example

You need to accept the model license before downloading or using the weights. In this example we'll use model version `v1-4`, so you'll need to visit [its card](https://huggingface.co/CompVis/stable-diffusion-v1-4), read the license and tick the checkbox if you agree.
@@ -86,13 +88,34 @@ python train_text_to_image.py \
--output_dir="sd-your-dataset-model-$(date +%Y%m%d%H%M%S)"
```

For parallel training, use `msrun` together with the `--distributed` flag (here `$output_dir` is a directory of your choice for the `msrun` worker logs):

```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export TRAIN_DIR="path_to_your_dataset"

msrun --worker_num=8 --local_worker_num=8 --log_dir=$output_dir \
train_text_to_image.py \
--pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$TRAIN_DIR \
--resolution=512 --center_crop --random_flip \
--train_batch_size=1 \
--max_train_steps=15000 \
--learning_rate=1e-05 \
--max_grad_norm=1 \
--mixed_precision="fp16" \
--distributed \
--lr_scheduler="constant" --lr_warmup_steps=0 \
--output_dir="sd-your-dataset-model-$(date +%Y%m%d%H%M%S)"
```

Once the training is finished, the model will be saved in the `output_dir` specified in the command. In this example it's `sd-onepiece-model`. To load the fine-tuned model for inference, just pass that path to `StableDiffusionPipeline`:

```python
import mindspore as ms
from mindone.diffusers import StableDiffusionPipeline

model_path = "path_to_saved_model"
model_path = "sd-onepiece-model"
pipe = StableDiffusionPipeline.from_pretrained(model_path, mindspore_dtype=ms.float16)

image = pipe(prompt="a man in a straw hat")[0][0]
@@ -105,7 +128,7 @@ Checkpoints only save the unet, so to run inference from a checkpoint, just load
import mindspore as ms
from mindone.diffusers import StableDiffusionPipeline, UNet2DConditionModel

model_path = "path_to_saved_model"
model_path = "sd-onepiece-model"
unet = UNet2DConditionModel.from_pretrained(model_path + "/checkpoint-<N>/unet", mindspore_dtype=ms.float16)

pipe = StableDiffusionPipeline.from_pretrained("<initial model>", unet=unet, mindspore_dtype=ms.float16)
@@ -142,9 +165,6 @@ In a nutshell, LoRA allows adapting pretrained models by adding pairs of rank-de

[cloneofsimo](https://github.com/cloneofsimo) was the first to try out LoRA training for Stable Diffusion in the popular [lora](https://github.com/cloneofsimo/lora) GitHub repository.
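
As a rough, self-contained illustration of the rank-decomposition idea (not the actual training code; the shapes, rank, and scaling below are placeholders), LoRA keeps the pretrained weight frozen and learns only two small matrices whose product is added to it:

```python
import numpy as np

d, k, r = 768, 768, 4              # weight shape and a small LoRA rank
W = np.random.randn(d, k)          # frozen pretrained weight
A = np.random.randn(r, k) * 0.01   # trainable down-projection
B = np.zeros((d, r))               # trainable up-projection, initialized to zero
scale = 1.0                        # plays the role of alpha / r in real implementations

W_adapted = W + scale * (B @ A)    # effective weight used at inference time
print(W.size, A.size + B.size)     # the adapter trains far fewer parameters
```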

### Training

First, you need to set up your development environment as is explained in the [installation section](#installing-the-dependencies). Make sure to set the `MODEL_NAME` and `DATASET_NAME` environment variables. Here, we will use [Stable Diffusion v1-4](https://hf.co/CompVis/stable-diffusion-v1-4) and the [OnePiece dataset](https://huggingface.co/datasets/YaYaB/onepiece-blip-captions).
@@ -170,30 +190,51 @@ python train_text_to_image_lora.py \
--output_dir="sd-onepiece-model-lora-$(date +%Y%m%d%H%M%S)"
```

For parallel training, use `msrun` together with the `--distributed` flag:

```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATASET_NAME="YaYaB/onepiece-blip-captions"

msrun --worker_num=8 --local_worker_num=8 --log_dir=$output_dir \
train_text_to_image_lora.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$DATASET_NAME \
--resolution=512 --center_crop --random_flip \
--train_batch_size=1 \
--num_train_epochs=100 --checkpointing_steps=5000 \
--learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
--mixed_precision="fp16" \
--seed=42 \
--distributed \
--validation_prompt="a man in a straw hat" \
--output_dir="sd-onepiece-model-lora-$(date +%Y%m%d%H%M%S)"
```

The above command will also run inference as fine-tuning progresses and log the results to local files.

**___Note: When using LoRA we can use a much higher learning rate compared to non-LoRA fine-tuning. Here we use *1e-4* instead of the usual *1e-5*.___**

The final LoRA embedding weights have been uploaded to [sayakpaul/sd-model-finetuned-lora-t4](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4). **___Note: [The final weights](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4/blob/main/pytorch_lora_weights.bin) are only 3 MB in size, which is orders of magnitude smaller than the original model.___**
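
If you want to verify that size locally, a quick check might look like this (a sketch assuming network access and the `huggingface_hub` package; the repository and filename are the ones linked above):

```python
import os

from huggingface_hub import hf_hub_download

# Download the LoRA weight file referenced above and report its size on disk.
path = hf_hub_download("sayakpaul/sd-model-finetuned-lora-t4", "pytorch_lora_weights.bin")
print(f"{os.path.getsize(path) / 1e6:.1f} MB")
```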

You can check some inference samples that were logged during the course of the fine-tuning process [here](https://wandb.ai/sayakpaul/text2image-fine-tune/runs/q4lc0xsw).

### Inference

Once you have trained a model using the above command, you can run inference with `StableDiffusionPipeline` after loading the trained LoRA weights. You need to pass the `output_dir` used for training, which in this case is `sd-onepiece-model-lora`. If the LoRA weights you want to use are hosted on the Hugging Face Hub, set `model_path` to the repository id instead, e.g. `model_path = "sayakpaul/sd-model-finetuned-lora-t4"`.

```python
import mindspore as ms
from mindone.diffusers import StableDiffusionPipeline

model_path = "sayakpaul/sd-model-finetuned-lora-t4"
model_path = "sd-onepiece-model-lora"
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", mindspore_dtype=ms.float16)
pipe.load_lora_weights(model_path)

prompt = "A pokemon with green eyes and red legs."
prompt = "a man in a straw hat"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5)[0][0]
image.save("pokemon.png")
image.save(f"a-man-in-a-straw-hat.png")
```

If you are loading the LoRA parameters from the Hub and if the Hub repository has
@@ -214,3 +255,7 @@ pipe = StableDiffusionPipeline.from_pretrained(base_model_id, mindspore_dtype=ms
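
A minimal sketch of that flow (assuming the Hub repository records a `base_model` field in its model card metadata; the repository id is the example used earlier):

```python
import mindspore as ms
from huggingface_hub import RepoCard
from mindone.diffusers import StableDiffusionPipeline

lora_model_id = "sayakpaul/sd-model-finetuned-lora-t4"
card = RepoCard.load(lora_model_id)
base_model_id = card.data.to_dict()["base_model"]  # base model declared in the model card

pipe = StableDiffusionPipeline.from_pretrained(base_model_id, mindspore_dtype=ms.float16)
pipe.load_lora_weights(lora_model_id)

image = pipe(prompt="a man in a straw hat")[0][0]
```
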
## Stable Diffusion XL

* We support fine-tuning the UNet shipped in [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) via the `train_text_to_image_sdxl.py` script. Please refer to the docs [here](./README_sdxl.md).

## Tutorials

Training performance and inference results for the examples above are recorded [here](https://github.com/liuchuting/tutorials/blob/sd_doc/aigc/diffusers/text_to_image/sd_performance_and_inference_results.md).
113 changes: 95 additions & 18 deletions examples/diffusers/text_to_image/README_sdxl.md
@@ -25,6 +25,8 @@ Then cd in the `examples/diffusers/text_to_image` folder and run
pip install -r requirements_sdxl.txt
```

The training script is compute-intensive and only runs on Ascend 910* devices. Please run the scripts with MindSpore 2.3.0.

### Training

```bash
@@ -36,6 +38,7 @@ python train_text_to_image_sdxl.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--pretrained_vae_model_name_or_path=$VAE_NAME \
--dataset_name=$DATASET_NAME \
--enable_xformers_memory_efficient_attention \
--resolution=512 --center_crop --random_flip \
--proportion_empty_prompts=0.2 \
--train_batch_size=1 \
@@ -47,10 +50,34 @@ python train_text_to_image_sdxl.py \
--output_dir="sdxl-onepiece-model-$(date +%Y%m%d%H%M%S)"
```

For parallel training, use `msrun` together with the `--distributed` flag:

```shell
export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export VAE_NAME="madebyollin/sdxl-vae-fp16-fix"
export DATASET_NAME="YaYaB/onepiece-blip-captions"

msrun --worker_num=8 --local_worker_num=8 --log_dir=$output_dir \
train_text_to_image_sdxl.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--pretrained_vae_model_name_or_path=$VAE_NAME \
--dataset_name=$DATASET_NAME \
--enable_xformers_memory_efficient_attention \
--resolution=512 --center_crop --random_flip \
--proportion_empty_prompts=0.2 \
--train_batch_size=1 \
--max_train_steps=10000 \
--learning_rate=1e-06 --lr_scheduler="constant" --lr_warmup_steps=0 \
--mixed_precision="fp16" \
--validation_prompt="a man in a green coat holding two swords" --validation_epochs 5 \
--checkpointing_steps=5000 \
--distributed \
--output_dir="sdxl-onepiece-model-$(date +%Y%m%d%H%M%S)"
```

**Notes**:

* The `train_text_to_image_sdxl.py` script pre-computes text embeddings and the VAE encodings and keeps them in memory. While for smaller datasets like [`lambdalabs/pokemon-blip-captions`](https://hf.co/datasets/lambdalabs/pokemon-blip-captions), it might not be a problem, it can definitely lead to memory problems when the script is used on a larger dataset. For those purposes, you would want to serialize these pre-computed representations to disk separately and load them during the fine-tuning process (a minimal caching sketch is shown after these notes). Refer to [this PR](https://github.com/huggingface/diffusers/pull/4505) for a more in-depth discussion.
* The training script is compute-intensive and only runs on an Ascend 910*.
* The training command shown above performs intermediate quality validation in between the training epochs. `--report_to`, `--validation_prompt`, and `--validation_epochs` are the relevant CLI arguments here.
* SDXL's VAE is known to suffer from numerical instability issues. This is why we also expose a CLI argument namely `--pretrained_vae_model_name_or_path` that lets you specify the location of a better VAE (such as [this one](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix)).
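
Below is a minimal, self-contained sketch of that caching idea (it is not part of `train_text_to_image_sdxl.py`; the directory name, array shapes, and dtypes are placeholders): pre-compute each sample's representations once, write them to disk, and load them sample by sample during fine-tuning instead of keeping everything in memory.

```python
import os

import numpy as np

cache_dir = "precomputed_cache"  # placeholder location for the cached representations
os.makedirs(cache_dir, exist_ok=True)

# Stand-ins for the real per-sample text embeddings and VAE latents.
for idx in range(4):
    prompt_embeds = np.random.randn(77, 2048).astype(np.float16)
    latents = np.random.randn(4, 64, 64).astype(np.float16)
    np.savez(os.path.join(cache_dir, f"{idx}.npz"), prompt_embeds=prompt_embeds, latents=latents)

# During fine-tuning, load one cached sample at a time instead of holding the whole dataset.
sample = np.load(os.path.join(cache_dir, "0.npz"))
print(sample["prompt_embeds"].shape, sample["latents"].shape)
```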

@@ -60,14 +87,29 @@ python train_text_to_image_sdxl.py \
from mindone.diffusers import DiffusionPipeline
import mindspore

model_path = "you-model-id-goes-here" # <-- change this
model_path = "sdxl-onepiece-model" # <-- change this
pipe = DiffusionPipeline.from_pretrained(model_path, mindspore_dtype=mindspore.float16)

prompt = "a man in a green coat holding two swords"
prompt = "a man with a beard and a shirt"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5)[0][0]
image.save("onepiece.png")
```

Checkpoints only save the unet, so to run inference from a checkpoint, just load the unet.

```python
import mindspore as ms
from mindone.diffusers import StableDiffusionXLPipeline, UNet2DConditionModel

model_path = "sdxl-onepiece-model"
unet = UNet2DConditionModel.from_pretrained(model_path + "/checkpoint-<N>/unet", mindspore_dtype=ms.float16)

pipe = StableDiffusionXLPipeline.from_pretrained("<initial model>", unet=unet, mindspore_dtype=ms.float16)

image = pipe(prompt="a man with a beard and a shirt")[0][0]
image.save("onepiece.png")
```

## LoRA training example for Stable Diffusion XL (SDXL)

Low-Rank Adaption of Large Language Models was first introduced by Microsoft in [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) by *Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen*.
@@ -80,16 +122,6 @@ In a nutshell, LoRA allows adapting pretrained models by adding pairs of rank-de

[cloneofsimo](https://github.com/cloneofsimo) was the first to try out LoRA training for Stable Diffusion in the popular [lora](https://github.com/cloneofsimo/lora) GitHub repository.

### Training

First, you need to set up your development environment as is explained in the [installation section](#installing-the-dependencies). Make sure to set the `MODEL_NAME` and `DATASET_NAME` environment variables and, optionally, the `VAE_NAME` variable. Here, we will use [Stable Diffusion XL 1.0-base](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) and the [OnePiece dataset](https://huggingface.co/datasets/YaYaB/onepiece-blip-captions).
@@ -115,6 +147,29 @@ python train_text_to_image_lora_sdxl.py \
--output_dir="sdxl-onepiece-model-lora-$(date +%Y%m%d%H%M%S)"
```

For parallel training, use `msrun` together with the `--distributed` flag:

```shell
export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export VAE_NAME="madebyollin/sdxl-vae-fp16-fix"
export DATASET_NAME="YaYaB/onepiece-blip-captions"

msrun --worker_num=8 --local_worker_num=8 --log_dir=$output_dir \
train_text_to_image_lora_sdxl.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--pretrained_vae_model_name_or_path=$VAE_NAME \
--dataset_name=$DATASET_NAME \
--resolution=1024 --center_crop --random_flip \
--train_batch_size=1 \
--num_train_epochs=2 --checkpointing_steps=500 \
--learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
--mixed_precision="fp16" \
--seed=42 \
--validation_prompt="a man in a green coat holding two swords" \
--distributed \
--output_dir="sdxl-onepiece-model-lora-$(date +%Y%m%d%H%M%S)"
```

The above command will also run inference as fine-tuning progresses and log the results to local files.

**Notes**:
@@ -140,23 +195,45 @@ python train_text_to_image_lora_sdxl.py \
--seed=42 \
--validation_prompt="a man in a green coat holding two swords" \
--train_text_encoder \
--output_dir="sdxl-onepiece-model-lora-txt-$(date +%Y%m%d%H%M%S)"
--output_dir="sdxl-onepiece-model-lora-$(date +%Y%m%d%H%M%S)"
```

For parallel training, use `msrun` together with the `--distributed` flag:

```shell
msrun --worker_num=8 --local_worker_num=8 --log_dir=$output_dir \
train_text_to_image_lora_sdxl.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$DATASET_NAME \
--resolution=1024 --center_crop --random_flip \
--train_batch_size=1 \
--num_train_epochs=2 --checkpointing_steps=500 \
--learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
--seed=42 \
--validation_prompt="a man in a green coat holding two swords" \
--train_text_encoder \
--distributed \
--output_dir="sdxl-onepiece-model-lora-$(date +%Y%m%d%H%M%S)"
```

### Inference

Once you have trained a model using the above command, you can run inference with `DiffusionPipeline` after loading the trained LoRA weights. You need to pass the `output_dir` used for training, which in this case is `sdxl-onepiece-model-lora`. If the LoRA weights you want to use are hosted on the Hugging Face Hub, set `model_path` to the repository id instead, e.g. `model_path = "takuoko/sd-pokemon-model-lora-sdxl"`.

```python
import mindspore as ms
from mindone.diffusers import DiffusionPipeline

model_path = "takuoko/sd-pokemon-model-lora-sdxl"
model_path = "sdxl-onepiece-model-lora"
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16)
pipe.load_lora_weights(model_path)

prompt = "A pokemon with green eyes and red legs."
prompt = "a guy with green hair"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5)[0][0]
image.save("pokemon.png")
image.save("onepiece.png")
```

## Tutorials

Training performance and inference results for the examples above are recorded [here](https://github.com/liuchuting/tutorials/blob/sd_doc/aigc/diffusers/text_to_image/sdxl_performance_and_inference_results.md).