diff --git a/docs/diffusers/_toctree.yml b/docs/diffusers/_toctree.yml index fbbd530b41..503ef9f223 100644 --- a/docs/diffusers/_toctree.yml +++ b/docs/diffusers/_toctree.yml @@ -20,3 +20,218 @@ - local: tutorials/using_peft_for_inference title: Load LoRAs for inference title: Tutorials +- sections: + - local: using-diffusers/loading_overview + title: Overview + - local: using-diffusers/loading + title: Load pipelines + - local: using-diffusers/schedulers + title: Load schedulers and models + - local: using-diffusers/other-formats + title: Model files and layouts + - local: using-diffusers/loading_adapters + title: Load adapters + - local: using-diffusers/push_to_hub + title: Push files to the Hub + title: Load pipelines and adapters +- sections: + - local: using-diffusers/unconditional_image_generation + title: Unconditional image generation + - local: using-diffusers/conditional_image_generation + title: Text-to-image + - local: using-diffusers/img2img + title: Image-to-image + - local: using-diffusers/inpaint + title: Inpainting + - local: using-diffusers/text-img2vid + title: Text or image-to-video + - local: using-diffusers/depth2img + title: Depth-to-image + title: Generative tasks +- sections: + - local: using-diffusers/overview_techniques + title: Overview + - local: using-diffusers/merge_loras + title: Merge LoRAs + - local: using-diffusers/scheduler_features + title: Scheduler features + - local: using-diffusers/callback + title: Pipeline callbacks + - local: using-diffusers/reusing_seeds + title: Reproducible pipelines + title: Inference techniques +- sections: + - local: using-diffusers/sdxl + title: Stable Diffusion XL + - local: using-diffusers/sdxl_turbo + title: SDXL Turbo + - local: using-diffusers/kandinsky + title: Kandinsky + - local: using-diffusers/ip_adapter + title: IP-Adapter + - local: using-diffusers/controlnet + title: ControlNet + - local: using-diffusers/t2i_adapter + title: T2I-Adapter + - local: using-diffusers/inference_with_lcm + title: Latent Consistency Model + - local: using-diffusers/textual_inversion_inference + title: Textual inversion + - local: using-diffusers/shap-e + title: Shap-E + - local: using-diffusers/diffedit + title: DiffEdit + - local: using-diffusers/inference_with_tcd_lora + title: Trajectory Consistency Distillation-LoRA + - local: using-diffusers/svd + title: Stable Video Diffusion + - local: using-diffusers/marigold_usage + title: Marigold Computer Vision + title: Specific pipeline examples +- sections: + - local: training/overview + title: Overview + - local: training/create_dataset + title: Create a dataset for training + - local: training/adapt_a_model + title: Adapt a model to a new task + - isExpanded: false + sections: + - local: training/unconditional_training + title: Unconditional image generation + - local: training/text2image + title: Text-to-image + - local: training/sdxl + title: Stable Diffusion XL + - local: training/controlnet + title: ControlNet + title: Models + - isExpanded: false + sections: + - local: training/text_inversion + title: Textual Inversion + - local: training/dreambooth + title: DreamBooth + - local: training/lora + title: LoRA + title: Methods + title: Training +- sections: + - local: optimization/fp16 + title: Speed up inference + - local: optimization/memory + title: Reduce memory usage + - local: optimization/xformers + title: xFormers + title: Accelerate inference and reduce memory +- sections: + - local: conceptual/philosophy + title: Philosophy + - local: using-diffusers/controlling_generation 
+ title: Controlled generation + title: Conceptual Guides +- sections: + - isExpanded: false + sections: + - local: api/configuration + title: Configuration + - local: api/logging + title: Logging + - local: api/outputs + title: Outputs + title: Main Classes + - isExpanded: false + sections: + - sections: + - local: api/pipelines/stable_diffusion/overview + title: Overview + - local: api/pipelines/stable_diffusion/text2img + title: Text-to-image + - local: api/pipelines/stable_diffusion/img2img + title: Image-to-image + - local: api/pipelines/stable_diffusion/svd + title: Image-to-video + - local: api/pipelines/stable_diffusion/inpaint + title: Inpainting + - local: api/pipelines/stable_diffusion/depth2img + title: Depth-to-image + - local: api/pipelines/stable_diffusion/image_variation + title: Image variation + - local: api/pipelines/stable_diffusion/stable_diffusion_2 + title: Stable Diffusion 2 + - local: api/pipelines/stable_diffusion/stable_diffusion_3 + title: Stable Diffusion 3 + - local: api/pipelines/stable_diffusion/stable_diffusion_xl + title: Stable Diffusion XL + - local: api/pipelines/stable_diffusion/sdxl_turbo + title: SDXL Turbo + - local: api/pipelines/stable_diffusion/latent_upscale + title: Latent upscaler + - local: api/pipelines/stable_diffusion/upscale + title: Super-resolution + - local: api/pipelines/stable_diffusion/adapter + title: T2I-Adapter + - local: api/pipelines/stable_diffusion/gligen + title: GLIGEN (Grounded Language-to-Image Generation) + title: Stable Diffusion + title: Pipelines + - isExpanded: false + sections: + - local: api/schedulers/overview + title: Overview + - local: api/schedulers/cm_stochastic_iterative + title: CMStochasticIterativeScheduler + - local: api/schedulers/consistency_decoder + title: ConsistencyDecoderScheduler + - local: api/schedulers/ddim_inverse + title: DDIMInverseScheduler + - local: api/schedulers/ddim + title: DDIMScheduler + - local: api/schedulers/ddpm + title: DDPMScheduler + - local: api/schedulers/deis + title: DEISMultistepScheduler + - local: api/schedulers/multistep_dpm_solver_inverse + title: DPMSolverMultistepInverse + - local: api/schedulers/multistep_dpm_solver + title: DPMSolverMultistepScheduler + - local: api/schedulers/singlestep_dpm_solver + title: DPMSolverSinglestepScheduler + - local: api/schedulers/edm_multistep_dpm_solver + title: EDMDPMSolverMultistepScheduler + - local: api/schedulers/edm_euler + title: EDMEulerScheduler + - local: api/schedulers/euler_ancestral + title: EulerAncestralDiscreteScheduler + - local: api/schedulers/euler + title: EulerDiscreteScheduler + - local: api/schedulers/flow_match_euler_discrete + title: FlowMatchEulerDiscreteScheduler + - local: api/schedulers/heun + title: HeunDiscreteScheduler + - local: api/schedulers/ipndm + title: IPNDMScheduler + - local: api/schedulers/dpm_discrete_ancestral + title: KDPM2AncestralDiscreteScheduler + - local: api/schedulers/dpm_discrete + title: KDPM2DiscreteScheduler + - local: api/schedulers/lcm + title: LCMScheduler + - local: api/schedulers/lms_discrete + title: LMSDiscreteScheduler + - local: api/schedulers/pndm + title: PNDMScheduler + - local: api/schedulers/repaint + title: RePaintScheduler + - local: api/schedulers/score_sde_ve + title: ScoreSdeVeScheduler + - local: api/schedulers/score_sde_vp + title: ScoreSdeVpScheduler + - local: api/schedulers/tcd + title: TCDScheduler + - local: api/schedulers/unipc + title: UniPCMultistepScheduler + - local: api/schedulers/vq_diffusion + title: VQDiffusionScheduler + title: Schedulers 
+ title: API diff --git a/docs/diffusers/api/configuration.md b/docs/diffusers/api/configuration.md new file mode 100644 index 0000000000..2fd12b5bf9 --- /dev/null +++ b/docs/diffusers/api/configuration.md @@ -0,0 +1,28 @@ + + +# Configuration + +Schedulers from [`SchedulerMixin`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/overview/#mindone.diffusers.SchedulerMixin) and models from [`ModelMixin`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/overview/#mindone.diffusers.ModelMixin) inherit from [`ConfigMixin`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/configuration/#mindone.diffusers.configuration_utils.ConfigMixin) which stores all the parameters that are passed to their respective `__init__` methods in a JSON-configuration file. + +!!! tip + + To use private or [gated](https://huggingface.co/docs/hub/models-gated#gated-models) models, log-in with `huggingface-cli login`. + +::: mindone.diffusers.configuration_utils.ConfigMixin + options: + members: + - load_config + - from_config + - save_config + - to_json_file + - to_json_string diff --git a/docs/diffusers/api/logging.md b/docs/diffusers/api/logging.md new file mode 100644 index 0000000000..82f92b2612 --- /dev/null +++ b/docs/diffusers/api/logging.md @@ -0,0 +1,121 @@ + + +# Logging + +🤗 Diffusers has a centralized logging system to easily manage the verbosity of the library. The default verbosity is set to `WARNING`. + +To change the verbosity level, use one of the direct setters. For instance, to change the verbosity to the `INFO` level. + +```python +import mindone.diffusers + +mindone.diffusers.logging.set_verbosity_info() +``` + +You can also use the environment variable `DIFFUSERS_VERBOSITY` to override the default verbosity. You can set it +to one of the following: `debug`, `info`, `warning`, `error`, `critical`. For example: + +```bash +DIFFUSERS_VERBOSITY=error ./myprogram.py +``` + +Additionally, some `warnings` can be disabled by setting the environment variable +`DIFFUSERS_NO_ADVISORY_WARNINGS` to a true value, like `1`. This disables any warning logged by +[`logger.warning_advice`]. For example: + +```bash +DIFFUSERS_NO_ADVISORY_WARNINGS=1 ./myprogram.py +``` + +Here is an example of how to use the same logger as the library in your own module or script: + +```python +from mindone.diffusers.utils import logging + +logging.set_verbosity_info() +logger = logging.get_logger("diffusers") +logger.info("INFO") +logger.warning("WARN") +``` + +All methods of the logging module are documented below. The main methods are +[`get_verbosity`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/logging/#mindone.diffusers.utils.logging.get_verbosity) to get the current level of verbosity in the logger and +[`set_verbosity`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/logging/#mindone.diffusers.utils.logging.set_verbosity) to set the verbosity to the level of your choice. 
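+
+For example, a quick sketch (assuming the module-level level constants mirror Python's standard `logging` levels, as listed in the table below):
+
+```python
+from mindone.diffusers.utils import logging
+
+# read the current level as an integer (e.g. 30 for WARNING)
+current_level = logging.get_verbosity()
+
+# set it explicitly with one of the module-level constants
+logging.set_verbosity(logging.ERROR)
+```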
+ +In order from the least verbose to the most verbose: + +| Method | Integer value | Description | +|----------------------------------------------------------:|--------------:|----------------------------------------------------:| +| `diffusers.logging.CRITICAL` or `diffusers.logging.FATAL` | 50 | only report the most critical errors | +| `diffusers.logging.ERROR` | 40 | only report errors | +| `diffusers.logging.WARNING` or `diffusers.logging.WARN` | 30 | only report errors and warnings (default) | +| `diffusers.logging.INFO` | 20 | only report errors, warnings, and basic information | +| `diffusers.logging.DEBUG` | 10 | report all information | + +By default, `tqdm` progress bars are displayed during model download. [`disable_progress_bar`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/logging/#mindone.diffusers.utils.logging.disable_progress_bar) and [`enable_progress_bar`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/logging/#mindone.diffusers.utils.logging.enable_progress_bar) are used to enable or disable this behavior. + +## Base setters + +::: mindone.diffusers.utils.logging.set_verbosity_error + options: + heading_level: 3 + +::: mindone.diffusers.utils.logging.set_verbosity_warning + options: + heading_level: 3 + +::: mindone.diffusers.utils.logging.set_verbosity_info + options: + heading_level: 3 + +::: mindone.diffusers.utils.logging.set_verbosity_debug + options: + heading_level: 3 + +## Other functions + +::: mindone.diffusers.utils.logging.get_verbosity + options: + heading_level: 3 + +::: mindone.diffusers.utils.logging.set_verbosity + options: + heading_level: 3 + +::: mindone.diffusers.utils.logging.get_logger + options: + heading_level: 3 + +::: mindone.diffusers.utils.logging.enable_default_handler + options: + heading_level: 3 + +::: mindone.diffusers.utils.logging.disable_default_handler + options: + heading_level: 3 + +::: mindone.diffusers.utils.logging.enable_explicit_format + options: + heading_level: 3 + +::: mindone.diffusers.utils.logging.reset_format + options: + heading_level: 3 + +::: mindone.diffusers.utils.logging.enable_progress_bar + options: + heading_level: 3 + +::: mindone.diffusers.utils.logging.disable_progress_bar + options: + heading_level: 3 diff --git a/docs/diffusers/api/outputs.md b/docs/diffusers/api/outputs.md new file mode 100644 index 0000000000..93214e5f9c --- /dev/null +++ b/docs/diffusers/api/outputs.md @@ -0,0 +1,59 @@ + + +# Outputs + +!!! warning + + Default value of `return_dict` is changed to False and the outputs will be used as tuples, for `GRAPH_MODE` does not allow to construct an instance of it. + +All model outputs are subclasses of [`~utils.BaseOutput`], data structures containing all the information returned by the model. The outputs can also be used as tuples or dictionaries. + +For example: + +```python +from mindone.diffusers import DDIMPipeline + +pipeline = DDIMPipeline.from_pretrained("google/ddpm-cifar10-32") +outputs = pipeline() +``` + +The `outputs` object is a [`ImagePipelineOutput`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/ddpm/#mindone.diffusers.pipelines.pipeline_utils.ImagePipelineOutput) which means it has an image attribute. 
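+
+Note that because `return_dict` defaults to `False` (see the warning above), the call above actually returns a plain tuple rather than an `ImagePipelineOutput` unless you opt back in. A minimal sketch of both access patterns (passing `return_dict=True` is an assumption based on the pipeline signature, and is not supported in `GRAPH_MODE`):
+
+```python
+# with the default return_dict=False, index the tuple positionally
+image = outputs[0][0]
+
+# opt back in to the ImagePipelineOutput dataclass for attribute access
+outputs = pipeline(return_dict=True)
+image = outputs.images[0]
+```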
+ +::: mindone.diffusers.pipelines.pipeline_utils.ImagePipelineOutput + +You can access each attribute as you normally would or with a keyword lookup if you set `return_dict` to `True`, and if that attribute is not returned by the model, you will get `None`: + +```python +outputs.images +outputs["images"] +``` + +When considering the `outputs` object as a tuple, it only considers the attributes that don't have `None` values. +For instance, retrieving an image by indexing into it returns the tuple `(outputs.images)`: + +```python +outputs[:1] +``` + +!!! tip + + To check a specific pipeline or model output, refer to its corresponding API documentation. + +::: mindone.diffusers.utils.BaseOutput + options: + members: + - to_tuple + +::: mindone.diffusers.pipelines.pipeline_utils.ImagePipelineOutput + +::: mindone.diffusers.pipelines.pipeline_utils.AudioPipelineOutput diff --git a/docs/diffusers/api/pipelines/stable_diffusion/adapter.md b/docs/diffusers/api/pipelines/stable_diffusion/adapter.md new file mode 100644 index 0000000000..807cb74a7a --- /dev/null +++ b/docs/diffusers/api/pipelines/stable_diffusion/adapter.md @@ -0,0 +1,25 @@ + + +# T2I-Adapter + +[T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models](https://arxiv.org/abs/2302.08453) by Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, Xiaohu Qie. + +Using the pretrained models we can provide control images (for example, a depth map) to control Stable Diffusion text-to-image generation so that it follows the structure of the depth image and fills in the details. + +The abstract of the paper is the following: + +*The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate controlling (e.g., color and structure) is needed. In this paper, we aim to ``dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn simple and lightweight T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters according to different conditions, achieving rich control and editing effects in the color and structure of the generation results. Further, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter has promising generation quality and a wide range of applications.* + +::: mindone.diffusers.StableDiffusionAdapterPipeline + +::: mindone.diffusers.StableDiffusionXLAdapterPipeline diff --git a/docs/diffusers/api/pipelines/stable_diffusion/depth2img.md b/docs/diffusers/api/pipelines/stable_diffusion/depth2img.md new file mode 100644 index 0000000000..d525224fa4 --- /dev/null +++ b/docs/diffusers/api/pipelines/stable_diffusion/depth2img.md @@ -0,0 +1,25 @@ + + +# Depth-to-image + +The Stable Diffusion model can also infer depth based on an image using [MiDaS](https://github.com/isl-org/MiDaS). This allows you to pass a text prompt and an initial image to condition the generation of new images as well as a `depth_map` to preserve the image structure. + +!!! 
tip + + Make sure to check out the Stable Diffusion [Tips](overview.md#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! + + If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! + +::: mindone.diffusers.StableDiffusionDepth2ImgPipeline + +::: mindone.diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/docs/diffusers/api/pipelines/stable_diffusion/gligen.md b/docs/diffusers/api/pipelines/stable_diffusion/gligen.md new file mode 100644 index 0000000000..34b149fb02 --- /dev/null +++ b/docs/diffusers/api/pipelines/stable_diffusion/gligen.md @@ -0,0 +1,31 @@ + + +# GLIGEN (Grounded Language-to-Image Generation) + +The GLIGEN model was created by researchers and engineers from [University of Wisconsin-Madison, Columbia University, and Microsoft](https://github.com/gligen/GLIGEN). The [`StableDiffusionGLIGENPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/gligen/#mindone.diffusers.StableDiffusionGLIGENPipeline) and [`StableDiffusionGLIGENTextImagePipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/gligen/#mindone.diffusers.StableDiffusionGLIGENTextImagePipeline) can generate photorealistic images conditioned on grounding inputs. Along with text and bounding boxes with [`StableDiffusionGLIGENPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/gligen/#mindone.diffusers.StableDiffusionGLIGENPipeline), if input images are given, [`StableDiffusionGLIGENTextImagePipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/gligen/#mindone.diffusers.StableDiffusionGLIGENTextImagePipeline) can insert objects described by text at the region defined by bounding boxes. Otherwise, it'll generate an image described by the caption/prompt and insert objects described by text at the region defined by bounding boxes. It's trained on COCO2014D and COCO2014CD datasets, and the model uses a frozen CLIP ViT-L/14 text encoder to condition itself on grounding inputs. + +The abstract from the [paper](https://huggingface.co/papers/2301.07093) is: + +*Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGEN’s zeroshot performance on COCO and LVIS outperforms existing supervised layout-to-image baselines by a large margin.* + +!!! 
tip + + Make sure to check out the Stable Diffusion [Tips](overview.md#tips) section to learn how to explore the tradeoff between scheduler speed and quality and how to reuse pipeline components efficiently! + + If you want to use one of the official checkpoints for a task, explore the [gligen](https://huggingface.co/gligen) Hub organizations! + +::: mindone.diffusers.StableDiffusionGLIGENPipeline + +::: mindone.diffusers.StableDiffusionGLIGENTextImagePipeline + +::: mindone.diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/docs/diffusers/api/pipelines/stable_diffusion/image_variation.md b/docs/diffusers/api/pipelines/stable_diffusion/image_variation.md new file mode 100644 index 0000000000..0be7e9128f --- /dev/null +++ b/docs/diffusers/api/pipelines/stable_diffusion/image_variation.md @@ -0,0 +1,25 @@ + + +# Image variation + +The Stable Diffusion model can also generate variations from an input image. It uses a fine-tuned version of a Stable Diffusion model by [Justin Pinkney](https://www.justinpinkney.com/) from [Lambda](https://lambdalabs.com/). + +The original codebase can be found at [LambdaLabsML/lambda-diffusers](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations) and additional official checkpoints for image variation can be found at [lambdalabs/sd-image-variations-diffusers](https://huggingface.co/lambdalabs/sd-image-variations-diffusers). + +!!! tip + + Make sure to check out the Stable Diffusion [Tips](./overview.md#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! + +::: mindone.diffusers.StableDiffusionImageVariationPipeline + +::: mindone.diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/docs/diffusers/api/pipelines/stable_diffusion/img2img.md b/docs/diffusers/api/pipelines/stable_diffusion/img2img.md new file mode 100644 index 0000000000..75aab875c6 --- /dev/null +++ b/docs/diffusers/api/pipelines/stable_diffusion/img2img.md @@ -0,0 +1,29 @@ + + +# Image-to-image + +The Stable Diffusion model can also be applied to image-to-image generation by passing a text prompt and an initial image to condition the generation of new images. + +The [`StableDiffusionImg2ImgPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/img2img/#mindone.diffusers.StableDiffusionImg2ImgPipeline) uses the diffusion-denoising mechanism proposed in [SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations](https://huggingface.co/papers/2108.01073) by Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, Stefano Ermon. + +The abstract from the paper is: + +*Guided image synthesis enables everyday users to create and edit photo-realistic images with minimum effort. The key challenge is balancing faithfulness to the user input (e.g., hand-drawn colored strokes) and realism of the synthesized image. Existing GAN-based methods attempt to achieve such balance using either conditional GANs or GAN inversions, which are challenging and often require additional training data or loss functions for individual applications. To address these issues, we introduce a new image synthesis and editing method, Stochastic Differential Editing (SDEdit), based on a diffusion model generative prior, which synthesizes realistic images by iteratively denoising through a stochastic differential equation (SDE). 
Given an input image with user guide of any type, SDEdit first adds noise to the input, then subsequently denoises the resulting image through the SDE prior to increase its realism. SDEdit does not require task-specific training or inversions and can naturally achieve the balance between realism and faithfulness. SDEdit significantly outperforms state-of-the-art GAN-based methods by up to 98.09% on realism and 91.72% on overall satisfaction scores, according to a human perception study, on multiple tasks, including stroke-based image synthesis and editing as well as image compositing.* + +!!! tip + + Make sure to check out the Stable Diffusion [Tips](overview.md#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! + +::: mindone.diffusers.StableDiffusionImg2ImgPipeline + +::: mindone.diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/docs/diffusers/api/pipelines/stable_diffusion/inpaint.md b/docs/diffusers/api/pipelines/stable_diffusion/inpaint.md new file mode 100644 index 0000000000..4870ea7631 --- /dev/null +++ b/docs/diffusers/api/pipelines/stable_diffusion/inpaint.md @@ -0,0 +1,32 @@ + + +# Inpainting + +The Stable Diffusion model can also be applied to inpainting which lets you edit specific parts of an image by providing a mask and a text prompt using Stable Diffusion. + +## Tips + +It is recommended to use this pipeline with checkpoints that have been specifically fine-tuned for inpainting, such +as [stable-diffusion-v1-5/stable-diffusion-inpainting](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-inpainting). Default +text-to-image Stable Diffusion checkpoints, such as +[stable-diffusion-v1-5/stable-diffusion-v1-5](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) are also compatible but they might be less performant. + +!!! tip + + Make sure to check out the Stable Diffusion [Tips](overview.md#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! + + If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! + +::: mindone.diffusers.StableDiffusionInpaintPipeline + +::: mindone.diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/docs/diffusers/api/pipelines/stable_diffusion/latent_upscale.md b/docs/diffusers/api/pipelines/stable_diffusion/latent_upscale.md new file mode 100644 index 0000000000..9cf38c3b84 --- /dev/null +++ b/docs/diffusers/api/pipelines/stable_diffusion/latent_upscale.md @@ -0,0 +1,25 @@ + + +# Latent upscaler + +The Stable Diffusion latent upscaler model was created by [Katherine Crowson](https://github.com/crowsonkb/k-diffusion) in collaboration with [Stability AI](https://stability.ai/). It is used to enhance the output image resolution by a factor of 2. + +!!! tip + + Make sure to check out the Stable Diffusion [Tips](overview.md#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! + + If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! 
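+
+A minimal two-stage sketch, carried over from the upstream Diffusers example (the `stabilityai/sd-x2-latent-upscaler` checkpoint and the latent hand-off are assumptions from that guide, not verified here):
+
+```python
+import mindspore as ms
+from mindone.diffusers import StableDiffusionPipeline, StableDiffusionLatentUpscalePipeline
+
+pipeline = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", mindspore_dtype=ms.float16)
+upscaler = StableDiffusionLatentUpscalePipeline.from_pretrained("stabilityai/sd-x2-latent-upscaler", mindspore_dtype=ms.float16)
+
+prompt = "a photo of an astronaut riding a horse on mars"
+
+# keep the text-to-image result in latent space and hand it to the upscaler
+low_res_latents = pipeline(prompt, output_type="latent")[0]
+
+upscaled_image = upscaler(prompt=prompt, image=low_res_latents, num_inference_steps=20, guidance_scale=0)[0][0]
+upscaled_image.save("astronaut_1024.png")
+```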
+ +::: mindone.diffusers.StableDiffusionLatentUpscalePipeline + +::: mindone.diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/docs/diffusers/api/pipelines/stable_diffusion/overview.md b/docs/diffusers/api/pipelines/stable_diffusion/overview.md new file mode 100644 index 0000000000..c3e216eb34 --- /dev/null +++ b/docs/diffusers/api/pipelines/stable_diffusion/overview.md @@ -0,0 +1,58 @@ + + +# Stable Diffusion pipelines + +Stable Diffusion is a text-to-image latent diffusion model created by the researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/) and [LAION](https://laion.ai/). Latent diffusion applies the diffusion process over a lower dimensional latent space to reduce memory and compute complexity. This specific type of diffusion model was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://huggingface.co/papers/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer. + +Stable Diffusion is trained on 512x512 images from a subset of the LAION-5B dataset. This model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight. + +For more details about how Stable Diffusion works and how it differs from the base latent diffusion model, take a look at the Stability AI [announcement](https://stability.ai/blog/stable-diffusion-announcement) and our own [blog post](https://huggingface.co/blog/stable_diffusion#how-does-stable-diffusion-work) for more technical details. + +You can find the original codebase for Stable Diffusion v1.0 at [CompVis/stable-diffusion](https://github.com/CompVis/stable-diffusion) and Stable Diffusion v2.0 at [Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion) as well as their original scripts for various tasks. Additional official checkpoints for the different Stable Diffusion versions and tasks can be found on the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations. Explore these organizations to find the best checkpoint for your use-case! + +## Tips + +To help you get the most out of the Stable Diffusion pipelines, here are a few tips for improving performance and usability. These tips are applicable to all Stable Diffusion pipelines. + +### Explore tradeoff between speed and quality + +[`StableDiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/text2img/#mindone.diffusers.StableDiffusionPipeline) uses the [`PNDMScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/pndm/#mindone.diffusers.PNDMScheduler) by default, but 🤗 Diffusers provides many other schedulers (some of which are faster or output better quality) that are compatible. 
For example, if you want to use the [`EulerDiscreteScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/euler/#mindone.diffusers.EulerDiscreteScheduler) instead of the default: + +```py +from mindone.diffusers import StableDiffusionPipeline, EulerDiscreteScheduler + +pipeline = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4") +pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config) + +# or +euler_scheduler = EulerDiscreteScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler") +pipeline = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", scheduler=euler_scheduler) +``` + +### Reuse pipeline components to save memory + +To save memory and use the same components across multiple pipelines, use the `.components` method to avoid loading weights into RAM more than once. + +```py +from mindone.diffusers import ( + StableDiffusionPipeline, + StableDiffusionImg2ImgPipeline, + StableDiffusionInpaintPipeline, +) + +text2img = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4") +img2img = StableDiffusionImg2ImgPipeline(**text2img.components) +inpaint = StableDiffusionInpaintPipeline(**text2img.components) + +# now you can use text2img(...), img2img(...), inpaint(...) just like the call methods of each respective pipeline +``` diff --git a/docs/diffusers/api/pipelines/stable_diffusion/sdxl_turbo.md b/docs/diffusers/api/pipelines/stable_diffusion/sdxl_turbo.md new file mode 100644 index 0000000000..693400dbf2 --- /dev/null +++ b/docs/diffusers/api/pipelines/stable_diffusion/sdxl_turbo.md @@ -0,0 +1,33 @@ + + +# SDXL Turbo + +Stable Diffusion XL (SDXL) Turbo was proposed in [Adversarial Diffusion Distillation](https://stability.ai/research/adversarial-diffusion-distillation) by Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. + +The abstract from the paper is: + +*We introduce Adversarial Diffusion Distillation (ADD), a novel training approach that efficiently samples large-scale foundational image diffusion models in just 1–4 steps while maintaining high image quality. We use score distillation to leverage large-scale off-the-shelf image diffusion models as a teacher signal in combination with an adversarial loss to ensure high image fidelity even in the low-step regime of one or two sampling steps. Our analyses show that our model clearly outperforms existing few-step methods (GANs,Latent Consistency Models) in a single step and reaches the performance of state-of-the-art diffusion models (SDXL) in only four steps. ADD is the first method to unlock single-step, real-time image synthesis with foundation models.* + +## Tips + +- SDXL Turbo uses the exact same architecture as [SDXL](./stable_diffusion_xl.md), which means it also has the same API. Please refer to the [SDXL](./stable_diffusion_xl.md) API reference for more details. +- SDXL Turbo should disable guidance scale by setting `guidance_scale=0.0`. +- SDXL Turbo should use `timestep_spacing='trailing'` for the scheduler and use between 1 and 4 steps. +- SDXL Turbo has been trained to generate images of size 512x512. +- SDXL Turbo is open-access, but not open-source meaning that one might have to buy a model license in order to use it for commercial applications. Make sure to read the [official model card](https://huggingface.co/stabilityai/sdxl-turbo) to learn more. + +!!! 
tip + + To learn how to use SDXL Turbo for various tasks, how to optimize performance, and other usage examples, take a look at the [SDXL Turbo](../../../using-diffusers/sdxl_turbo.md) guide. + + Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the official base and refiner model checkpoints! diff --git a/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_2.md b/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_2.md new file mode 100644 index 0000000000..0d904a4192 --- /dev/null +++ b/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_2.md @@ -0,0 +1,121 @@ + + +# Stable Diffusion 2 + +Stable Diffusion 2 is a text-to-image _latent diffusion_ model built upon the work of the original [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release), and it was led by Robin Rombach and Katherine Crowson from [Stability AI](https://stability.ai/) and [LAION](https://laion.ai/). + +*The Stable Diffusion 2.0 release includes robust text-to-image models trained using a brand new text encoder (OpenCLIP), developed by LAION with support from Stability AI, which greatly improves the quality of the generated images compared to earlier V1 releases. The text-to-image models in this release can generate images with default resolutions of both 512x512 pixels and 768x768 pixels. +These models are trained on an aesthetic subset of the [LAION-5B dataset](https://laion.ai/blog/laion-5b/) created by the DeepFloyd team at Stability AI, which is then further filtered to remove adult content using [LAION’s NSFW filter](https://openreview.net/forum?id=M3Y74vmsMcY).* + +For more details about how Stable Diffusion 2 works and how it differs from the original Stable Diffusion, please refer to the official [announcement post](https://stability.ai/blog/stable-diffusion-v2-release). + +The architecture of Stable Diffusion 2 is more or less identical to the original [Stable Diffusion model](./text2img.md) so check out it's API documentation for how to use Stable Diffusion 2. We recommend using the [`DPMSolverMultistepScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/multistep_dpm_solver/#mindone.diffusers.DPMSolverMultistepScheduler) as it gives a reasonable speed/quality trade-off and can be run with as little as 20 steps. + +Stable Diffusion 2 is available for tasks like text-to-image, inpainting, super-resolution, and depth-to-image: + +| Task | Repository | +|-------------------------|---------------------------------------------------------------------------------------------------------------| +| text-to-image (512x512) | [stabilityai/stable-diffusion-2-base](https://huggingface.co/stabilityai/stable-diffusion-2-base) | +| text-to-image (768x768) | [stabilityai/stable-diffusion-2](https://huggingface.co/stabilityai/stable-diffusion-2) | +| inpainting | [stabilityai/stable-diffusion-2-inpainting](https://huggingface.co/stabilityai/stable-diffusion-2-inpainting) | +| super-resolution | [stable-diffusion-x4-upscaler](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler) | +| depth-to-image | [stabilityai/stable-diffusion-2-depth](https://huggingface.co/stabilityai/stable-diffusion-2-depth) | + +Here are some examples for how to use Stable Diffusion 2 for each task: + +!!! tip + + Make sure to check out the Stable Diffusion [Tips](overview.md#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! 
+ + If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! + + +## Text-to-image + +```py +from mindone.diffusers import DiffusionPipeline, DPMSolverMultistepScheduler +import mindspore as ms + +repo_id = "stabilityai/stable-diffusion-2-base" +pipe = DiffusionPipeline.from_pretrained(repo_id, mindspore_dtype=ms.float16, revision="fp16") + +pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) + +prompt = "High quality photo of an astronaut riding a horse in space" +image = pipe(prompt, num_inference_steps=25)[0][0] +image +``` + +## Inpainting + +```py +import mindspore as ms +from mindone.diffusers import DiffusionPipeline, DPMSolverMultistepScheduler +from mindone.diffusers.utils import load_image, make_image_grid + +img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" +mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" + +init_image = load_image(img_url).resize((512, 512)) +mask_image = load_image(mask_url).resize((512, 512)) + +repo_id = "stabilityai/stable-diffusion-2-inpainting" +pipe = DiffusionPipeline.from_pretrained(repo_id, mindspore_dtype=ms.float16, revision="fp16") + +pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) + +prompt = "Face of a yellow cat, high resolution, sitting on a park bench" +image = pipe(prompt=prompt, image=init_image, mask_image=mask_image, num_inference_steps=25)[0][0] +make_image_grid([init_image, mask_image, image], rows=1, cols=3) +``` + +## Super-resolution + +```py +from mindone.diffusers import StableDiffusionUpscalePipeline +from mindone.diffusers.utils import load_image, make_image_grid +import mindspore as ms + +# load model and scheduler +model_id = "stabilityai/stable-diffusion-x4-upscaler" +pipeline = StableDiffusionUpscalePipeline.from_pretrained(model_id, mindspore_dtype=ms.float16) + +# let's download an image +url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-upscale/low_res_cat.png" +low_res_img = load_image(url) +low_res_img = low_res_img.resize((128, 128)) +prompt = "a white cat" +upscaled_image = pipeline(prompt=prompt, image=low_res_img)[0][0] +make_image_grid([low_res_img.resize((512, 512)), upscaled_image.resize((512, 512))], rows=1, cols=2) +``` + +## Depth-to-image + +```py +import mindspore as ms +from mindone.diffusers import StableDiffusionDepth2ImgPipeline +from mindone.diffusers.utils import load_image, make_image_grid + +pipe = StableDiffusionDepth2ImgPipeline.from_pretrained( + "stabilityai/stable-diffusion-2-depth", + mindspore_dtype=ms.float16, +) + + +url = "http://images.cocodataset.org/val2017/000000039769.jpg" +init_image = load_image(url) +prompt = "two tigers" +negative_prompt = "bad, deformed, ugly, bad anotomy" +image = pipe(prompt=prompt, image=init_image, negative_prompt=negative_prompt, strength=0.7)[0][0] +make_image_grid([init_image, image], rows=1, cols=2) +``` diff --git a/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_3.md b/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_3.md new file mode 100644 index 0000000000..2582caee0b --- /dev/null +++ b/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_3.md @@ -0,0 
+1,125 @@ + + +# Stable Diffusion 3 + +Stable Diffusion 3 (SD3) was proposed in [Scaling Rectified Flow Transformers for High-Resolution Image Synthesis](https://arxiv.org/pdf/2403.03206.pdf) by Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Muller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. + +The abstract from the paper is: + +*Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations.* + +## Usage Example + +_As the model is gated, before using it with diffusers you first need to go to the [Stable Diffusion 3 Medium Hugging Face page](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers), fill in the form and accept the gate. Once you are in, you need to login so that your system knows you’ve accepted the gate._ + +Use the command below to log in: + +```bash +huggingface-cli login +``` + +!!! tip + + The SD3 pipeline uses three text encoders to generate an image. Model offloading is necessary in order for it to run on most commodity hardware. Please use the `ms.float16` data type for additional memory savings. + +```python +import mindspore as ms +from mindone.diffusers import StableDiffusion3Pipeline + +pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", mindspore_dtype=ms.float16) + +image = pipe( + prompt="a photo of a cat holding a sign that says hello world", + negative_prompt="", + num_inference_steps=28, + height=1024, + width=1024, + guidance_scale=7.0, +)[0][0] + +image.save("sd3_hello_world.png") +``` + +### Dropping the T5 Text Encoder during Inference + +Removing the memory-intensive 4.7B parameter T5-XXL text encoder during inference can significantly decrease the memory requirements for SD3 with only a slight loss in performance. 
+
+```python
+import mindspore as ms
+from mindone.diffusers import StableDiffusion3Pipeline
+
+pipe = StableDiffusion3Pipeline.from_pretrained(
+    "stabilityai/stable-diffusion-3-medium-diffusers",
+    text_encoder_3=None,
+    tokenizer_3=None,
+    mindspore_dtype=ms.float16
+)
+
+image = pipe(
+    prompt="a photo of a cat holding a sign that says hello world",
+    negative_prompt="",
+    num_inference_steps=28,
+    height=1024,
+    width=1024,
+    guidance_scale=7.0,
+)[0][0]
+
+image.save("sd3_hello_world-no-T5.png")
+```
+
+## Loading the original checkpoints via `from_single_file`
+
+The `SD3Transformer2DModel` and `StableDiffusion3Pipeline` classes support loading the original checkpoints via the `from_single_file` method. This method allows you to load the original checkpoint files that were used to train the models.
+
+### Loading the original checkpoints for the `SD3Transformer2DModel`
+
+```python
+from mindone.diffusers import SD3Transformer2DModel
+
+model = SD3Transformer2DModel.from_single_file("https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/sd3_medium.safetensors")
+```
+
+### Loading the single checkpoint for the `StableDiffusion3Pipeline`
+
+#### Loading the single file checkpoint without T5
+
+```python
+import mindspore as ms
+from mindone.diffusers import StableDiffusion3Pipeline
+
+pipe = StableDiffusion3Pipeline.from_single_file(
+    "https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/sd3_medium_incl_clips.safetensors",
+    mindspore_dtype=ms.float16,
+    text_encoder_3=None
+)
+
+image = pipe("a picture of a cat holding a sign that says hello world")[0][0]
+image.save('sd3-single-file.png')
+```
+
+#### Loading the single file checkpoint with T5
+
+```python
+import mindspore as ms
+from mindone.diffusers import StableDiffusion3Pipeline
+
+pipe = StableDiffusion3Pipeline.from_single_file(
+    "https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/sd3_medium_incl_clips_t5xxlfp8.safetensors",
+    mindspore_dtype=ms.float16,
+)
+
+image = pipe("a picture of a cat holding a sign that says hello world")[0][0]
+image.save('sd3-single-file-t5-fp8.png')
+```
+
+::: mindone.diffusers.StableDiffusion3Pipeline
diff --git a/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl.md b/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl.md
new file mode 100644
index 0000000000..c9db51188c
--- /dev/null
+++ b/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl.md
@@ -0,0 +1,41 @@
+
+
+# Stable Diffusion XL
+
+Stable Diffusion XL (SDXL) was proposed in [SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis](https://huggingface.co/papers/2307.01952) by Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach.
+
+The abstract from the paper is:
+
+*We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique.
We demonstrate that SDXL shows drastically improved performance compared the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators.* + +## Tips + +- Using SDXL with a DPM++ scheduler for less than 50 steps is known to produce [visual artifacts](https://github.com/huggingface/diffusers/issues/5433) because the solver becomes numerically unstable. To fix this issue, take a look at this [PR](https://github.com/huggingface/diffusers/pull/5541) which recommends for ODE/SDE solvers: + - set `use_karras_sigmas=True` or `lu_lambdas=True` to improve image quality + - set `euler_at_final=True` if you're using a solver with uniform step sizes (DPM++2M or DPM++2M SDE) +- Most SDXL checkpoints work best with an image size of 1024x1024. Image sizes of 768x768 and 512x512 are also supported, but the results aren't as good. Anything below 512x512 is not recommended and likely won't be for default checkpoints like [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0). +- SDXL can pass a different prompt for each of the text encoders it was trained on. We can even pass different parts of the same prompt to the text encoders. +- SDXL output images can be improved by making use of a refiner model in an image-to-image setting. +- SDXL offers `negative_original_size`, `negative_crops_coords_top_left`, and `negative_target_size` to negatively condition the model on image resolution and cropping parameters. + +!!! tip + + To learn how to use SDXL for various tasks, how to optimize performance, and other usage examples, take a look at the [Stable Diffusion XL](../../../using-diffusers/sdxl.md) guide. + + Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the official base and refiner model checkpoints! + +::: mindone.diffusers.StableDiffusionXLPipeline + +::: mindone.diffusers.StableDiffusionXLImg2ImgPipeline + +::: mindone.diffusers.StableDiffusionXLInpaintPipeline diff --git a/docs/diffusers/api/pipelines/stable_diffusion/svd.md b/docs/diffusers/api/pipelines/stable_diffusion/svd.md new file mode 100644 index 0000000000..dccc7a16f0 --- /dev/null +++ b/docs/diffusers/api/pipelines/stable_diffusion/svd.md @@ -0,0 +1,35 @@ + + +# Stable Video Diffusion + +Stable Video Diffusion was proposed in [Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets](https://hf.co/papers/2311.15127) by Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach. + +The abstract from the paper is: + +*We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets. However, training methods in the literature vary widely, and the field has yet to agree on a unified strategy for curating video data. In this paper, we identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning. 
Furthermore, we demonstrate the necessity of a well-curated pretraining dataset for generating high-quality videos and present a systematic curation process to train a strong base model, including captioning and filtering strategies. We then explore the impact of finetuning our base model on high-quality data and train a text-to-video model that is competitive with closed-source video generation. We also show that our base model provides a powerful motion representation for downstream tasks such as image-to-video generation and adaptability to camera motion-specific LoRA modules. Finally, we demonstrate that our model provides a strong multi-view 3D-prior and can serve as a base to finetune a multi-view diffusion model that jointly generates multiple views of objects in a feedforward fashion, outperforming image-based methods at a fraction of their compute budget. We release code and model weights at this https URL.* + +!!! tip + + To learn how to use Stable Video Diffusion, take a look at the [Stable Video Diffusion](../../../using-diffusers/svd.md) guide. + + Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the [base](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid) and [extended frame](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt) checkpoints! + +## Tips + +Video generation is memory-intensive and one way to reduce your memory usage is to set `enable_forward_chunking` on the pipeline's UNet so you don't run the entire feedforward layer at once. Breaking it up into chunks in a loop is more efficient. + +Check out the [Text or image-to-video](text-img2vid.md) guide for more details about how certain parameters can affect video generation and how to optimize inference by reducing memory usage. + +::: mindone.diffusers.StableVideoDiffusionPipeline + +::: mindone.diffusers.pipelines.stable_video_diffusion.StableVideoDiffusionPipelineOutput diff --git a/docs/diffusers/api/pipelines/stable_diffusion/text2img.md b/docs/diffusers/api/pipelines/stable_diffusion/text2img.md new file mode 100644 index 0000000000..adfa109029 --- /dev/null +++ b/docs/diffusers/api/pipelines/stable_diffusion/text2img.md @@ -0,0 +1,29 @@ + + +# Text-to-image + +The Stable Diffusion model was created by researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/), [Runway](https://github.com/runwayml), and [LAION](https://laion.ai/). The [`StableDiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/text2img/#mindone.diffusers.StableDiffusionPipeline) is capable of generating photorealistic images given any text input. It's trained on 512x512 images from a subset of the LAION-5B dataset. This model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight. Latent diffusion is the research on top of which Stable Diffusion was built. It was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://huggingface.co/papers/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer. + +The abstract from the paper is: + +*By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. 
Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs. Code is available at https://github.com/CompVis/latent-diffusion.* + +!!! tip + + Make sure to check out the Stable Diffusion [Tips](overview.md#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! + + If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! + +::: mindone.diffusers.StableDiffusionPipeline + +::: mindone.diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/docs/diffusers/api/pipelines/stable_diffusion/upscale.md b/docs/diffusers/api/pipelines/stable_diffusion/upscale.md new file mode 100644 index 0000000000..8d6bb60108 --- /dev/null +++ b/docs/diffusers/api/pipelines/stable_diffusion/upscale.md @@ -0,0 +1,25 @@ + + +# Super-resolution + +The Stable Diffusion upscaler diffusion model was created by the researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/), and [LAION](https://laion.ai/). It is used to enhance the resolution of input images by a factor of 4. + +!!! tip + + Make sure to check out the Stable Diffusion [Tips](overview.md#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! + + If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! 
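+
+A short usage sketch, mirroring the super-resolution example on the Stable Diffusion 2 page:
+
+```python
+import mindspore as ms
+from mindone.diffusers import StableDiffusionUpscalePipeline
+from mindone.diffusers.utils import load_image
+
+pipeline = StableDiffusionUpscalePipeline.from_pretrained("stabilityai/stable-diffusion-x4-upscaler", mindspore_dtype=ms.float16)
+
+url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-upscale/low_res_cat.png"
+low_res_img = load_image(url).resize((128, 128))
+
+# the upscaler enhances the 128x128 input by a factor of 4, to 512x512
+upscaled_image = pipeline(prompt="a white cat", image=low_res_img)[0][0]
+upscaled_image.save("upsampled_cat.png")
+```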
+ +::: mindone.diffusers.StableDiffusionUpscalePipeline + +::: mindone.diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/docs/diffusers/api/schedulers/cm_stochastic_iterative.md b/docs/diffusers/api/schedulers/cm_stochastic_iterative.md new file mode 100644 index 0000000000..a6d03f4598 --- /dev/null +++ b/docs/diffusers/api/schedulers/cm_stochastic_iterative.md @@ -0,0 +1,25 @@ + + +# CMStochasticIterativeScheduler + +[Consistency Models](https://huggingface.co/papers/2303.01469) by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever introduced a multistep and onestep scheduler (Algorithm 1) that is capable of generating good samples in one or a small number of steps. + +The abstract from the paper is: + +*Diffusion models have significantly advanced the fields of image, audio, and video generation, but they depend on an iterative sampling process that causes slow generation. To overcome this limitation, we propose consistency models, a new family of models that generate high quality samples by directly mapping noise to data. They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality. They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either by distilling pre-trained diffusion models, or as standalone generative models altogether. Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step sampling, achieving the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step generation. When trained in isolation, consistency models become a new family of generative models that can outperform existing one-step, non-adversarial generative models on standard benchmarks such as CIFAR-10, ImageNet 64x64 and LSUN 256x256.* + +The original codebase can be found at [openai/consistency_models](https://github.com/openai/consistency_models). + +::: mindone.diffusers.CMStochasticIterativeScheduler + +::: mindone.diffusers.schedulers.scheduling_consistency_models.CMStochasticIterativeSchedulerOutput diff --git a/docs/diffusers/api/schedulers/consistency_decoder.md b/docs/diffusers/api/schedulers/consistency_decoder.md new file mode 100644 index 0000000000..00a71a2883 --- /dev/null +++ b/docs/diffusers/api/schedulers/consistency_decoder.md @@ -0,0 +1,19 @@ + + +# ConsistencyDecoderScheduler + +This scheduler is a part of the [`ConsistencyDecoderPipeline`] and was introduced in [DALL-E 3](https://openai.com/dall-e-3). + +The original codebase can be found at [openai/consistency_models](https://github.com/openai/consistency_models). + +::: mindone.diffusers.schedulers.scheduling_consistency_decoder.ConsistencyDecoderScheduler diff --git a/docs/diffusers/api/schedulers/ddim.md b/docs/diffusers/api/schedulers/ddim.md new file mode 100644 index 0000000000..ec2d0e3d77 --- /dev/null +++ b/docs/diffusers/api/schedulers/ddim.md @@ -0,0 +1,77 @@ + + +# DDIMScheduler + +[Denoising Diffusion Implicit Models](https://huggingface.co/papers/2010.02502) (DDIM) by Jiaming Song, Chenlin Meng and Stefano Ermon. + +The abstract from the paper is: + +*Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. 
+To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models +with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. +We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from. +We empirically demonstrate that DDIMs can produce high quality samples 10× to 50× faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.* + +The original codebase of this paper can be found at [ermongroup/ddim](https://github.com/ermongroup/ddim), and you can contact the author on [tsong.me](https://tsong.me/). + +## Tips + +The paper [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://huggingface.co/papers/2305.08891) claims that a mismatch between the training and inference settings leads to suboptimal inference generation results for Stable Diffusion. To fix this, the authors propose: + +!!! warning + + 🧪 This is an experimental feature! + +1. rescale the noise schedule to enforce zero terminal signal-to-noise ratio (SNR) + +```py +pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, rescale_betas_zero_snr=True) +``` + +2. train a model with `v_prediction` (add the following argument to the [train_text_to_image.py](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image.py) or [train_text_to_image_lora.py](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image_lora.py) scripts) + +```bash +--prediction_type="v_prediction" +``` + +3. change the sampler to always start from the last timestep + +```py +pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing") +``` + +4. rescale classifier-free guidance to prevent over-exposure + +```py +image = pipe(prompt, guidance_rescale=0.7)[0][0] +``` + +For example: + +```py +from mindone.diffusers import DiffusionPipeline, DDIMScheduler +import mindspore as ms + +pipe = DiffusionPipeline.from_pretrained("ptx0/pseudo-journey-v2", mindspore_dtype=ms.float16) +pipe.scheduler = DDIMScheduler.from_config( + pipe.scheduler.config, rescale_betas_zero_snr=True, timestep_spacing="trailing" +) + +prompt = "A lion in galaxies, spirals, nebulae, stars, smoke, iridescent, intricate detail, octane render, 8k" +image = pipe(prompt, guidance_rescale=0.7)[0][0] +image +``` + +::: mindone.diffusers.DDIMScheduler + +::: mindone.diffusers.schedulers.scheduling_ddim.DDIMSchedulerOutput diff --git a/docs/diffusers/api/schedulers/ddim_inverse.md b/docs/diffusers/api/schedulers/ddim_inverse.md new file mode 100644 index 0000000000..fa05d39142 --- /dev/null +++ b/docs/diffusers/api/schedulers/ddim_inverse.md @@ -0,0 +1,18 @@ + + +# DDIMInverseScheduler + +`DDIMInverseScheduler` is the inverted scheduler from [Denoising Diffusion Implicit Models](https://huggingface.co/papers/2010.02502) (DDIM) by Jiaming Song, Chenlin Meng and Stefano Ermon. +The implementation is mostly based on the DDIM inversion definition from [Null-text Inversion for Editing Real Images using Guided Diffusion Models](https://huggingface.co/papers/2211.09794). 
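+As a rough sketch (not an official recipe), the inverse scheduler is usually created from the same configuration as the forward `DDIMScheduler` of an already-loaded pipeline `pipe`, so that inversion and the subsequent denoising share one noise schedule:
+
+```py
+from mindone.diffusers import DDIMInverseScheduler, DDIMScheduler
+
+# Keep the forward and inverse schedulers in sync by building both
+# from the pipeline's existing scheduler config.
+pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
+inverse_scheduler = DDIMInverseScheduler.from_config(pipe.scheduler.config)
+```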
+ +::: mindone.diffusers.DDIMInverseScheduler diff --git a/docs/diffusers/api/schedulers/ddpm.md b/docs/diffusers/api/schedulers/ddpm.md new file mode 100644 index 0000000000..f53b92c51d --- /dev/null +++ b/docs/diffusers/api/schedulers/ddpm.md @@ -0,0 +1,23 @@ + + +# DDPMScheduler + +[Denoising Diffusion Probabilistic Models](https://huggingface.co/papers/2006.11239) (DDPM) by Jonathan Ho, Ajay Jain and Pieter Abbeel proposes a diffusion-based model of the same name. In the context of the 🤗 Diffusers library, DDPM refers to the discrete denoising scheduler from the paper as well as the pipeline. + +The abstract from the paper is: + +*We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN. Our implementation is available at [this https URL](https://github.com/hojonathanho/diffusion).* + +::: mindone.diffusers.DDPMScheduler + +::: mindone.diffusers.schedulers.scheduling_ddpm.DDPMSchedulerOutput diff --git a/docs/diffusers/api/schedulers/deis.md b/docs/diffusers/api/schedulers/deis.md new file mode 100644 index 0000000000..c9b2b2a819 --- /dev/null +++ b/docs/diffusers/api/schedulers/deis.md @@ -0,0 +1,30 @@ + + +# DEISMultistepScheduler + +Diffusion Exponential Integrator Sampler (DEIS) is proposed in [Fast Sampling of Diffusion Models with Exponential Integrator](https://huggingface.co/papers/2204.13902) by Qinsheng Zhang and Yongxin Chen. `DEISMultistepScheduler` is a fast high-order solver for diffusion ordinary differential equations (ODEs). + +This implementation modifies the polynomial fitting formula in log-rho space instead of the original linear `t` space in the DEIS paper. The modification enjoys closed-form coefficients for exponential multistep update instead of relying on the numerical solver. + +The abstract from the paper is: + +*The past few years have witnessed the great success of Diffusion models~(DMs) in generating high-fidelity samples in generative modeling tasks. A major limitation of the DM is its notoriously slow sampling procedure which normally requires hundreds to thousands of time discretization steps of the learned diffusion process to reach the desired accuracy. Our goal is to develop a fast sampling method for DMs with a much less number of steps while retaining high sample quality. To this end, we systematically analyze the sampling procedure in DMs and identify key factors that affect the sample quality, among which the method of discretization is most crucial. By carefully examining the learned diffusion process, we propose Diffusion Exponential Integrator Sampler~(DEIS). It is based on the Exponential Integrator designed for discretizing ordinary differential equations (ODEs) and leverages a semilinear structure of the learned diffusion process to reduce the discretization error. The proposed method can be applied to any DMs and can generate high-fidelity samples in as few as 10 steps.
In our experiments, it takes about 3 minutes on one A6000 GPU to generate 50k images from CIFAR10. Moreover, by directly using pre-trained DMs, we achieve the state-of-art sampling performance when the number of score function evaluation~(NFE) is limited, e.g., 4.17 FID with 10 NFEs, 3.37 FID, and 9.74 IS with only 15 NFEs on CIFAR10. Code is available at [this https URL](https://github.com/qsh-zh/deis).* + +## Tips + +It is recommended to set `solver_order` to 2 or 3, while `solver_order=1` is equivalent to [`DDIMScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/ddim/#mindone.diffusers.DDIMScheduler). + +Dynamic thresholding from [Imagen](https://huggingface.co/papers/2205.11487) is supported, and for pixel-space +diffusion models, you can set `thresholding=True` to use the dynamic thresholding. + +::: mindone.diffusers.schedulers.scheduling_deis_multistep.DEISMultistepScheduler diff --git a/docs/diffusers/api/schedulers/dpm_discrete.md b/docs/diffusers/api/schedulers/dpm_discrete.md new file mode 100644 index 0000000000..c0792519db --- /dev/null +++ b/docs/diffusers/api/schedulers/dpm_discrete.md @@ -0,0 +1,19 @@ + + +# KDPM2DiscreteScheduler + +The `KDPM2DiscreteScheduler` is inspired by the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) paper, and the scheduler is ported from and created by [Katherine Crowson](https://github.com/crowsonkb/). + +The original codebase can be found at [crowsonkb/k-diffusion](https://github.com/crowsonkb/k-diffusion). + +::: mindone.diffusers.KDPM2DiscreteScheduler diff --git a/docs/diffusers/api/schedulers/dpm_discrete_ancestral.md b/docs/diffusers/api/schedulers/dpm_discrete_ancestral.md new file mode 100644 index 0000000000..90fccef17b --- /dev/null +++ b/docs/diffusers/api/schedulers/dpm_discrete_ancestral.md @@ -0,0 +1,19 @@ + + +# KDPM2AncestralDiscreteScheduler + +The `KDPM2DiscreteScheduler` with ancestral sampling is inspired by the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) paper, and the scheduler is ported from and created by [Katherine Crowson](https://github.com/crowsonkb/). + +The original codebase can be found at [crowsonkb/k-diffusion](https://github.com/crowsonkb/k-diffusion). + +::: mindone.diffusers.KDPM2AncestralDiscreteScheduler diff --git a/docs/diffusers/api/schedulers/edm_euler.md b/docs/diffusers/api/schedulers/edm_euler.md new file mode 100644 index 0000000000..0079f0ab07 --- /dev/null +++ b/docs/diffusers/api/schedulers/edm_euler.md @@ -0,0 +1,20 @@ + + +# EDMEulerScheduler + +The Karras formulation of the Euler scheduler (Algorithm 2) from the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) paper by Karras et al. This is a fast scheduler which can often generate good outputs in 20-30 steps. The scheduler is based on the original [k-diffusion](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L51) implementation by [Katherine Crowson](https://github.com/crowsonkb/). 
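+As a small illustrative sketch (assuming the default configuration), the scheduler can be instantiated on its own to inspect the Karras sigma schedule it would use for a 30-step run:
+
+```py
+from mindone.diffusers import EDMEulerScheduler
+
+# Default EDM settings; set_timesteps builds the sigma schedule for 30 steps.
+scheduler = EDMEulerScheduler()
+scheduler.set_timesteps(30)
+print(scheduler.sigmas)
+```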
+ + +::: mindone.diffusers.EDMEulerScheduler + +::: mindone.diffusers.schedulers.scheduling_edm_euler.EDMEulerSchedulerOutput diff --git a/docs/diffusers/api/schedulers/edm_multistep_dpm_solver.md b/docs/diffusers/api/schedulers/edm_multistep_dpm_solver.md new file mode 100644 index 0000000000..b446b15a69 --- /dev/null +++ b/docs/diffusers/api/schedulers/edm_multistep_dpm_solver.md @@ -0,0 +1,20 @@ + + +# EDMDPMSolverMultistepScheduler + +`EDMDPMSolverMultistepScheduler` is a [Karras formulation](https://huggingface.co/papers/2206.00364) of `DPMSolverMultistepScheduler`, a multistep scheduler from [DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps](https://huggingface.co/papers/2206.00927) and [DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models](https://huggingface.co/papers/2211.01095) by Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. + +DPMSolver (and the improved version DPMSolver++) is a fast dedicated high-order solver for diffusion ODEs with convergence order guarantee. Empirically, DPMSolver sampling with only 20 steps can generate high-quality +samples, and it can generate quite good samples even in 10 steps. + +::: mindone.diffusers.EDMDPMSolverMultistepScheduler diff --git a/docs/diffusers/api/schedulers/euler.md b/docs/diffusers/api/schedulers/euler.md new file mode 100644 index 0000000000..03d43823a5 --- /dev/null +++ b/docs/diffusers/api/schedulers/euler.md @@ -0,0 +1,20 @@ + + +# EulerDiscreteScheduler + +The Euler scheduler (Algorithm 2) is from the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) paper by Karras et al. This is a fast scheduler which can often generate good outputs in 20-30 steps. The scheduler is based on the original [k-diffusion](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L51) implementation by [Katherine Crowson](https://github.com/crowsonkb/). + + +::: mindone.diffusers.EulerDiscreteScheduler + +::: mindone.diffusers.schedulers.scheduling_euler_discrete.EulerDiscreteSchedulerOutput diff --git a/docs/diffusers/api/schedulers/euler_ancestral.md b/docs/diffusers/api/schedulers/euler_ancestral.md new file mode 100644 index 0000000000..7a0d173742 --- /dev/null +++ b/docs/diffusers/api/schedulers/euler_ancestral.md @@ -0,0 +1,19 @@ + + +# EulerAncestralDiscreteScheduler + +A scheduler that uses ancestral sampling with Euler method steps. This is a fast scheduler which can often generate good outputs in 20-30 steps. The scheduler is based on the original [k-diffusion](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L72) implementation by [Katherine Crowson](https://github.com/crowsonkb/). + +::: mindone.diffusers.EulerAncestralDiscreteScheduler + +::: mindone.diffusers.schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteSchedulerOutput diff --git a/docs/diffusers/api/schedulers/flow_match_euler_discrete.md b/docs/diffusers/api/schedulers/flow_match_euler_discrete.md new file mode 100644 index 0000000000..02de120f69 --- /dev/null +++ b/docs/diffusers/api/schedulers/flow_match_euler_discrete.md @@ -0,0 +1,17 @@ + + +# FlowMatchEulerDiscreteScheduler + +`FlowMatchEulerDiscreteScheduler` is based on the flow-matching sampling introduced in [Stable Diffusion 3](https://arxiv.org/abs/2403.03206). 
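+As a minimal sketch (assuming the default configuration), the scheduler can be constructed directly to inspect the discrete timesteps it would iterate over for a 28-step flow-matching sampling run:
+
+```py
+from mindone.diffusers import FlowMatchEulerDiscreteScheduler
+
+# Build the schedule for 28 inference steps and inspect the timesteps.
+scheduler = FlowMatchEulerDiscreteScheduler()
+scheduler.set_timesteps(28)
+print(scheduler.timesteps)
+```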
+ +::: mindone.diffusers.FlowMatchEulerDiscreteScheduler diff --git a/docs/diffusers/api/schedulers/heun.md b/docs/diffusers/api/schedulers/heun.md new file mode 100644 index 0000000000..28f5238e97 --- /dev/null +++ b/docs/diffusers/api/schedulers/heun.md @@ -0,0 +1,17 @@ + + +# HeunDiscreteScheduler + +The Heun scheduler (Algorithm 1) is from the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) paper by Karras et al. The scheduler is ported from the [k-diffusion](https://github.com/crowsonkb/k-diffusion) library and created by [Katherine Crowson](https://github.com/crowsonkb/). + +::: mindone.diffusers.HeunDiscreteScheduler diff --git a/docs/diffusers/api/schedulers/ipndm.md b/docs/diffusers/api/schedulers/ipndm.md new file mode 100644 index 0000000000..a5d4771426 --- /dev/null +++ b/docs/diffusers/api/schedulers/ipndm.md @@ -0,0 +1,17 @@ + + +# IPNDMScheduler + +`IPNDMScheduler` is a fourth-order Improved Pseudo Linear Multistep scheduler. The original implementation can be found at [crowsonkb/v-diffusion-pytorch](https://github.com/crowsonkb/v-diffusion-pytorch/blob/987f8985e38208345c1959b0ea767a625831cc9b/diffusion/sampling.py#L296). + +::: mindone.diffusers.IPNDMScheduler diff --git a/docs/diffusers/api/schedulers/lcm.md b/docs/diffusers/api/schedulers/lcm.md new file mode 100644 index 0000000000..9b3db0bcbe --- /dev/null +++ b/docs/diffusers/api/schedulers/lcm.md @@ -0,0 +1,20 @@ + + +# Latent Consistency Model Multistep Scheduler + +## Overview + +Multistep and onestep scheduler (Algorithm 3) introduced alongside latent consistency models in the paper [Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference](https://arxiv.org/abs/2310.04378) by Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. +This scheduler should be able to generate good samples from [`LatentConsistencyModelPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/latent_consistency_models/#mindone.diffusers.LatentConsistencyModelPipeline) in 1-8 steps. + +::: mindone.diffusers.LCMScheduler diff --git a/docs/diffusers/api/schedulers/lms_discrete.md b/docs/diffusers/api/schedulers/lms_discrete.md new file mode 100644 index 0000000000..bac9199636 --- /dev/null +++ b/docs/diffusers/api/schedulers/lms_discrete.md @@ -0,0 +1,19 @@ + + +# LMSDiscreteScheduler + +`LMSDiscreteScheduler` is a linear multistep scheduler for discrete beta schedules. The scheduler is ported from and created by [Katherine Crowson](https://github.com/crowsonkb/), and the original implementation can be found at [crowsonkb/k-diffusion](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L181). 
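+A common way to use it is to swap it into an existing pipeline via `from_config`, mirroring the scheduler-swapping pattern used elsewhere in these docs; the checkpoint name below is only an example:
+
+```py
+import mindspore as ms
+from mindone.diffusers import DiffusionPipeline, LMSDiscreteScheduler
+
+pipe = DiffusionPipeline.from_pretrained(
+    "stable-diffusion-v1-5/stable-diffusion-v1-5", mindspore_dtype=ms.float16
+)
+# Reuse the pipeline's existing scheduler config so the beta schedule matches.
+pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)
+```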
+ +::: mindone.diffusers.LMSDiscreteScheduler + +::: mindone.diffusers.schedulers.scheduling_lms_discrete.LMSDiscreteSchedulerOutput diff --git a/docs/diffusers/api/schedulers/multistep_dpm_solver.md b/docs/diffusers/api/schedulers/multistep_dpm_solver.md new file mode 100644 index 0000000000..af0184952f --- /dev/null +++ b/docs/diffusers/api/schedulers/multistep_dpm_solver.md @@ -0,0 +1,31 @@ + + +# DPMSolverMultistepScheduler + +`DPMSolverMultistepScheduler` is a multistep scheduler from [DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps](https://huggingface.co/papers/2206.00927) and [DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models](https://huggingface.co/papers/2211.01095) by Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. + +DPMSolver (and the improved version DPMSolver++) is a fast dedicated high-order solver for diffusion ODEs with convergence order guarantee. Empirically, DPMSolver sampling with only 20 steps can generate high-quality +samples, and it can generate quite good samples even in 10 steps. + +## Tips + +It is recommended to set `solver_order` to 2 for guided sampling, and `solver_order=3` for unconditional sampling. + +Dynamic thresholding from [Imagen](https://huggingface.co/papers/2205.11487) is supported, and for pixel-space +diffusion models, you can set both `algorithm_type="dpmsolver++"` and `thresholding=True` to use dynamic +thresholding. This thresholding method is unsuitable for latent-space diffusion models such as +Stable Diffusion. + +The SDE variant of DPMSolver and DPM-Solver++ is also supported, but only for the first and second-order solvers. This is a fast SDE solver for the reverse diffusion SDE. It is recommended to use the second-order `sde-dpmsolver++`. + +::: mindone.diffusers.DPMSolverMultistepScheduler diff --git a/docs/diffusers/api/schedulers/multistep_dpm_solver_inverse.md b/docs/diffusers/api/schedulers/multistep_dpm_solver_inverse.md new file mode 100644 index 0000000000..8731dfdd18 --- /dev/null +++ b/docs/diffusers/api/schedulers/multistep_dpm_solver_inverse.md @@ -0,0 +1,26 @@ + + +# DPMSolverMultistepInverse + +`DPMSolverMultistepInverse` is the inverted scheduler from [DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps](https://huggingface.co/papers/2206.00927) and [DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models](https://huggingface.co/papers/2211.01095) by Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. + +The implementation is mostly based on the DDIM inversion definition of [Null-text Inversion for Editing Real Images using Guided Diffusion Models](https://huggingface.co/papers/2211.09794) and the notebook implementation of the [`DiffEdit`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/diffedit/#mindone.diffusers.StableDiffusionDiffEditPipeline) latent inversion from [Xiang-cd/DiffEdit-stable-diffusion](https://github.com/Xiang-cd/DiffEdit-stable-diffusion/blob/main/diffedit.ipynb). + +## Tips + +Dynamic thresholding from [Imagen](https://huggingface.co/papers/2205.11487) is supported, and for pixel-space +diffusion models, you can set both `algorithm_type="dpmsolver++"` and `thresholding=True` to use dynamic +thresholding. This thresholding method is unsuitable for latent-space diffusion models such as +Stable Diffusion.
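+As a hedged sketch of the tip above (assuming an already-loaded pixel-space pipeline `pipe`), the inverse scheduler can be built from the forward scheduler's config with dynamic thresholding enabled:
+
+```py
+from mindone.diffusers import DPMSolverMultistepInverseScheduler
+
+# Keyword overrides passed to from_config take precedence over the copied config.
+inverse_scheduler = DPMSolverMultistepInverseScheduler.from_config(
+    pipe.scheduler.config, algorithm_type="dpmsolver++", thresholding=True
+)
+```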
+ +::: mindone.diffusers.DPMSolverMultistepInverseScheduler diff --git a/docs/diffusers/api/schedulers/overview.md b/docs/diffusers/api/schedulers/overview.md new file mode 100644 index 0000000000..3f39929839 --- /dev/null +++ b/docs/diffusers/api/schedulers/overview.md @@ -0,0 +1,64 @@ + + +# Schedulers + +🤗 Diffusers provides many scheduler functions for the diffusion process. A scheduler takes a model's output (the sample which the diffusion process is iterating on) and a timestep to return a denoised sample. The timestep is important because it dictates where in the diffusion process the step is; data is generated by iterating forward *n* timesteps and inference occurs by propagating backward through the timesteps. Based on the timestep, a scheduler may be *discrete* in which case the timestep is an `int` or *continuous* in which case the timestep is a `float`. + +Depending on the context, a scheduler defines how to iteratively add noise to an image or how to update a sample based on a model's output: + +- during *training*, a scheduler adds noise (there are different algorithms for how to add noise) to a sample to train a diffusion model +- during *inference*, a scheduler defines how to update a sample based on a pretrained model's output + +Many schedulers are implemented from the [k-diffusion](https://github.com/crowsonkb/k-diffusion) library by [Katherine Crowson](https://github.com/crowsonkb/), and they're also widely used in A1111. To help you map the schedulers from k-diffusion and A1111 to the schedulers in 🤗 Diffusers, take a look at the table below: + +| A1111/k-diffusion | 🤗 Diffusers | Usage | +|---------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------| +| DPM++ 2M | [`DPMSolverMultistepScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/multistep_dpm_solver/#mindone.diffusers.DPMSolverMultistepScheduler) | | +| DPM++ 2M Karras | [`DPMSolverMultistepScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/multistep_dpm_solver/#mindone.diffusers.DPMSolverMultistepScheduler) | init with `use_karras_sigmas=True` | +| DPM++ 2M SDE | [`DPMSolverMultistepScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/multistep_dpm_solver/#mindone.diffusers.DPMSolverMultistepScheduler) | init with `algorithm_type="sde-dpmsolver++"` | +| DPM++ 2M SDE Karras | [`DPMSolverMultistepScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/multistep_dpm_solver/#mindone.diffusers.DPMSolverMultistepScheduler) | init with `use_karras_sigmas=True` and `algorithm_type="sde-dpmsolver++"` | +| DPM++ 2S a | N/A | very similar to `DPMSolverSinglestepScheduler` | +| DPM++ 2S a Karras | N/A | very similar to `DPMSolverSinglestepScheduler(use_karras_sigmas=True, ...)` | +| DPM++ SDE | [`DPMSolverSinglestepScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/singlestep_dpm_solver/#mindone.diffusers.DPMSolverSinglestepScheduler) | | +| DPM++ SDE Karras | [`DPMSolverSinglestepScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/singlestep_dpm_solver/#mindone.diffusers.DPMSolverSinglestepScheduler) | init with `use_karras_sigmas=True` | +| DPM2 | 
[`KDPM2DiscreteScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/dpm_discrete/#mindone.diffusers.KDPM2DiscreteScheduler) | | +| DPM2 Karras | [`KDPM2DiscreteScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/dpm_discrete/#mindone.diffusers.KDPM2DiscreteScheduler) | init with `use_karras_sigmas=True` | +| DPM2 a | [`KDPM2AncestralDiscreteScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/dpm_discrete_ancestral/#mindone.diffusers.KDPM2AncestralDiscreteScheduler) | | +| DPM2 a Karras | [`KDPM2AncestralDiscreteScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/dpm_discrete_ancestral/#mindone.diffusers.KDPM2AncestralDiscreteScheduler) | init with `use_karras_sigmas=True` | +| DPM adaptive | N/A | | +| DPM fast | N/A | | +| Euler | [`EulerDiscreteScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/euler/#mindone.diffusers.EulerDiscreteScheduler) | | +| Euler a | [`EulerAncestralDiscreteScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/euler_ancestral/#mindone.diffusers.EulerAncestralDiscreteScheduler) | | +| Heun | [`HeunDiscreteScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/heun/#mindone.diffusers.HeunDiscreteScheduler) | | +| LMS | [`LMSDiscreteScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/lms_discrete/#mindone.diffusers.LMSDiscreteScheduler) | | +| LMS Karras | [`LMSDiscreteScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/lms_discrete/#mindone.diffusers.LMSDiscreteScheduler) | init with `use_karras_sigmas=True` | +| N/A | [`DEISMultistepScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/deis/#mindone.diffusers.schedulers.scheduling_deis_multistep.DEISMultistepScheduler) | | +| N/A | [`UniPCMultistepScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/unipc/#mindone.diffusers.UniPCMultistepScheduler) | | + +All schedulers are built from the base [`SchedulerMixin`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/overview/#mindone.diffusers.SchedulerMixin) class which implements low level utilities shared by all schedulers. + +::: mindone.diffusers.SchedulerMixin + options: + members: + - from_pretrained + - save_pretrained + +::: mindone.diffusers.schedulers.scheduling_utils.SchedulerOutput + +## KarrasDiffusionSchedulers + +[`KarrasDiffusionSchedulers`] are a broad generalization of schedulers in 🤗 Diffusers. The schedulers in this class are distinguished at a high level by their noise sampling strategy, the type of network and scaling, the training strategy, and how the loss is weighed. + +The different schedulers in this class, depending on the ordinary differential equations (ODE) solver type, fall into the above taxonomy and provide a good abstraction for the design of the main schedulers implemented in 🤗 Diffusers. The schedulers in this class are given [here](https://github.com/mindspore-lab/mindone/blob/master/mindone/diffusers/schedulers/scheduling_utils.py#L32). 
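+Schedulers in this class can usually be swapped for one another on a loaded pipeline. For example, following the mapping table above, A1111's "DPM++ 2M Karras" corresponds to `DPMSolverMultistepScheduler` initialized with `use_karras_sigmas=True`; a minimal sketch (assuming an already-loaded pipeline `pipe`):
+
+```py
+from mindone.diffusers import DPMSolverMultistepScheduler
+
+# Equivalent of A1111's "DPM++ 2M Karras" according to the table above.
+pipe.scheduler = DPMSolverMultistepScheduler.from_config(
+    pipe.scheduler.config, use_karras_sigmas=True
+)
+```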
+ +::: mindone.diffusers.utils.PushToHubMixin diff --git a/docs/diffusers/api/schedulers/pndm.md b/docs/diffusers/api/schedulers/pndm.md new file mode 100644 index 0000000000..6e353a5b16 --- /dev/null +++ b/docs/diffusers/api/schedulers/pndm.md @@ -0,0 +1,17 @@ + + +# PNDMScheduler + +`PNDMScheduler`, or pseudo numerical methods for diffusion models, uses more advanced ODE integration techniques like the Runge-Kutta and linear multi-step method. The original implementation can be found at [crowsonkb/k-diffusion](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L181). + +::: mindone.diffusers.PNDMScheduler diff --git a/docs/diffusers/api/schedulers/repaint.md b/docs/diffusers/api/schedulers/repaint.md new file mode 100644 index 0000000000..93ee9a8444 --- /dev/null +++ b/docs/diffusers/api/schedulers/repaint.md @@ -0,0 +1,25 @@ + + +# RePaintScheduler + +`RePaintScheduler` is a DDPM-based inpainting scheduler for unsupervised inpainting with extreme masks. It is designed to be used with the [`RePaintPipeline`], and it is based on the paper [RePaint: Inpainting using Denoising Diffusion Probabilistic Models](https://huggingface.co/papers/2201.09865) by Andreas Lugmayr et al. + +The abstract from the paper is: + +*Free-form inpainting is the task of adding new content to an image in the regions specified by an arbitrary binary mask. Most existing approaches train for a certain distribution of masks, which limits their generalization capabilities to unseen mask types. Furthermore, training with pixel-wise and perceptual losses often leads to simple textural extensions towards the missing areas instead of semantically meaningful generation. In this work, we propose RePaint: A Denoising Diffusion Probabilistic Model (DDPM) based inpainting approach that is applicable to even extreme masks. We employ a pretrained unconditional DDPM as the generative prior. To condition the generation process, we only alter the reverse diffusion iterations by sampling the unmasked regions using the given image information. Since this technique does not modify or condition the original DDPM network itself, the model produces high-quality and diverse output images for any inpainting form. We validate our method for both faces and general-purpose image inpainting using standard and extreme masks. RePaint outperforms state-of-the-art Autoregressive, and GAN approaches for at least five out of six mask distributions. GitHub Repository: [this http URL](http://git.io/RePaint).* + +The original implementation can be found at [andreas128/RePaint](https://github.com/andreas128/). + +::: mindone.diffusers.RePaintScheduler + +::: mindone.diffusers.schedulers.scheduling_repaint.RePaintSchedulerOutput diff --git a/docs/diffusers/api/schedulers/score_sde_ve.md b/docs/diffusers/api/schedulers/score_sde_ve.md new file mode 100644 index 0000000000..0a4c436242 --- /dev/null +++ b/docs/diffusers/api/schedulers/score_sde_ve.md @@ -0,0 +1,23 @@ + + +# ScoreSdeVeScheduler + +`ScoreSdeVeScheduler` is a variance exploding stochastic differential equation (SDE) scheduler. It was introduced in the [Score-Based Generative Modeling through Stochastic Differential Equations](https://huggingface.co/papers/2011.13456) paper by Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, Ben Poole. + +The abstract from the paper is: + +*Creating noise from data is easy; creating data from noise is generative modeling. 
We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the reverse-time SDE depends only on the time-dependent gradient field (\aka, score) of the perturbed data distribution. By leveraging advances in score-based generative modeling, we can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. We show that this framework encapsulates previous approaches in score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities. In particular, we introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE. We also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency. In addition, we provide a new way to solve inverse problems with score-based models, as demonstrated with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, we achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 2.99 bits/dim, and demonstrate high fidelity generation of 1024 x 1024 images for the first time from a score-based generative model.* + +::: mindone.diffusers.ScoreSdeVeScheduler + +::: mindone.diffusers.schedulers.scheduling_sde_ve.SdeVeOutput diff --git a/docs/diffusers/api/schedulers/score_sde_vp.md b/docs/diffusers/api/schedulers/score_sde_vp.md new file mode 100644 index 0000000000..8796d3d733 --- /dev/null +++ b/docs/diffusers/api/schedulers/score_sde_vp.md @@ -0,0 +1,25 @@ + + +# ScoreSdeVpScheduler + +`ScoreSdeVpScheduler` is a variance preserving stochastic differential equation (SDE) scheduler. It was introduced in the [Score-Based Generative Modeling through Stochastic Differential Equations](https://huggingface.co/papers/2011.13456) paper by Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, Ben Poole. + +The abstract from the paper is: + +*Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the reverse-time SDE depends only on the time-dependent gradient field (\aka, score) of the perturbed data distribution. By leveraging advances in score-based generative modeling, we can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. We show that this framework encapsulates previous approaches in score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities. In particular, we introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE. 
We also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency. In addition, we provide a new way to solve inverse problems with score-based models, as demonstrated with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, we achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 2.99 bits/dim, and demonstrate high fidelity generation of 1024 x 1024 images for the first time from a score-based generative model.* + +!!! warning + + 🚧 This scheduler is under construction! + +mindone.diffusers.schedulers.deprecated.scheduling_sde_vp.ScoreSdeVpScheduler diff --git a/docs/diffusers/api/schedulers/singlestep_dpm_solver.md b/docs/diffusers/api/schedulers/singlestep_dpm_solver.md new file mode 100644 index 0000000000..affccf1868 --- /dev/null +++ b/docs/diffusers/api/schedulers/singlestep_dpm_solver.md @@ -0,0 +1,31 @@ + + +# DPMSolverSinglestepScheduler + +`DPMSolverSinglestepScheduler` is a single step scheduler from [DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps](https://huggingface.co/papers/2206.00927) and [DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models](https://huggingface.co/papers/2211.01095) by Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. + +DPMSolver (and the improved version DPMSolver++) is a fast dedicated high-order solver for diffusion ODEs with convergence order guarantee. Empirically, DPMSolver sampling with only 20 steps can generate high-quality +samples, and it can generate quite good samples even in 10 steps. + +The original implementation can be found at [LuChengTHU/dpm-solver](https://github.com/LuChengTHU/dpm-solver). + +## Tips + +It is recommended to set `solver_order` to 2 for guided sampling, and `solver_order=3` for unconditional sampling. + +Dynamic thresholding from [Imagen](https://huggingface.co/papers/2205.11487) is supported, and for pixel-space +diffusion models, you can set both `algorithm_type="dpmsolver++"` and `thresholding=True` to use dynamic +thresholding. This thresholding method is unsuitable for latent-space diffusion models such as +Stable Diffusion. + +::: mindone.diffusers.DPMSolverSinglestepScheduler diff --git a/docs/diffusers/api/schedulers/tcd.md b/docs/diffusers/api/schedulers/tcd.md new file mode 100644 index 0000000000..d035664630 --- /dev/null +++ b/docs/diffusers/api/schedulers/tcd.md @@ -0,0 +1,26 @@ + + +# TCDScheduler + +[Trajectory Consistency Distillation](https://huggingface.co/papers/2402.19159) by Jianbin Zheng, Minghui Hu, Zhongyi Fan, Chaoyue Wang, Changxing Ding, Dacheng Tao and Tat-Jen Cham introduced a Strategic Stochastic Sampling (Algorithm 4) that is capable of generating good samples in a small number of steps. Distinguishing it as an advanced iteration of the multistep scheduler (Algorithm 1) in the [Consistency Models](https://huggingface.co/papers/2303.01469), Strategic Stochastic Sampling is specifically tailored for the trajectory consistency function. + +The abstract from the paper is: + +*Latent Consistency Model (LCM) extends the Consistency Model to the latent space and leverages the guided consistency distillation technique to achieve impressive performance in accelerating text-to-image synthesis.
However, we observed that LCM struggles to generate images with both clarity and detailed intricacy. To address this limitation, we initially delve into and elucidate the underlying causes. Our investigation identifies that the primary issue stems from errors in three distinct areas. Consequently, we introduce Trajectory Consistency Distillation (TCD), which encompasses trajectory consistency function and strategic stochastic sampling. The trajectory consistency function diminishes the distillation errors by broadening the scope of the self-consistency boundary condition and endowing the TCD with the ability to accurately trace the entire trajectory of the Probability Flow ODE. Additionally, strategic stochastic sampling is specifically designed to circumvent the accumulated errors inherent in multi-step consistency sampling, which is meticulously tailored to complement the TCD model. Experiments demonstrate that TCD not only significantly enhances image quality at low NFEs but also yields more detailed results compared to the teacher model at high NFEs.* + +The original codebase can be found at [jabir-zheng/TCD](https://github.com/jabir-zheng/TCD). + +::: mindone.diffusers.TCDScheduler + + +::: mindone.diffusers.schedulers.scheduling_tcd.TCDSchedulerOutput diff --git a/docs/diffusers/api/schedulers/unipc.md b/docs/diffusers/api/schedulers/unipc.md new file mode 100644 index 0000000000..bb52d614ca --- /dev/null +++ b/docs/diffusers/api/schedulers/unipc.md @@ -0,0 +1,31 @@ + + +# UniPCMultistepScheduler + +`UniPCMultistepScheduler` is a training-free framework designed for fast sampling of diffusion models. It was introduced in [UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models](https://huggingface.co/papers/2302.04867) by Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, Jiwen Lu. + +It consists of a corrector (UniC) and a predictor (UniP) that share a unified analytical form and support arbitrary orders. +UniPC is by design model-agnostic, supporting pixel-space/latent-space DPMs on unconditional/conditional sampling. It can also be applied to both noise prediction and data prediction models. The corrector UniC can be also applied after any off-the-shelf solvers to increase the order of accuracy. + +The abstract from the paper is: + +*Diffusion probabilistic models (DPMs) have demonstrated a very promising ability in high-resolution image synthesis. However, sampling from a pre-trained DPM is time-consuming due to the multiple evaluations of the denoising network, making it more and more important to accelerate the sampling of DPMs. Despite recent progress in designing fast samplers, existing methods still cannot generate satisfying images in many applications where fewer steps (e.g., <10) are favored. In this paper, we develop a unified corrector (UniC) that can be applied after any existing DPM sampler to increase the order of accuracy without extra model evaluations, and derive a unified predictor (UniP) that supports arbitrary order as a byproduct. Combining UniP and UniC, we propose a unified predictor-corrector framework called UniPC for the fast sampling of DPMs, which has a unified analytical form for any order and can significantly improve the sampling quality over previous methods, especially in extremely few steps. We evaluate our methods through extensive experiments including both unconditional and conditional sampling using pixel-space and latent-space DPMs. 
Our UniPC can achieve 3.87 FID on CIFAR10 (unconditional) and 7.51 FID on ImageNet 256×256 (conditional) with only 10 function evaluations. Code is available at [this https URL](https://github.com/wl-zhao/UniPC).* + +## Tips + +It is recommended to set `solver_order` to 2 for guided sampling, and `solver_order=3` for unconditional sampling. + +Dynamic thresholding from [Imagen](https://huggingface.co/papers/2205.11487) is supported, and for pixel-space +diffusion models, you can set both `predict_x0=True` and `thresholding=True` to use dynamic thresholding. This thresholding method is unsuitable for latent-space diffusion models such as Stable Diffusion. + +::: mindone.diffusers.UniPCMultistepScheduler diff --git a/docs/diffusers/api/schedulers/vq_diffusion.md b/docs/diffusers/api/schedulers/vq_diffusion.md new file mode 100644 index 0000000000..d8ed19c871 --- /dev/null +++ b/docs/diffusers/api/schedulers/vq_diffusion.md @@ -0,0 +1,23 @@ + + +# VQDiffusionScheduler + +`VQDiffusionScheduler` converts the transformer model's output into a sample for the unnoised image at the previous diffusion timestep. It was introduced in [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://huggingface.co/papers/2111.14822) by Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, Baining Guo. + +The abstract from the paper is: + +*We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation. This method is based on a vector quantized variational autoencoder (VQ-VAE) whose latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model (DDPM). We find that this latent-space method is well-suited for text-to-image generation tasks because it not only eliminates the unidirectional bias with existing methods but also allows us to incorporate a mask-and-replace diffusion strategy to avoid the accumulation of errors, which is a serious problem with existing methods. Our experiments show that the VQ-Diffusion produces significantly better text-to-image generation results when compared with conventional autoregressive (AR) models with similar numbers of parameters. Compared with previous GAN-based text-to-image methods, our VQ-Diffusion can handle more complex scenes and improve the synthesized image quality by a large margin. Finally, we show that the image generation computation in our method can be made highly efficient by reparameterization. With traditional AR methods, the text-to-image generation time increases linearly with the output image resolution and hence is quite time consuming even for normal size images. The VQ-Diffusion allows us to achieve a better trade-off between quality and speed. Our experiments indicate that the VQ-Diffusion model with the reparameterization is fifteen times faster than traditional AR methods while achieving a better image quality.* + +::: mindone.diffusers.VQDiffusionScheduler + +::: mindone.diffusers.schedulers.scheduling_vq_diffusion.VQDiffusionSchedulerOutput diff --git a/docs/diffusers/conceptual/philosophy.md b/docs/diffusers/conceptual/philosophy.md new file mode 100644 index 0000000000..2435079465 --- /dev/null +++ b/docs/diffusers/conceptual/philosophy.md @@ -0,0 +1,108 @@ + + +# Philosophy + +🧨 Diffusers provides **state-of-the-art** pretrained diffusion models across multiple modalities. +Its purpose is to serve as a **modular toolbox** for both inference and training.
+ +We aim at building a library that stands the test of time and therefore take API design very seriously. + +## Usability over Performance + +- While Diffusers has many built-in performance-enhancing features (see [Memory and Speed](../optimization/fp16.md)), models are always loaded with the highest precision and lowest optimization. Therefore, by default diffusion pipelines are always instantiated on specific hardware with float32 precision if not otherwise defined by the user. This ensures usability across different platforms and accelerators and means that no complex installations are required to run the library. +- Diffusers aims to be a **light-weight** package and therefore has very few required dependencies, but many soft dependencies that can improve performance (such as `safetensors`, etc...). We strive to keep the library as lightweight as possible so that it can be added without much concern as a dependency on other packages. +- Diffusers prefers simple, self-explainable code over condensed, magic code. This means that short-hand code syntaxes such as lambda functions, and advanced MindSpore operators are often not desired. + +## Simple over easy + +**Explicit is better than implicit** and **simple is better than complex**. This design philosophy is reflected in multiple parts of the library: +- We follow MindSpore's API with methods like [`DiffusionPipeline.to`](https://github.com/mindspore-lab/mindone/blob/master/mindone/diffusers/pipelines/pipeline_utils.py#L261) to let the user handle precision. +- Raising concise error messages is preferred to silently correct erroneous input. Diffusers aims at teaching the user, rather than making the library as easy to use as possible. +- Complex model vs. scheduler logic is exposed instead of magically handled inside. Schedulers/Samplers are separated from diffusion models with minimal dependencies on each other. This forces the user to write the unrolled denoising loop. However, the separation allows for easier debugging and gives the user more control over adapting the denoising process or switching out diffusion models or schedulers. +- Separately trained components of the diffusion pipeline, *e.g.* the text encoder, the unet, and the variational autoencoder, each have their own model class. This forces the user to handle the interaction between the different model components, and the serialization format separates the model components into different files. However, this allows for easier debugging and customization. DreamBooth or Textual Inversion training +is very simple thanks to Diffusers' ability to separate single components of the diffusion pipeline. + +## Tweakable, contributor-friendly over abstraction + +For large parts of the library, Diffusers adopts an important design principle of the [Transformers library](https://github.com/huggingface/transformers), which is to prefer copy-pasted code over hasty abstractions. This design principle is very opinionated and stands in stark contrast to popular design principles such as [Don't repeat yourself (DRY)](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself). +In short, just like Transformers does for modeling files, Diffusers prefers to keep an extremely low level of abstraction and very self-contained code for pipelines and schedulers. +Functions, long code blocks, and even classes can be copied across multiple files which at first can look like a bad, sloppy design choice that makes the library unmaintainable. 
+**However**, this design has proven to be extremely successful for Transformers and makes a lot of sense for community-driven, open-source machine learning libraries because: +- Machine Learning is an extremely fast-moving field in which paradigms, model architectures, and algorithms are changing rapidly, which therefore makes it very difficult to define long-lasting code abstractions. +- Machine Learning practitioners like to be able to quickly tweak existing code for ideation and research and therefore prefer self-contained code over one that contains many abstractions. +- Open-source libraries rely on community contributions and therefore must build a library that is easy to contribute to. The more abstract the code, the more dependencies, the harder to read, and the harder to contribute to. Contributors simply stop contributing to very abstract libraries out of fear of breaking vital functionality. If contributing to a library cannot break other fundamental code, not only is it more inviting for potential new contributors, but it is also easier to review and contribute to multiple parts in parallel. + +At Hugging Face, we call this design the **single-file policy** which means that almost all of the code of a certain class should be written in a single, self-contained file. To read more about the philosophy, you can have a look +at [this blog post](https://huggingface.co/blog/transformers-design-philosophy). + +In Diffusers, we follow this philosophy for both pipelines and schedulers, but only partly for diffusion models. The reason we don't follow this design fully for diffusion models is because almost all diffusion pipelines, such +as [DDPM](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/ddpm), [Stable Diffusion](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/overview#stable-diffusion-pipelines), [unCLIP (DALL·E 2)](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/unclip) and [Imagen](https://imagen.research.google/) all rely on the same diffusion model, the [UNet](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/unet2d-cond). + +Great, now you should have generally understood why 🧨 Diffusers is designed the way it is 🤗. +We try to apply these design principles consistently across the library. Nevertheless, there are some minor exceptions to the philosophy or some unlucky design choices. + +## Design Philosophy in Details + +Now, let's look a bit into the nitty-gritty details of the design philosophy. Diffusers essentially consists of three major classes: [pipelines](https://github.com/mindspore-lab/mindone/tree/master/mindone/diffusers/pipelines), [models](https://github.com/mindspore-lab/mindone/tree/master/mindone/diffusers/models), and [schedulers](https://github.com/mindspore-lab/mindone/tree/master/mindone/diffusers/schedulers). +Let's walk through more in-detail design decisions for each class. + +### Pipelines + +Pipelines are designed to be easy to use (therefore do not follow [*Simple over easy*](#simple-over-easy) 100%), are not feature complete, and should loosely be seen as examples of how to use [models](#models) and [schedulers](#schedulers) for inference. + +The following design principles are followed: +- Pipelines follow the single-file policy. All pipelines can be found in individual directories under `mindone/diffusers/pipelines`. One pipeline folder corresponds to one diffusion paper/project/release.
Multiple pipeline files can be gathered in one pipeline folder, as it's done for [`mindone/diffusers/pipelines/stable_diffusion`](https://github.com/mindspore-lab/mindone/tree/master/mindone/diffusers/pipelines/stable_diffusion). If pipelines share similar functionality, one can make use of the [#Copied from mechanism](https://github.com/mindspore-lab/mindone/blob/master/mindone/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl_img2img.py#L730). +- Pipelines all inherit from [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline). +- Every pipeline consists of different model and scheduler components that are documented in the [`model_index.json` file](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/blob/main/model_index.json), are accessible under the same name as attributes of the pipeline and can be shared between pipelines with the [`DiffusionPipeline.components`](https://github.com/mindspore-lab/mindone/blob/master/mindone/diffusers/pipelines/pipeline_utils.py#L1048) function. +- Every pipeline should be loadable via the [`from_pretrained`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline.from_pretrained) function. +- Pipelines should be used **only** for inference. +- Pipelines should be very readable, self-explanatory, and easy to tweak. +- Pipelines should be designed to build on top of each other and be easy to integrate into higher-level APIs. +- Pipelines are **not** intended to be feature-complete user interfaces. For feature-complete user interfaces one should rather have a look at [InvokeAI](https://github.com/invoke-ai/InvokeAI), [Diffuzers](https://github.com/abhishekkrthakur/diffuzers), and [lama-cleaner](https://github.com/Sanster/lama-cleaner). +- Every pipeline should have one and only one way to run it via a `__call__` method. The naming of the `__call__` arguments should be shared across all pipelines. +- Pipelines should be named after the task they are intended to solve. +- In almost all cases, novel diffusion pipelines shall be implemented in a new pipeline folder/file. + +### Models + +Models are designed as configurable toolboxes that are natural extensions of [MindSpore's Cell class](https://www.mindspore.cn/docs/en/master/api_python/nn/mindspore.nn.Cell.html). They only partly follow the **single-file policy**. + +The following design principles are followed: +- Models correspond to **a type of model architecture**. *E.g.* the [`UNet2DConditionModel`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/unet2d-cond/#mindone.diffusers.UNet2DConditionModel) class is used for all UNet variations that expect 2D image inputs and are conditioned on some context. +- All models can be found in [`mindone/diffusers/models`](https://github.com/mindspore-lab/mindone/tree/master/mindone/diffusers/models) and every model architecture shall be defined in its own file, e.g. [`unet_2d_condition.py`](https://github.com/mindspore-lab/mindone/tree/master/mindone/diffusers/models/unets/unet_2d_condition.py), [`transformer_2d.py`](https://github.com/mindspore-lab/mindone/tree/master/mindone/diffusers/models/transformers/transformer_2d.py), etc...
+- Models **do not** follow the single-file policy and should make use of smaller model building blocks, such as [`attention.py`](https://github.com/mindspore-lab/mindone/tree/master/mindone/diffusers/models/attention.py), [`resnet.py`](https://github.com/mindspore-lab/mindone/tree/master/mindone/diffusers/models/resnet.py), [`embeddings.py`](https://github.com/mindspore-lab/mindone/tree/master/mindone/diffusers/models/embeddings.py), etc... **Note**: This is in stark contrast to Transformers' modeling files and shows that models do not really follow the single-file policy. +- Models intend to expose complexity, just like MindSpore's `Cell` class, and give clear error messages. +- Models all inherit from `ModelMixin` and `ConfigMixin`. +- Models can be optimized for performance when it doesn’t demand major code changes, keeps backward compatibility, and gives significant memory or compute gain. +- Models should by default have the highest precision and lowest performance setting. +- To integrate new model checkpoints whose general architecture can be classified as an architecture that already exists in Diffusers, the existing model architecture shall be adapted to make it work with the new checkpoint. One should only create a new file if the model architecture is fundamentally different. +- Models should be designed to be easily extendable to future changes. This can be achieved by limiting public function arguments, configuration arguments, and "foreseeing" future changes, *e.g.* it is usually better to add `string` "...type" arguments that can easily be extended to new future types instead of boolean `is_..._type` arguments. Only the minimum amount of changes shall be made to existing architectures to make a new model checkpoint work. +- The model design is a difficult trade-off between keeping code readable and concise and supporting many model checkpoints. For most parts of the modeling code, classes shall be adapted for new model checkpoints, while there are some exceptions where it is preferred to add new classes to make sure the code is kept concise and +readable long-term, such as [UNet blocks](https://github.com/mindspore-lab/mindone/tree/master/mindone/diffusers/models/unets/unet_2d_blocks.py) and [Attention processors](https://github.com/mindspore-lab/mindone/tree/master/mindone/diffusers/models/attention_processor.py). + +### Schedulers + +Schedulers are responsible to guide the denoising process for inference as well as to define a noise schedule for training. They are designed as individual classes with loadable configuration files and strongly follow the **single-file policy**. + +The following design principles are followed: +- All schedulers are found in [`mindone/diffusers/schedulers`](https://github.com/mindspore-lab/mindone/tree/master/mindone/diffusers/schedulers). +- Schedulers are **not** allowed to import from large utils files and shall be kept very self-contained. +- One scheduler Python file corresponds to one scheduler algorithm (as might be defined in a paper). +- If schedulers share similar functionalities, we can make use of the `#Copied from` mechanism. +- Schedulers all inherit from `SchedulerMixin` and `ConfigMixin`. +- Schedulers can be easily swapped out with the [`ConfigMixin.from_config`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/configuration/#mindone.diffusers.configuration_utils.ConfigMixin.from_config) method as explained in detail [here](../using-diffusers/schedulers.md). 
+- Every scheduler has to have a `set_num_inference_steps`, and a `step` function. `set_num_inference_steps(...)` has to be called before every denoising process, *i.e.* before `step(...)` is called. +- Every scheduler exposes the timesteps to be "looped over" via a `timesteps` attribute, which is an array of timesteps the model will be called upon. +- The `step(...)` function takes a predicted model output and the "current" sample (x_t) and returns the "previous", slightly more denoised sample (x_t-1). +- Given the complexity of diffusion schedulers, the `step` function does not expose all the complexity and can be a bit of a "black box". +- In almost all cases, novel schedulers shall be implemented in a new scheduling file. diff --git a/docs/diffusers/index.md b/docs/diffusers/index.md index 8a8e9ef0af..b0a2a6a012 100644 --- a/docs/diffusers/index.md +++ b/docs/diffusers/index.md @@ -22,39 +22,39 @@ specific language governing permissions and limitations under the License. ???+ info - Due to differences in framework, some APIs will not be identical to [huggingface/diffusers](https://github.com/huggingface/diffusers) in the foreseeable future, see [Limitations](./limitations) for details. + Due to differences in framework, some APIs will not be identical to [huggingface/diffusers](https://github.com/huggingface/diffusers) in the foreseeable future, see [Limitations](./limitations.md) for details. -🤗 Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. Whether you're looking for a simple inference solution or want to train your own diffusion model, 🤗 Diffusers is a modular toolbox that supports both. Our library is designed with a focus on [usability over performance](conceptual/philosophy#usability-over-performance), [simple over easy](conceptual/philosophy#simple-over-easy), and [customizability over abstractions](conceptual/philosophy#tweakable-contributorfriendly-over-abstraction). +🤗 Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. Whether you're looking for a simple inference solution or want to train your own diffusion model, 🤗 Diffusers is a modular toolbox that supports both. Our library is designed with a focus on [usability over performance](conceptual/philosophy.md#usability-over-performance), [simple over easy](conceptual/philosophy.md#simple-over-easy), and [customizability over abstractions](conceptual/philosophy.md#tweakable-contributor-friendly-over-abstraction). The library has three main components: -- State-of-the-art diffusion pipelines for inference with just a few lines of code. There are many pipelines in 🤗 Diffusers, check out the table in the pipeline [overview](api/pipelines/overview) for a complete list of available pipelines and the task they solve. -- Interchangeable [noise schedulers](api/schedulers/overview) for balancing trade-offs between generation speed and quality. -- Pretrained [models](api/models) that can be used as building blocks, and combined with schedulers, for creating your own end-to-end diffusion systems. +- State-of-the-art diffusion pipelines for inference with just a few lines of code. There are many pipelines in 🤗 Diffusers, check out the table in the pipeline [overview](api/pipelines/overview.md) for a complete list of available pipelines and the task they solve. 
+- Interchangeable [noise schedulers](api/schedulers/overview.md) for balancing trade-offs between generation speed and quality. +- Pretrained [models](api/models.md) that can be used as building blocks, and combined with schedulers, for creating your own end-to-end diffusion systems.
-- __[Tutorials](./tutorials/tutorial_overview)__ +- __[Tutorials](./tutorials/tutorial_overview.md)__ --- Learn the fundamental skills you need to start generating outputs, build your own diffusion system, and train a diffusion model. We recommend starting here if you're using 🤗 Diffusers for the first time! -- __[How-to guides](./using-diffusers/loading_overview)__ +- __[How-to guides](./using-diffusers/loading_overview.md)__ --- Practical guides for helping you load pipelines, models, and schedulers. You'll also learn how to use pipelines for specific tasks, control how outputs are generated, optimize for inference speed, and different training techniques. -- __[Conceptual guides](./conceptual/philosophy)__ +- __[Conceptual guides](./conceptual/philosophy.md)__ --- Understand why the library was designed the way it was, and learn more about the ethical guidelines and safety implementations for using the library. -- __[Reference](./api/models/overview)__ +- __[Reference](./api/models/overview.md)__ --- diff --git a/docs/diffusers/installation.md b/docs/diffusers/installation.md index c32c624947..f266e2f1bc 100644 --- a/docs/diffusers/installation.md +++ b/docs/diffusers/installation.md @@ -95,7 +95,7 @@ Your Python environment will find the `main` version of 🤗 Diffusers on the ne ## Cache -Model weights and files are downloaded from the Hub to a cache which is usually your home directory. You can change the cache location by specifying the `HF_HOME` or `HUGGINFACE_HUB_CACHE` environment variables or configuring the `cache_dir` parameter in methods like [`~DiffusionPipeline.from_pretrained`]. +Model weights and files are downloaded from the Hub to a cache which is usually your home directory. You can change the cache location by specifying the `HF_HOME` or `HUGGINFACE_HUB_CACHE` environment variables or configuring the `cache_dir` parameter in methods like [`from_pretrained`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline.from_pretrained). Cached files allow you to run 🤗 Diffusers offline. To prevent 🤗 Diffusers from connecting to the internet, set the `HF_HUB_OFFLINE` environment variable to `True` and 🤗 Diffusers will only load previously downloaded files in the cache. @@ -107,7 +107,7 @@ For more details about managing and cleaning the cache, take a look at the [cach ## Telemetry logging -Our library gathers telemetry information during [`~DiffusionPipeline.from_pretrained`] requests. +Our library gathers telemetry information during [`from_pretrained`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline.from_pretrained) requests. The data gathered includes the version of 🤗 Diffusers and PyTorch/Flax, the requested model or pipeline class, and the path to a pretrained checkpoint if it is hosted on the Hugging Face Hub. This usage data helps us debug issues and prioritize new features. diff --git a/docs/diffusers/limitations.md b/docs/diffusers/limitations.md index 6a3a7fcedf..5416a6df36 100644 --- a/docs/diffusers/limitations.md +++ b/docs/diffusers/limitations.md @@ -70,65 +70,65 @@ whether they have support in Pynative fp16 mode, Graph fp16 mode, Pynative fp32 > precision issues of pipelines, the experiments in the table below default to upcasting GroupNorm to FP32 to avoid > this issue. 
-| **Pipelines** | **Pynative FP16** | **Pynative FP32** | **Graph FP16** | **Graph FP32** | **Description** | -|:------------------------------------------:|:------------------:|:------------------:|:------------------:|:------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------:| -| AnimateDiffPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| AnimateDiffVideoToVideoPipeline | :white_check_mark: | :x: | :white_check_mark: | :white_check_mark: | In FP32 and Pynative mode, this pipeline will run out of memory | -| BlipDiffusionPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| ConsistencyModelPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| DDIMPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| DDPMPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| DiTPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| I2VGenXLPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | ops.bmm and ops.softmax have precision issues under FP16, so we need to upcast them to FP32 to get a good result | -| IFImg2ImgPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| IFImg2ImgSuperResolutionPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| IFInpaintingPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| IFInpaintingSuperResolutionPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| IFPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| IFSuperResolutionPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| Kandinsky3Img2ImgPipeline | :x: | :x: | :x: | :x: | Kandinsky3 only provides FP16 weights; additionally, T5 has precision issues, so to achieve the desired results, you need to directly input prompt_embeds and attention_mask. | -| Kandinsky3Pipeline | :x: | :x: | :x: | :x: | Kandinsky3 only provides FP16 weights; additionally, T5 has precision issues, so to achieve the desired results, you need to directly input prompt_embeds and attention_mask. 
| -| KandinskyImg2ImgPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| KandinskyInpaintPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| KandinskyPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| KandinskyV22ControlnetImg2ImgPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| KandinskyV22ControlnetPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| KandinskyV22Img2ImgPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| KandinskyV22InpaintPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| KandinskyV22Pipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| LatentConsistencyModelImg2ImgPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| LatentConsistencyModelPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| LDMSuperResolutionPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| LDMTextToImagePipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| PixArtAlphaPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| ShapEImg2ImgPipeline | :white_check_mark: | :white_check_mark: | :x: | :x: | The syntax in Render only supports Pynative mode | -| ShapEPipeline | :white_check_mark: | :white_check_mark: | :x: | :x: | The syntax in Render only supports Pynative mode | -| StableCascadePipeline | :x: | :white_check_mark: | :x: | :white_check_mark: | This pipeline does not support FP16 due to precision issues | -| StableDiffusion3Pipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| StableDiffusionAdapterPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| StableDiffusionControlNetImg2ImgPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| StableDiffusionControlNetInpaintPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| StableDiffusionControlNetPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| StableDiffusionDepth2ImgPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| StableDiffusionDiffEditPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| StableDiffusionGLIGENPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| StableDiffusionGLIGENTextImagePipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| StableDiffusionImageVariationPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| StableDiffusionImg2ImgPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| StableDiffusionInpaintPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| StableDiffusionInstructPix2PixPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | 
:white_check_mark: | | -| StableDiffusionLatentUpscalePipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| StableDiffusionPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| StableDiffusionUpscalePipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| StableDiffusionXLAdapterPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| StableDiffusionXLControlNetImg2ImgPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| StableDiffusionXLControlNetInpaintPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| StableDiffusionXLControlNetPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| StableDiffusionXLImg2ImgPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| StableDiffusionXLInpaintPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| StableDiffusionXLInstructPix2PixPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| StableDiffusionXLPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| StableVideoDiffusionPipeline | :white_check_mark: | :x: | :white_check_mark: | :x: | This pipeline will run out of memory under FP32; ops.bmm and ops.softmax have precision issues under FP16, so we need to upcast them to FP32 to get a good result | -| UnCLIPImageVariationPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| UnCLIPPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | -| WuerstchenPipeline | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | GlobalResponseNorm has precision issue under FP16, so we need to upcast it to FP32 to get a good result | +| **Pipelines** | **Pynative FP16** | **Pynative FP32** | **Graph FP16** | **Graph FP32** | **Description** | +|:------------------------------------------:|:-----------------:|:-----------------:|:--------------:|:--------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------:| +| AnimateDiffPipeline | ✅ | ✅ | ✅ | ✅ | | +| AnimateDiffVideoToVideoPipeline | ✅ | ❌ | ✅ | ✅ | In FP32 and Pynative mode, this pipeline will run out of memory | +| BlipDiffusionPipeline | ✅ | ✅ | ✅ | ✅ | | +| ConsistencyModelPipeline | ✅ | ✅ | ✅ | ✅ | | +| DDIMPipeline | ✅ | ✅ | ✅ | ✅ | | +| DDPMPipeline | ✅ | ✅ | ✅ | ✅ | | +| DiTPipeline | ✅ | ✅ | ✅ | ✅ | | +| I2VGenXLPipeline | ✅ | ✅ | ✅ | ✅ | ops.bmm and ops.softmax have precision issues under FP16, so we need to upcast them to FP32 to get a good result | +| IFImg2ImgPipeline | ✅ | ✅ | ✅ | ✅ | | +| IFImg2ImgSuperResolutionPipeline | ✅ | ✅ | ✅ | ✅ | | +| IFInpaintingPipeline | ✅ | ✅ | ✅ | ✅ | | +| IFInpaintingSuperResolutionPipeline | ✅ | ✅ | ✅ | ✅ | | +| IFPipeline | ✅ | ✅ | ✅ | ✅ | | +| IFSuperResolutionPipeline | ✅ | ✅ | ✅ | ✅ | | +| Kandinsky3Img2ImgPipeline | ❌ | ❌ | ❌ | ❌ | Kandinsky3 only provides FP16 weights; additionally, T5 has precision issues, so to achieve the desired results, you need to directly input prompt_embeds and attention_mask. 
| +| Kandinsky3Pipeline | ❌ | ❌ | ❌ | ❌ | Kandinsky3 only provides FP16 weights; additionally, T5 has precision issues, so to achieve the desired results, you need to directly input prompt_embeds and attention_mask. | +| KandinskyImg2ImgPipeline | ✅ | ✅ | ✅ | ✅ | | +| KandinskyInpaintPipeline | ✅ | ✅ | ✅ | ✅ | | +| KandinskyPipeline | ✅ | ✅ | ✅ | ✅ | | +| KandinskyV22ControlnetImg2ImgPipeline | ✅ | ✅ | ✅ | ✅ | | +| KandinskyV22ControlnetPipeline | ✅ | ✅ | ✅ | ✅ | | +| KandinskyV22Img2ImgPipeline | ✅ | ✅ | ✅ | ✅ | | +| KandinskyV22InpaintPipeline | ✅ | ✅ | ✅ | ✅ | | +| KandinskyV22Pipeline | ✅ | ✅ | ✅ | ✅ | | +| LatentConsistencyModelImg2ImgPipeline | ✅ | ✅ | ✅ | ✅ | | +| LatentConsistencyModelPipeline | ✅ | ✅ | ✅ | ✅ | | +| LDMSuperResolutionPipeline | ✅ | ✅ | ✅ | ✅ | | +| LDMTextToImagePipeline | ✅ | ✅ | ✅ | ✅ | | +| PixArtAlphaPipeline | ✅ | ✅ | ✅ | ✅ | | +| ShapEImg2ImgPipeline | ✅ | ✅ | ❌ | ❌ | The syntax in Render only supports Pynative mode | +| ShapEPipeline | ✅ | ✅ | ❌ | ❌ | The syntax in Render only supports Pynative mode | +| StableCascadePipeline | ❌ | ✅ | ❌ | ✅ | This pipeline does not support FP16 due to precision issues | +| StableDiffusion3Pipeline | ✅ | ✅ | ✅ | ✅ | | +| StableDiffusionAdapterPipeline | ✅ | ✅ | ✅ | ✅ | | +| StableDiffusionControlNetImg2ImgPipeline | ✅ | ✅ | ✅ | ✅ | | +| StableDiffusionControlNetInpaintPipeline | ✅ | ✅ | ✅ | ✅ | | +| StableDiffusionControlNetPipeline | ✅ | ✅ | ✅ | ✅ | | +| StableDiffusionDepth2ImgPipeline | ✅ | ✅ | ✅ | ✅ | | +| StableDiffusionDiffEditPipeline | ✅ | ✅ | ✅ | ✅ | | +| StableDiffusionGLIGENPipeline | ✅ | ✅ | ✅ | ✅ | | +| StableDiffusionGLIGENTextImagePipeline | ✅ | ✅ | ✅ | ✅ | | +| StableDiffusionImageVariationPipeline | ✅ | ✅ | ✅ | ✅ | | +| StableDiffusionImg2ImgPipeline | ✅ | ✅ | ✅ | ✅ | | +| StableDiffusionInpaintPipeline | ✅ | ✅ | ✅ | ✅ | | +| StableDiffusionInstructPix2PixPipeline | ✅ | ✅ | ✅ | ✅ | | +| StableDiffusionLatentUpscalePipeline | ✅ | ✅ | ✅ | ✅ | | +| StableDiffusionPipeline | ✅ | ✅ | ✅ | ✅ | | +| StableDiffusionUpscalePipeline | ✅ | ✅ | ✅ | ✅ | | +| StableDiffusionXLAdapterPipeline | ✅ | ✅ | ✅ | ✅ | | +| StableDiffusionXLControlNetImg2ImgPipeline | ✅ | ✅ | ✅ | ✅ | | +| StableDiffusionXLControlNetInpaintPipeline | ✅ | ✅ | ✅ | ✅ | | +| StableDiffusionXLControlNetPipeline | ✅ | ✅ | ✅ | ✅ | | +| StableDiffusionXLImg2ImgPipeline | ✅ | ✅ | ✅ | ✅ | | +| StableDiffusionXLInpaintPipeline | ✅ | ✅ | ✅ | ✅ | | +| StableDiffusionXLInstructPix2PixPipeline | ✅ | ✅ | ✅ | ✅ | | +| StableDiffusionXLPipeline | ✅ | ✅ | ✅ | ✅ | | +| StableVideoDiffusionPipeline | ✅ | ❌ | ✅ | ❌ | This pipeline will run out of memory under FP32; ops.bmm and ops.softmax have precision issues under FP16, so we need to upcast them to FP32 to get a good result | +| UnCLIPImageVariationPipeline | ✅ | ✅ | ✅ | ✅ | | +| UnCLIPPipeline | ✅ | ✅ | ✅ | ✅ | | +| WuerstchenPipeline | ✅ | ✅ | ✅ | ✅ | GlobalResponseNorm has precision issue under FP16, so we need to upcast it to FP32 to get a good result | diff --git a/docs/diffusers/optimization/fp16.md b/docs/diffusers/optimization/fp16.md new file mode 100644 index 0000000000..a16a570ab8 --- /dev/null +++ b/docs/diffusers/optimization/fp16.md @@ -0,0 +1,125 @@ + + +# Speed up inference + +There are several ways to optimize Diffusers for inference speed, such as reducing the computational burden by lowering the data precision or using a lightweight distilled model. 
There are also memory-efficient attention implementations, like [xFormers](xformers.md), that reduce memory usage which also indirectly speeds up inference. Different speed optimizations can be stacked together to get the fastest inference times. + +!!! tip + + Optimizing for inference speed or reduced memory usage can lead to improved performance in the other category, so you should try to optimize for both whenever you can. This guide focuses on inference speed, but you can learn more about lowering memory usage in the [Reduce memory usage](memory.md) guide. + +The inference times below are obtained from generating a single 512x512 image from the prompt "a photo of an astronaut riding a horse on mars" with 50 DDIM steps on a Ascend 910B in Graph mode. + +| setup | latency | speed-up | +|----------|---------|----------| +| baseline | 5.64s | x1 | +| fp16 | 4.03s | x1.40 | + +## Half-precision weights + +To save Ascend memory and get more speed, set `mindspore_dtype=ms.float16` to load and run the model weights directly with half-precision weights. + +```Python +import mindspore as ms +from mindone.diffusers import DiffusionPipeline + +pipe = DiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", + mindspore_dtype=ms.float16, + use_safetensors=True, +) +``` + +## Distilled model + +You could also use a distilled Stable Diffusion model and autoencoder to speed up inference. During distillation, many of the UNet's residual and attention blocks are shed to reduce the model size by 51% and improve latency by 43%. The distilled model is faster and uses less memory while generating images of comparable quality to the full Stable Diffusion model. + +!!! tip + + Read the [Open-sourcing Knowledge Distillation Code and Weights of SD-Small and SD-Tiny](https://huggingface.co/blog/sd_distillation) blog post to learn more about how knowledge distillation training works to produce a faster, smaller, and cheaper generative model. + +The inference times below are obtained from generating 4 images from the prompt "a photo of an astronaut riding a horse on mars" with 25 PNDM steps on a Ascend 910B. Each generation is repeated 3 times with the distilled Stable Diffusion v1.4 model by [Nota AI](https://hf.co/nota-ai). + +| setup | latency | speed-up | +|------------------------------|---------|----------| +| baseline | 5.89s | x1 | +| distilled | 3.82s | x1.54 | +| distilled + tiny autoencoder | 3.77s | x1.56 | + +Let's load the distilled Stable Diffusion model and compare it against the original Stable Diffusion model. + +```py +from mindone.diffusers import StableDiffusionPipeline +from mindone.diffusers.utils import make_image_grid +import mindspore as ms +import numpy as np + +distilled = StableDiffusionPipeline.from_pretrained( + "nota-ai/bk-sdm-small", mindspore_dtype=ms.float16, use_safetensors=True, +) +prompt = "a golden vase with different flowers" +generator = [np.random.Generator(np.random.PCG64(i)) for i in range(4)] +images = distilled( + "a golden vase with different flowers", + num_inference_steps=25, + generator=generator, + num_images_per_prompt=4 +)[0] +make_image_grid(images, rows=2, cols=2) +``` + +
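+
+The figure below compares this against the original checkpoint. To reproduce the baseline side yourself, the full-size model can be run the same way; a sketch reusing the prompt and seeds from above (assuming the `stable-diffusion-v1-5` checkpoint from the half-precision example):
+
+```py
+import mindspore as ms
+import numpy as np
+from mindone.diffusers import StableDiffusionPipeline
+from mindone.diffusers.utils import make_image_grid
+
+# baseline: the original (non-distilled) Stable Diffusion checkpoint in half precision
+original = StableDiffusionPipeline.from_pretrained(
+    "stable-diffusion-v1-5/stable-diffusion-v1-5", mindspore_dtype=ms.float16, use_safetensors=True,
+)
+generator = [np.random.Generator(np.random.PCG64(i)) for i in range(4)]
+images = original(
+    "a golden vase with different flowers",
+    num_inference_steps=25,
+    generator=generator,
+    num_images_per_prompt=4,
+)[0]
+make_image_grid(images, rows=2, cols=2)
+```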
+[Figure: original Stable Diffusion vs. distilled Stable Diffusion]
+ +### Tiny AutoEncoder + +To speed inference up even more, replace the autoencoder with a [distilled version](https://huggingface.co/sayakpaul/taesdxl-diffusers) of it. + +```py +import mindspore as ms +from mindone.diffusers import AutoencoderTiny, StableDiffusionPipeline +from mindone.diffusers.utils import make_image_grid +import numpy as np + +distilled = StableDiffusionPipeline.from_pretrained( + "nota-ai/bk-sdm-small", mindspore_dtype=ms.float16, use_safetensors=True, +) +distilled.vae = AutoencoderTiny.from_pretrained( + "madebyollin/taesd", mindspore_dtype=ms.float16, use_safetensors=True, +) + +prompt = "a golden vase with different flowers" +generator = [np.random.Generator(np.random.PCG64(i)) for i in range(4)] +images = distilled( + "a golden vase with different flowers", + num_inference_steps=25, + generator=generator, + num_images_per_prompt=4 +)[0] +make_image_grid(images, rows=2, cols=2) +``` + +
+[Figure: distilled Stable Diffusion + Tiny AutoEncoder]
diff --git a/docs/diffusers/optimization/memory.md b/docs/diffusers/optimization/memory.md new file mode 100644 index 0000000000..9f1e8a6023 --- /dev/null +++ b/docs/diffusers/optimization/memory.md @@ -0,0 +1,43 @@ + + +# Reduce memory usage + +A barrier to using diffusion models is the large amount of memory required. To overcome this challenge, there are several memory-reducing techniques you can use to run even some of the largest models on Ascend. Some of these techniques can even be combined to further reduce memory usage. + +!!! tip + + In many cases, optimizing for memory or speed leads to improved performance in the other, so you should try to optimize for both whenever you can. This guide focuses on minimizing memory usage, but you can also learn more about how to [Speed up inference](fp16.md). + +## Memory-efficient attention + +Recent work on optimizing bandwidth in the attention block has generated huge speed-ups and reductions in memory usage. The most recent type of memory-efficient attention is [Flash Attention](https://arxiv.org/abs/2205.14135) (you can check out the original code at [HazyResearch/flash-attention](https://github.com/HazyResearch/flash-attention)). + +Now call [`enable_xformers_memory_efficient_attention`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/overview/#mindone.diffusers.ModelMixin.disable_xformers_memory_efficient_attention) on the pipeline: + +```python +from mindone.diffusers import DiffusionPipeline +import mindspore as ms + +pipe = DiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", + mindspore_dtype=ms.float16, + use_safetensors=True, +) + +pipe.enable_xformers_memory_efficient_attention() + +sample = pipe("a small cat") + +# optional: You can disable it via +# pipe.disable_xformers_memory_efficient_attention() +``` diff --git a/docs/diffusers/optimization/xformers.md b/docs/diffusers/optimization/xformers.md new file mode 100644 index 0000000000..a72fcc2d96 --- /dev/null +++ b/docs/diffusers/optimization/xformers.md @@ -0,0 +1,21 @@ + + +# xFormers + +!!! warning + + ⚠️ MindONE currently only support `AttnProcessor`, Please check if only `AttnProcessor` is used in the pipeline before using `enable_xformers_memory_efficient_attention()`. + +We recommend [xFormers](https://github.com/facebookresearch/xformers) for both inference and training. In our tests, the optimizations performed in the attention blocks allow for both faster speed and reduced memory consumption. + +You can use `enable_xformers_memory_efficient_attention()` for faster inference and reduced memory consumption as shown in this [section](memory.md#memory-efficient-attention). diff --git a/docs/diffusers/quicktour.md b/docs/diffusers/quicktour.md index 95527f1848..d566eac66e 100644 --- a/docs/diffusers/quicktour.md +++ b/docs/diffusers/quicktour.md @@ -16,11 +16,11 @@ Diffusion models are trained to denoise random Gaussian noise step-by-step to ge Whether you're a developer or an everyday user, this quicktour will introduce you to 🧨 Diffusers and help you get up and generating quickly! There are three main components of the library to know about: -* The [`DiffusionPipeline`] is a high-level end-to-end class designed to rapidly generate samples from pretrained diffusion models for inference. -* Popular pretrained [model](./api/models) architectures and modules that can be used as building blocks for creating diffusion systems. 
-* Many different [schedulers](./api/schedulers/overview) - algorithms that control how noise is added for training, and how to generate denoised images during inference. +* The [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline) is a high-level end-to-end class designed to rapidly generate samples from pretrained diffusion models for inference. +* Popular pretrained [model](./api/models/overview.md) architectures and modules that can be used as building blocks for creating diffusion systems. +* Many different [schedulers](./api/schedulers/overview.md) - algorithms that control how noise is added for training, and how to generate denoised images during inference. -The quicktour will show you how to use the [`DiffusionPipeline`] for inference, and then walk you through how to combine a model and scheduler to replicate what's happening inside the [`DiffusionPipeline`]. +The quicktour will show you how to use the [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline) for inference, and then walk you through how to combine a model and scheduler to replicate what's happening inside the [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline). !!! tip @@ -31,51 +31,56 @@ Before you begin, make sure you have all the necessary libraries installed: ```py # uncomment to install the necessary libraries in Colab -#!pip install --upgrade mindone +#!pip install --upgrade mindone transformers ``` -- [🤗 Accelerate](../accelerate/index) speeds up model loading for inference and training. -- [🤗 Transformers](../transformers/index) is required to run the most popular diffusion models, such as [Stable Diffusion](./api/pipelines/stable_diffusion/overview). +- [🤗 Transformers](../transformers/index.md) is required to run the most popular diffusion models, such as [Stable Diffusion](./api/pipelines/stable_diffusion/overview.md). ## DiffusionPipeline -The [`DiffusionPipeline`] is the easiest way to use a pretrained diffusion system for inference. It is an end-to-end system containing the model and the scheduler. You can use the [`DiffusionPipeline`] out-of-the-box for many tasks. Take a look at the table below for some supported tasks, and for a complete list of supported tasks, check out the [🧨 Diffusers Summary](./api/pipelines/overview#diffusers-summary) table. +The [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline) is the easiest way to use a pretrained diffusion system for inference. It is an end-to-end system containing the model and the scheduler. You can use the [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline) out-of-the-box for many tasks. Take a look at the table below for some supported tasks, and for a complete list of supported tasks, check out the [🧨 Diffusers Summary](./api/pipelines/overview.md#diffusers-summary) table. 
-| **Task** | **Description** | **Pipeline** | -|----------------------------------------|-------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------| -| Unconditional Image Generation | generate an image from Gaussian noise | [unconditional_image_generation](./using-diffusers/unconditional_image_generation) | -| Text-Guided Image Generation | generate an image given a text prompt | [conditional_image_generation](./using-diffusers/conditional_image_generation) | -| Text-Guided Image-to-Image Translation | adapt an image guided by a text prompt | [img2img](./using-diffusers/img2img) | -| Text-Guided Image-Inpainting | fill the masked part of an image given the image, the mask and a text prompt | [inpaint](./using-diffusers/inpaint) | -| Text-Guided Depth-to-Image Translation | adapt parts of an image guided by a text prompt while preserving structure via depth estimation | [depth2img](./using-diffusers/depth2img) | +| **Task** | **Description** | **Pipeline** | +|----------------------------------------|-------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------| +| Unconditional Image Generation | generate an image from Gaussian noise | [unconditional_image_generation](./using-diffusers/unconditional_image_generation.md) | +| Text-Guided Image Generation | generate an image given a text prompt | [conditional_image_generation](./using-diffusers/conditional_image_generation.md) | +| Text-Guided Image-to-Image Translation | adapt an image guided by a text prompt | [img2img](./using-diffusers/img2img.md) | +| Text-Guided Image-Inpainting | fill the masked part of an image given the image, the mask and a text prompt | [inpaint](./using-diffusers/inpaint.md) | +| Text-Guided Depth-to-Image Translation | adapt parts of an image guided by a text prompt while preserving structure via depth estimation | [depth2img](./using-diffusers/depth2img.md) | -Start by creating an instance of a [`DiffusionPipeline`] and specify which pipeline checkpoint you would like to download. -You can use the [`DiffusionPipeline`] for any [checkpoint](https://huggingface.co/models?library=diffusers&sort=downloads) stored on the Hugging Face Hub. -In this quicktour, you'll load the [`stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) checkpoint for text-to-image generation. +Start by creating an instance of a [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline) and specify which pipeline checkpoint you would like to download. +You can use the [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline) for any [checkpoint](https://huggingface.co/models?library=diffusers&sort=downloads) stored on the Hugging Face Hub. +In this quicktour, you'll load the [`stable-diffusion-v1-5`](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) checkpoint for text-to-image generation. !!! warning - For [Stable Diffusion](https://huggingface.co/CompVis/stable-diffusion) models, please carefully read the [license](https://huggingface.co/spaces/CompVis/stable-diffusion-license) first before running the model. 
🧨 Diffusers implements a [`safety_checker`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/safety_checker.py) to prevent offensive or harmful content, but the model's improved image generation capabilities can still produce potentially harmful content. + For [Stable Diffusion](https://huggingface.co/CompVis/stable-diffusion) models, please carefully read the [license](https://huggingface.co/spaces/CompVis/stable-diffusion-license) first before running the model. 🧨 Diffusers implements a [`safety_checker`](https://github.com/The-truthh/mindone/blob/docs/mindone/diffusers/pipelines/stable_diffusion/safety_checker.py) to prevent offensive or harmful content, but the model's improved image generation capabilities can still produce potentially harmful content. -Load the model with the [`~DiffusionPipeline.from_pretrained`] method: +Load the model with the [`from_pretrained`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline.from_pretrained) method: + +!!! tip + + MindONE.diffusers currently does not support loading `.bin` files, if the models in the [Hub](https://huggingface.co/models) consist solely of `.bin` files, please refer to the [`tutorial`](using-diffusers/other-formats.md#bin-files) + + If the connection error occurs while loading the weights, try configuring the [`HF_ENDPOINT`](https://huggingface.co/docs/huggingface_hub/v0.13.2/package_reference/environment_variables#hfendpoint) environment variable to switch to an alternative mirror. ```diff -- >>> from diffusers import DiffusionPipeline -+ >>> from mindone.diffusers import DiffusionPipeline - - >>> pipeline = DiffusionPipeline.from_pretrained( - ... "runwayml/stable-diffusion-v1-5", -- ... torch_dtype=torch.float32, -+ ... mindspore_dtype=mindspore.float32 - ... use_safetensors=True - ... ) +- from diffusers import DiffusionPipeline ++ from mindone.diffusers import DiffusionPipeline + + pipeline = DiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", +- torch_dtype=torch.float32, ++ mindspore_dtype=mindspore.float32, + use_safetensors=True + ) ``` -The [`DiffusionPipeline`] downloads and caches all modeling, tokenization, and scheduling components. You'll see that the Stable Diffusion pipeline is composed of the [`UNet2DConditionModel`] and [`PNDMScheduler`] among other things: +The [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline) downloads and caches all modeling, tokenization, and scheduling components. You'll see that the Stable Diffusion pipeline is composed of the [`UNet2DConditionModel`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/unet2d-cond/#mindone.diffusers.UNet2DConditionModel) and [`PNDMScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/pndm/#mindone.diffusers.PNDMScheduler) among other things: -```pycon ->>> pipeline +```python +pipeline StableDiffusionPipeline { "_class_name": "StableDiffusionPipeline", "_diffusers_version": "0.21.4", @@ -96,27 +101,27 @@ StableDiffusionPipeline { } ``` -We strongly recommend running the pipeline on a GPU because the model consists of roughly 1.4 billion parameters. You can't move the generator object to a GPU **manually**, because MindSpore implicitly does that. 
Do **NOT** invoke `to("cuda")`: +We strongly recommend running the pipeline on a NPU because the model consists of roughly 1.4 billion parameters. You can't move the generator object to a NPU **manually**, because MindSpore implicitly does that. Do **NOT** invoke `to("cuda")`: ```diff -- >>> pipeline.to("cuda") +- pipeline.to("cuda") ``` Now you can pass a text prompt to the `pipeline` to generate an image, and then access the denoised image. By default, the image output is wrapped in a [`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html?highlight=image#the-image-class) object. -```pycon ->>> image = pipeline("An image of a squirrel in Picasso style")[0][0] ->>> image +```python +image = pipeline("An image of a squirrel in Picasso style")[0][0] +image ``` -
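+
+The `[0][0]` indexing above unpacks the pipeline output: the first index selects the list of generated images and the second selects the first image. A slightly more explicit version of the same call (variable names are illustrative):
+
+```python
+outputs = pipeline("An image of a squirrel in Picasso style")
+images = outputs[0]  # the list of generated PIL.Image objects
+image = images[0]    # the first (and only) image
+```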
Save the image by calling `save`: -```pycon ->>> image.save("image_of_squirrel_painting.png") +```python +image.save("image_of_squirrel_painting.png") ``` ### Local pipeline @@ -125,49 +130,49 @@ You can also use the pipeline locally. The only difference is you need to downlo ```bash !git lfs install -!git clone https://huggingface.co/runwayml/stable-diffusion-v1-5 +!git clone https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5 ``` Then load the saved weights into the pipeline: -```pycon ->>> pipeline = DiffusionPipeline.from_pretrained("./stable-diffusion-v1-5", use_safetensors=True) +```python +pipeline = DiffusionPipeline.from_pretrained("./stable-diffusion-v1-5", use_safetensors=True) ``` Now, you can run the pipeline as you would in the section above. ### Swapping schedulers -Different schedulers come with different denoising speeds and quality trade-offs. The best way to find out which one works best for you is to try them out! One of the main features of 🧨 Diffusers is to allow you to easily switch between schedulers. For example, to replace the default [`PNDMScheduler`] with the [`EulerDiscreteScheduler`], load it with the [`~diffusers.ConfigMixin.from_config`] method: +Different schedulers come with different denoising speeds and quality trade-offs. The best way to find out which one works best for you is to try them out! One of the main features of 🧨 Diffusers is to allow you to easily switch between schedulers. For example, to replace the default [`PNDMScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/pndm/#mindone.diffusers.PNDMScheduler) with the [`EulerDiscreteScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/euler/#mindone.diffusers.EulerDiscreteScheduler), load it with the [`from_config`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/configuration/#mindone.diffusers.configuration_utils.ConfigMixin.from_config) method: -```pycon ->>> from mindone.diffusers import EulerDiscreteScheduler +```python +from mindone.diffusers import EulerDiscreteScheduler ->>> pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True) ->>> pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config) +pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True) +pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config) ``` Try generating an image with the new scheduler and see if you notice a difference! -In the next section, you'll take a closer look at the components - the model and scheduler - that make up the [`DiffusionPipeline`] and learn how to use these components to generate an image of a cat. +In the next section, you'll take a closer look at the components - the model and scheduler - that make up the [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline) and learn how to use these components to generate an image of a cat. ## Models -Most models take a noisy sample, and at each timestep it predicts the *noise residual* (other models learn to predict the previous sample directly or the velocity or [`v-prediction`](https://github.com/huggingface/diffusers/blob/5e5ce13e2f89ac45a0066cb3f369462a3cf1d9ef/src/diffusers/schedulers/scheduling_ddim.py#L110)), the difference between a less noisy image and the input image. 
You can mix and match models to create other diffusion systems. +Most models take a noisy sample, and at each timestep it predicts the *noise residual* (other models learn to predict the previous sample directly or the velocity or [`v-prediction`](https://github.com/The-truthh/mindone/blob/docs/mindone/diffusers/schedulers/scheduling_ddim.py#L162)), the difference between a less noisy image and the input image. You can mix and match models to create other diffusion systems. -Models are initiated with the [`~ModelMixin.from_pretrained`] method which also locally caches the model weights so it is faster the next time you load the model. For the quicktour, you'll load the [`UNet2DModel`], a basic unconditional image generation model with a checkpoint trained on cat images: +Models are initiated with the [`from_pretrained`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/overview/#mindone.diffusers.ModelMixin.from_pretrained) method which also locally caches the model weights so it is faster the next time you load the model. For the quicktour, you'll load the [`UNet2DModel`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/unet2d/#mindone.diffusers.UNet2DModel), a basic unconditional image generation model with a checkpoint trained on cat images: -```pycon ->>> from mindone.diffusers import UNet2DModel +```python +from mindone.diffusers import UNet2DModel ->>> repo_id = "google/ddpm-cat-256" ->>> model = UNet2DModel.from_pretrained(repo_id, use_safetensors=True) +repo_id = "google/ddpm-cat-256" +model = UNet2DModel.from_pretrained(repo_id, use_safetensors=True) ``` To access the model parameters, call `model.config`: -```pycon ->>> model.config +```python +model.config ``` The model configuration is a 🧊 frozen 🧊 dictionary, which means those parameters can't be changed after the model is created. This is intentional and ensures that the parameters used to define the model architecture at the start remain the same, while other parameters can still be adjusted during inference. @@ -182,18 +187,18 @@ Some of the most important parameters are: To use the model for inference, create the image shape with random Gaussian noise. It should have a `batch` axis because the model can receive multiple random noises, a `channel` axis corresponding to the number of input channels, and a `sample_size` axis for the height and width of the image: -```pycon ->>> import mindspore +```python +import mindspore ->>> noisy_sample = mindspore.ops.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size) ->>> noisy_sample.shape +noisy_sample = mindspore.ops.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size) +noisy_sample.shape [1, 3, 256, 256] ``` For inference, pass the noisy image and a `timestep` to the model. The `timestep` indicates how noisy the input image is, with more noise at the beginning and less at the end. This helps the model determine its position in the diffusion process, whether it is closer to the start or the end. Use the `sample` method to get the model output: -```pycon ->>> noisy_residual = model(sample=noisy_sample, timestep=2)[0] +```python +noisy_residual = model(sample=noisy_sample, timestep=2)[0] ``` To generate actual examples though, you'll need a scheduler to guide the denoising process. In the next section, you'll learn how to couple a model with a scheduler. @@ -204,16 +209,16 @@ Schedulers manage going from a noisy sample to a less noisy sample given the mod !!! 
tip - 🧨 Diffusers is a toolbox for building diffusion systems. While the [`DiffusionPipeline`] is a convenient way to get started with a pre-built diffusion system, you can also choose your own model and scheduler components separately to build a custom diffusion system. + 🧨 Diffusers is a toolbox for building diffusion systems. While the [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline) is a convenient way to get started with a pre-built diffusion system, you can also choose your own model and scheduler components separately to build a custom diffusion system. -For the quicktour, you'll instantiate the [`DDPMScheduler`] with its [`~diffusers.ConfigMixin.from_config`] method: +For the quicktour, you'll instantiate the [`DDPMScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/ddpm/#mindone.diffusers.DDPMScheduler) with its [`from_config`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/configuration/#mindone.diffusers.configuration_utils.ConfigMixin.from_config) method: -```pycon ->>> from mindone.diffusers import DDPMScheduler +```python +from mindone.diffusers import DDPMScheduler ->>> scheduler = DDPMScheduler.from_pretrained(repo_id) ->>> scheduler +scheduler = DDPMScheduler.from_pretrained(repo_id) +scheduler DDPMScheduler { "_class_name": "DDPMScheduler", "_diffusers_version": "0.21.4", @@ -245,11 +250,11 @@ Some of the most important parameters are: * `beta_schedule`: the type of noise schedule to use for inference and training. * `beta_start` and `beta_end`: the start and end noise values for the noise schedule. -To predict a slightly less noisy image, pass the following to the scheduler's [`~diffusers.DDPMScheduler.step`] method: model output, `timestep`, and current `sample`. +To predict a slightly less noisy image, pass the following to the scheduler's [`step`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/ddpm/#mindone.diffusers.DDPMScheduler.step) method: model output, `timestep`, and current `sample`. -```pycon ->>> less_noisy_sample = scheduler.step(model_output=noisy_residual, timestep=2, sample=noisy_sample)[0] ->>> less_noisy_sample.shape +```python +less_noisy_sample = scheduler.step(model_output=noisy_residual, timestep=2, sample=noisy_sample)[0] +less_noisy_sample.shape [1, 3, 256, 256] ``` @@ -257,52 +262,52 @@ The `less_noisy_sample` can be passed to the next `timestep` where it'll get eve First, create a function that postprocesses and displays the denoised image as a `PIL.Image`: -```pycon ->>> import PIL.Image ->>> import numpy as np +```python +import PIL.Image +import numpy as np + +def display_sample(sample, i): + image_processed = sample.permute(0, 2, 3, 1) + image_processed = (image_processed + 1.0) * 127.5 + image_processed = image_processed.numpy().astype(np.uint8) ->>> def display_sample(sample, i): -... image_processed = sample.permute(0, 2, 3, 1) -... image_processed = (image_processed + 1.0) * 127.5 -... image_processed = image_processed.numpy().astype(np.uint8) -... -... image_pil = PIL.Image.fromarray(image_processed[0]) -... display(f"Image at step {i}") -... 
display(image_pil) + image_pil = PIL.Image.fromarray(image_processed[0]) + display(f"Image at step {i}") + display(image_pil) ``` Now create a denoising loop that predicts the residual of the less noisy sample, and computes the less noisy sample with the scheduler: -```pycon ->>> import tqdm - ->>> sample = noisy_sample - ->>> for i, t in enumerate(tqdm.tqdm(scheduler.timesteps)): -... # 1. predict noise residual -... residual = model(sample, t)[0] -... -... # 2. compute less noisy image and set x_t -> x_t-1 -... sample = scheduler.step(residual, t, sample)[0] -... -... # 3. optionally look at image -... if (i + 1) % 50 == 0: -... display_sample(sample, i + 1) +```python +import tqdm + +sample = noisy_sample + +for i, t in enumerate(tqdm.tqdm(scheduler.timesteps)): + # 1. predict noise residual + residual = model(sample, t)[0] + + # 2. compute less noisy image and set x_t -> x_t-1 + sample = scheduler.step(residual, t, sample)[0] + + # 3. optionally look at image + if (i + 1) % 50 == 0: + display_sample(sample, i + 1) ``` Sit back and watch as a cat is generated from nothing but noise! 😻 -
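+
+If you want to keep the final result rather than only display it, the post-processing from `display_sample` can be reused on the last `sample` (a sketch; the file name is illustrative):
+
+```python
+# convert the final denoised sample to a PIL image and save it,
+# using the same post-processing as display_sample above
+final = sample.permute(0, 2, 3, 1)
+final = (final + 1.0) * 127.5
+final = final.numpy().astype(np.uint8)
+PIL.Image.fromarray(final[0]).save("cat.png")
+```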
## Next steps Hopefully, you generated some cool images with 🧨 Diffusers in this quicktour! For your next steps, you can: -* Train or finetune a model to generate your own images in the [training](./tutorials/basic_training) tutorial. -* See example official and community [training or finetuning scripts](https://github.com/huggingface/diffusers/tree/main/examples#-diffusers-examples) for a variety of use cases. -* Learn more about loading, accessing, changing, and comparing schedulers in the [Using different Schedulers](./using-diffusers/schedulers) guide. -* Explore prompt engineering, speed and memory optimizations, and tips and tricks for generating higher-quality images with the [Stable Diffusion](./stable_diffusion) guide. -* Dive deeper into speeding up 🧨 Diffusers with guides on [optimized MindSpore on a NPU](./optimization/fp16). +* Train or finetune a model to generate your own images in the [training](./tutorials/basic_training.md) tutorial. +* See example official and community [training or finetuning scripts](https://github.com/The-truthh/mindone/tree/docs/examples/diffusers) for a variety of use cases. +* Learn more about loading, accessing, changing, and comparing schedulers in the [Using different Schedulers](./using-diffusers/schedulers.md) guide. +* Explore prompt engineering, speed and memory optimizations, and tips and tricks for generating higher-quality images with the [Stable Diffusion](./stable_diffusion.md) guide. +* Dive deeper into speeding up 🧨 Diffusers with guides on [optimized MindSpore on a NPU](./optimization/fp16.md). diff --git a/docs/diffusers/stable_diffusion.md b/docs/diffusers/stable_diffusion.md index 0e451bae88..5cd9e9dcdb 100644 --- a/docs/diffusers/stable_diffusion.md +++ b/docs/diffusers/stable_diffusion.md @@ -12,18 +12,18 @@ specific language governing permissions and limitations under the License. # Effective and efficient diffusion -Getting the [`DiffusionPipeline`] to generate images in a certain style or include what you want can be tricky. Often times, you have to run the [`DiffusionPipeline`] several times before you end up with an image you're happy with. But generating something out of nothing is a computationally intensive process, especially if you're running inference over and over again. +Getting the [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline) to generate images in a certain style or include what you want can be tricky. Often times, you have to run the [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline) several times before you end up with an image you're happy with. But generating something out of nothing is a computationally intensive process, especially if you're running inference over and over again. -This is why it's important to get the most *computational* (speed) and *memory* (GPU vRAM) efficiency from the pipeline to reduce the time between inference cycles so you can iterate faster. +This is why it's important to get the most *computational* (speed) and *memory* (NPU vRAM) efficiency from the pipeline to reduce the time between inference cycles so you can iterate faster. -This tutorial walks you through how to generate faster and better with the [`DiffusionPipeline`]. 
+This tutorial walks you through how to generate faster and better with the [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline).  -Begin by loading the [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) model: +Begin by loading the [`stable-diffusion-v1-5/stable-diffusion-v1-5`](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) model:  ```python from mindone.diffusers import DiffusionPipeline  -model_id = "runwayml/stable-diffusion-v1-5" +model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5" pipeline = DiffusionPipeline.from_pretrained(model_id, use_safetensors=True) ```  @@ -35,14 +35,14 @@ prompt = "portrait photo of a old warrior chief"  ## Speed  -One of the simplest ways to speed up inference is to place the pipeline on a GPU the same way you would with any Mindspore cell. +One of the simplest ways to speed up inference is to place the pipeline on an NPU the same way you would with any MindSpore cell.  That is, do nothing! MindSpore will automatically take care of model placement, so you don't need to:  ```diff - pipeline = pipeline.to("cuda") ```  -To make sure you can use the same image and improve on it, use a [`Generator`](https://numpy.org/doc/stable/reference/random/generator.html) and set a seed for [reproducibility](./using-diffusers/reproducibility): +To make sure you can use the same image and improve on it, use a [`Generator`](https://numpy.org/doc/stable/reference/random/generator.html) and set a seed for [reproducibility](./using-diffusers/reusing_seeds.md):  ```python import numpy as np  @@ -57,11 +57,11 @@ image = pipeline(prompt, generator=generator)[0][0] image ```
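For reference, this is a compact sketch of the seeded generation that the hunks above split apart. It reuses the numpy `Generator` convention shown in this guide; the `seed=0` value below is illustrative, since the exact seed line sits outside the hunk:

```python
import numpy as np
from mindone.diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True
)
prompt = "portrait photo of a old warrior chief"

# a seeded Generator makes repeated runs return the same image
generator = np.random.Generator(np.random.PCG64(seed=0))
image = pipeline(prompt, generator=generator)[0][0]
```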
-This process took ~30 seconds on a T4 GPU (it might be faster if your allocated GPU is better than a T4). By default, the [`DiffusionPipeline`] runs inference with full `float32` precision for 50 inference steps. You can speed this up by switching to a lower precision like `float16` or running fewer inference steps. +This process took ~5.6 seconds on an Ascend 910B in Graph mode. By default, the [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline) runs inference with full `float32` precision for 50 inference steps. You can speed this up by switching to a lower precision like `float16` or running fewer inference steps.  Let's start by loading the model in `float16` and generate an image:  @@ -74,39 +74,39 @@ image = pipeline(prompt, generator=generator)[0][0] image ```
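The `float16` load itself is outside the hunk above. A minimal sketch of it, assuming the same `mindspore_dtype` keyword this guide uses later for the VAE, would be:

```python
import mindspore
from mindone.diffusers import DiffusionPipeline

model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"
# load the whole pipeline in half precision to cut inference time and memory
pipeline = DiffusionPipeline.from_pretrained(
    model_id, mindspore_dtype=mindspore.float16, use_safetensors=True
)
```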
-This time, it only took ~11 seconds to generate the image, which is almost 3x faster than before! +This time, it only took ~3.8 seconds to generate the image, which is almost 1.5x faster than before! !!! tip 💡 We strongly suggest always running your pipelines in `float16`, and so far, we've rarely seen any degradation in output quality. -Another option is to reduce the number of inference steps. Choosing a more efficient scheduler could help decrease the number of steps without sacrificing output quality. You can find which schedulers are compatible with the current model in the [`DiffusionPipeline`] by calling the `compatibles` method: +Another option is to reduce the number of inference steps. Choosing a more efficient scheduler could help decrease the number of steps without sacrificing output quality. You can find which schedulers are compatible with the current model in the [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline) by calling the `compatibles` method: ```python pipeline.scheduler.compatibles [ - diffusers.schedulers.scheduling_lms_discrete.LMSDiscreteScheduler, - diffusers.schedulers.scheduling_unipc_multistep.UniPCMultistepScheduler, - diffusers.schedulers.scheduling_k_dpm_2_discrete.KDPM2DiscreteScheduler, - diffusers.schedulers.scheduling_deis_multistep.DEISMultistepScheduler, - diffusers.schedulers.scheduling_euler_discrete.EulerDiscreteScheduler, - diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler, - diffusers.schedulers.scheduling_ddpm.DDPMScheduler, - diffusers.schedulers.scheduling_dpmsolver_singlestep.DPMSolverSinglestepScheduler, - diffusers.schedulers.scheduling_k_dpm_2_ancestral_discrete.KDPM2AncestralDiscreteScheduler, - diffusers.utils.dummy_torch_and_torchsde_objects.DPMSolverSDEScheduler, - diffusers.schedulers.scheduling_heun_discrete.HeunDiscreteScheduler, - diffusers.schedulers.scheduling_pndm.PNDMScheduler, - diffusers.schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteScheduler, - diffusers.schedulers.scheduling_ddim.DDIMScheduler, + , + , + , + , + , + , + , + , + , + , + , + , + , + ] ``` -The Stable Diffusion model uses the [`PNDMScheduler`] by default which usually requires ~50 inference steps, but more performant schedulers like [`DPMSolverMultistepScheduler`], require only ~20 or 25 inference steps. Use the [`~ConfigMixin.from_config`] method to load a new scheduler: +The Stable Diffusion model uses the [`PNDMScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/pndm/#mindone.diffusers.PNDMScheduler) by default which usually requires ~50 inference steps, but more performant schedulers like [`DPMSolverMultistepScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/multistep_dpm_solver/#mindone.diffusers.DPMSolverMultistepScheduler), require only ~20 or 25 inference steps. Use the [`from_config`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/configuration/#mindone.diffusers.configuration_utils.ConfigMixin.from_config) method to load a new scheduler: ```python from mindone.diffusers import DPMSolverMultistepScheduler @@ -122,8 +122,8 @@ image = pipeline(prompt, generator=generator, num_inference_steps=20)[0][0] image ``` -
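The scheduler swap that follows the import shown above is elided between hunks. A minimal sketch using the `from_config` method linked in the text, reusing the `pipeline` and `prompt` from earlier, looks like this:

```python
import numpy as np
from mindone.diffusers import DPMSolverMultistepScheduler

# replace the default PNDMScheduler with the more efficient multistep DPM-Solver
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)

generator = np.random.Generator(np.random.PCG64(seed=0))
image = pipeline(prompt, generator=generator, num_inference_steps=20)[0][0]
```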
Great, you've managed to cut the inference time to just 4 seconds! ⚡️ @@ -152,24 +152,18 @@ images = pipeline(**get_inputs(batch_size=4))[0] make_image_grid(images, 2, 2) ``` -Unless you have a GPU with more vRAM, the code above probably returned an `OOM` error! Most of the memory is taken up by the cross-attention layers. Instead of running this operation in a batch, you can run it sequentially to save a significant amount of memory. All you have to do is configure the pipeline to use the [`~DiffusionPipeline.enable_attention_slicing`] function: - -```python -pipeline.enable_attention_slicing() -``` - Now try increasing the `batch_size` to 8! ```python -images = pipeline(**get_inputs(batch_size=8)).images +images = pipeline(**get_inputs(batch_size=8))[0] make_image_grid(images, rows=2, cols=4) ``` -
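The `get_inputs` helper used in these batched calls is defined earlier in the guide, outside this hunk. A minimal sketch of its assumed shape, with one seeded numpy `Generator` per prompt so every image in the batch stays reproducible, is:

```python
def get_inputs(batch_size=1):
    # one generator per image; seeds 0..batch_size-1 keep each sample reproducible
    generator = [np.random.Generator(np.random.PCG64(i)) for i in range(batch_size)]
    prompts = batch_size * [prompt]
    num_inference_steps = 20
    return {"prompt": prompts, "generator": generator, "num_inference_steps": num_inference_steps}
```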
-Whereas before you couldn't even generate a batch of 4 images, now you can generate a batch of 8 images at ~3.5 seconds per image! This is probably the fastest you can go on a T4 GPU without sacrificing quality. +Whereas before you couldn't even generate a batch of 4 images, now you can generate a batch of 8 images at ~1.6 seconds per image! This is probably the fastest you can go on an Ascend 910B without sacrificing quality.  ## Quality  @@ -188,14 +182,14 @@ You can also try replacing the current pipeline components with a newer version.  ```python from mindone.diffusers import AutoencoderKL  -vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", mindspore_dtype=mindspore.float16).to("cuda") +vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", mindspore_dtype=mindspore.float16) pipeline.vae = vae images = pipeline(**get_inputs(batch_size=8))[0] make_image_grid(images, rows=2, cols=4) ```
### Better prompt engineering @@ -219,8 +213,8 @@ images = pipeline(**get_inputs(batch_size=8))[0] make_image_grid(images, rows=2, cols=4) ``` -
Pretty impressive! Let's tweak the second image - corresponding to the `Generator` with a seed of `1` - a bit more by adding some text about the age of the subject: @@ -238,14 +232,13 @@ images = pipeline(prompt=prompts, generator=generator, num_inference_steps=25)[0 make_image_grid(images, 2, 2) ``` -
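The `prompts` and `generator` lists for this call are built outside the hunk. A rough sketch under the assumptions stated in the text (every generator seeded with `1`, hypothetical age descriptors appended to the base prompt) could look like:

```python
# the age descriptors below are hypothetical; the guide's exact prompt text is not shown in this hunk
prompts = [prompt + t for t in [", oldest", ", old", ", middle aged", ", young"]]
generator = [np.random.Generator(np.random.PCG64(1)) for _ in range(len(prompts))]

images = pipeline(prompt=prompts, generator=generator, num_inference_steps=25)[0]
make_image_grid(images, 2, 2)
```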
## Next steps -In this tutorial, you learned how to optimize a [`DiffusionPipeline`] for computational and memory efficiency as well as improving the quality of generated outputs. If you're interested in making your pipeline even faster, take a look at the following resources: +In this tutorial, you learned how to optimize a [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline) for computational and memory efficiency as well as improving the quality of generated outputs. If you're interested in making your pipeline even faster, take a look at the following resources: -- Learn how [PyTorch 2.0](./optimization/torch2.0) and [`torch.compile`](https://pytorch.org/docs/stable/generated/torch.compile.html) can yield 5 - 300% faster inference speed. On an A100 GPU, inference can be up to 50% faster! -- If you can't use PyTorch 2, we recommend you install [xFormers](./optimization/xformers). Its memory-efficient attention mechanism works great with PyTorch 1.13.1 for faster speed and reduced memory consumption. -- Other optimization techniques, such as model offloading, are covered in [this guide](./optimization/fp16). +- We recommend you use [xFormers](./optimization/xformers.md). Its memory-efficient attention mechanism works great for faster speed and reduced memory consumption. +- Other optimization techniques, such as model offloading, are covered in [this guide](./optimization/fp16.md). diff --git a/docs/diffusers/training/adapt_a_model.md b/docs/diffusers/training/adapt_a_model.md new file mode 100644 index 0000000000..0352e2a883 --- /dev/null +++ b/docs/diffusers/training/adapt_a_model.md @@ -0,0 +1,47 @@ +# Adapt a model to a new task + +Many diffusion systems share the same components, allowing you to adapt a pretrained model for one task to an entirely different task. + +This guide will show you how to adapt a pretrained text-to-image model for inpainting by initializing and modifying the architecture of a pretrained [`UNet2DConditionModel`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/unet2d-cond/#mindone.diffusers.UNet2DConditionModel). + +## Configure UNet2DConditionModel parameters + +A [`UNet2DConditionModel`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/unet2d-cond/#mindone.diffusers.UNet2DConditionModel) by default accepts 4 channels in the [input sample](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/unet2d-cond/#mindone.diffusers.UNet2DConditionModel). For example, load a pretrained text-to-image model like [`stable-diffusion-v1-5/stable-diffusion-v1-5`](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) and take a look at the number of `in_channels`: + +```py +from mindone.diffusers import StableDiffusionPipeline + +pipeline = StableDiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True) +pipeline.unet.config["in_channels"] +4 +``` + +Inpainting requires 9 channels in the input sample. 
You can check this value in a pretrained inpainting model like [`stable-diffusion-v1-5/stable-diffusion-inpainting`](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-inpainting): + +```py +from mindone.diffusers import StableDiffusionPipeline + +pipeline = StableDiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-inpainting", use_safetensors=True) +pipeline.unet.config["in_channels"] +9 +``` + +To adapt your text-to-image model for inpainting, you'll need to change the number of `in_channels` from 4 to 9. + +Initialize a [`UNet2DConditionModel`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/unet2d-cond/#mindone.diffusers.UNet2DConditionModel) with the pretrained text-to-image model weights, and change `in_channels` to 9. Changing the number of `in_channels` means you need to set `ignore_mismatched_sizes=True` and `low_cpu_mem_usage=False` to avoid a size mismatch error because the shape is different now. + +```py +from mindone.diffusers import UNet2DConditionModel + +model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5" +unet = UNet2DConditionModel.from_pretrained( + model_id, + subfolder="unet", + in_channels=9, + low_cpu_mem_usage=False, + ignore_mismatched_sizes=True, + use_safetensors=True, +) +``` + +The pretrained weights of the other components from the text-to-image model are initialized from their checkpoints, but the input channel weights (`conv_in.weight`) of the `unet` are randomly initialized. It is important to finetune the model for inpainting because otherwise the model returns noise. diff --git a/docs/diffusers/training/controlnet.md b/docs/diffusers/training/controlnet.md new file mode 100644 index 0000000000..0a5eda77bc --- /dev/null +++ b/docs/diffusers/training/controlnet.md @@ -0,0 +1,174 @@ + + +# ControlNet + +[ControlNet](https://hf.co/papers/2302.05543) models are adapters trained on top of another pretrained model. It allows for a greater degree of control over image generation by conditioning the model with an additional input image. The input image can be a canny edge, depth map, human pose, and many more. + +If you want to reduce memory footprint, you should try enabling the `gradient_checkpointing`, `gradient_accumulation_steps`, and `mixed_precision` parameters in the training command. You can also reduce your memory footprint by using memory-efficient attention with [xFormers](../optimization/xformers.md). + +This guide will explore the [train_controlnet.py](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/controlnet/train_controlnet.py) training script to help you become familiar with it, and how you can adapt it for your own use-case. + +Before running the script, make sure you install the library from source: + +```bash +git clone https://github.com/mindspore-lab/mindone.git +cd mindone +pip install . +``` + +Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset.md) guide to learn how to create a dataset that works with the training script. + +!!! tip + + The following sections highlight parts of the training script that are important for understanding how to modify it, but it doesn't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/controlnet/train_controlnet.py) and let us know if you have any questions or concerns. 
+ +## Script parameters + +The training script provides many parameters to help you customize your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/controlnet/train_controlnet.py#L147) function. This function provides default values for each parameter, such as the training batch size and learning rate, but you can also set your own values in the training command if you'd like. + +For example, to speedup training with mixed precision using the fp16 format, add the `--mixed_precision` parameter to the training command: + +```bash +python train_controlnet.py \ + --mixed_precision="fp16" +``` + +Many of the basic and important parameters are described in the [Text-to-image](text2image.md#script-parameters) training guide, so this guide just focuses on the relevant parameters for ControlNet: + +- `--max_train_samples`: the number of training samples; this can be lowered for faster training, but if you want to stream really large datasets, you'll need to include this parameter and the `--streaming` parameter in your training command +- `--gradient_accumulation_steps`: number of update steps to accumulate before the backward pass; this allows you to train with a bigger batch size than your NPU memory can typically handle + +## Training script + +As with the script parameters, a general walkthrough of the training script is provided in the [Text-to-image](text2image.md#training-script) training guide. Instead, this guide takes a look at the relevant parts of the ControlNet script. + +The training script has a [`make_train_dataset`](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/controlnet/train_controlnet.py#L510) function for preprocessing the dataset with image transforms and caption tokenization. You'll see that in addition to the usual caption tokenization and image transforms, the script also includes transforms for the conditioning image. + +```py +conditioning_image_transforms = transforms.Compose( + [ + vision.Resize(args.resolution, interpolation=vision.Inter.BILINEAR), + vision.CenterCrop(args.resolution), + vision.ToTensor(), + ] +) +``` + +Within the [`main()`](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/controlnet/train_controlnet.py#L638) function, you'll find the code for loading the tokenizer, text encoder, scheduler and models. 
This is also where the ControlNet model is loaded either from existing weights or randomly initialized from a UNet: + +```py +if args.controlnet_model_name_or_path: + logger.info("Loading existing controlnet weights") + controlnet = ControlNetModel.from_pretrained(args.controlnet_model_name_or_path) +else: + logger.info("Initializing controlnet weights from unet") + controlnet = ControlNetModel.from_unet(unet) +``` + +The [optimizer](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/controlnet/train_controlnet.py#L776) is set up to update the ControlNet parameters: + +```py +params_to_optimize = controlnet.trainable_params() +optimizer = nn.AdamWeightDecay( + params_to_optimize, + learning_rate=lr_scheduler, + beta1=args.adam_beta1, + beta2=args.adam_beta2, + weight_decay=args.adam_weight_decay, + eps=args.adam_epsilon, +) +``` + +Finally, in the [training loop](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/controlnet/train_controlnet.py#L846), the conditioning text embeddings and image are passed to the down and mid-blocks of the ControlNet model: + +```py +encoder_hidden_states = self.text_encoder(input_ids, return_dict=False)[0] +controlnet_image = conditioning_pixel_values.to(dtype=self.weight_dtype) + +down_block_res_samples, mid_block_res_sample = self.controlnet( + noisy_latents, + timesteps, + encoder_hidden_states=encoder_hidden_states, + controlnet_cond=controlnet_image, + return_dict=False, +) +``` + +If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline.md) tutorial which breaks down the basic pattern of the denoising process. + +## Launch the script + +Now you're ready to launch the training script! 🚀 + +This guide uses the [fusing/fill50k](https://huggingface.co/datasets/fusing/fill50k) dataset, but remember, you can create and use your own dataset if you want (see the [Create a dataset for training](create_dataset.md) guide). + +Set the environment variable `MODEL_NAME` to a model id on the Hub or a path to a local model and `OUTPUT_DIR` to where you want to save the model. + +Download the following images to condition your training with: + +```bash +wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_1.png +wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_2.png +``` + +Then launch the script! + +```bash +export MODEL_DIR="stable-diffusion-v1-5/stable-diffusion-v1-5" +export OUTPUT_DIR="path/to/save/model" + +python train_controlnet.py \ + --pretrained_model_name_or_path=$MODEL_DIR \ + --output_dir=$OUTPUT_DIR \ + --dataset_name=fusing/fill50k \ + --resolution=512 \ + --learning_rate=1e-5 \ + --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \ + --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \ + --train_batch_size=1 \ + --gradient_accumulation_steps=4 \ + --push_to_hub +``` + +Once training is complete, you can use your newly trained model for inference! 
+ +```py +from mindone.diffusers import StableDiffusionControlNetPipeline, ControlNetModel +from mindone.diffusers.utils import load_image +import mindspore as ms +import numpy as np + +controlnet = ControlNetModel.from_pretrained("path/to/controlnet", mindspore_dtype=ms.float16) +pipeline = StableDiffusionControlNetPipeline.from_pretrained( + "path/to/base/model", controlnet=controlnet, mindspore_dtype=ms.float16 +) + +control_image = load_image("./conditioning_image_1.png") +prompt = "pale golden rod circle with old lace background" + +generator = np.random.Generator(np.random.PCG64(0)) +image = pipeline(prompt, num_inference_steps=20, generator=generator, image=control_image)[0][0] +image.save("./output.png") +``` + +## Stable Diffusion XL + +Stable Diffusion XL (SDXL) is a powerful text-to-image model that generates high-resolution images, and it adds a second text-encoder to its architecture. Use the [`train_controlnet_sdxl.py`](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/controlnet/train_controlnet_sdxl.py) script to train a ControlNet adapter for the SDXL model. + +The SDXL training script is discussed in more detail in the [SDXL training](sdxl.md) guide. + +## Next steps + +Congratulations on training your own ControlNet! To learn more about how to use your new model, the following guides may be helpful: + +- Learn how to [use a ControlNet](../using-diffusers/controlnet.md) for inference on a variety of tasks. diff --git a/docs/diffusers/training/create_dataset.md b/docs/diffusers/training/create_dataset.md new file mode 100644 index 0000000000..cf0cddaffc --- /dev/null +++ b/docs/diffusers/training/create_dataset.md @@ -0,0 +1,81 @@ +# Create a dataset for training + +There are many datasets on the [Hub](https://huggingface.co/datasets?task_categories=task_categories:text-to-image&sort=downloads) to train a model on, but if you can't find one you're interested in or want to use your own, you can create a dataset with the 🤗 [Datasets](hf.co/docs/datasets) library. The dataset structure depends on the task you want to train your model on. The most basic dataset structure is a directory of images for tasks like unconditional image generation. Another dataset structure may be a directory of images and a text file containing their corresponding text captions for tasks like text-to-image generation. + +This guide will show you two ways to create a dataset to finetune on: + +- provide a folder of images to the `--train_data_dir` argument +- upload a dataset to the Hub and pass the dataset repository id to the `--dataset_name` argument + +!!! tip + + 💡 Learn more about how to create an image dataset for training in the [Create an image dataset](https://huggingface.co/docs/datasets/image_dataset) guide. + +## Provide a dataset as a folder + +For unconditional generation, you can provide your own dataset as a folder of images. The training script uses the [`ImageFolder`](https://huggingface.co/docs/datasets/en/image_dataset#imagefolder) builder from 🤗 Datasets to automatically build a dataset from the folder. Your directory structure should look like: + +```bash +data_dir/xxx.png +data_dir/xxy.png +data_dir/[...]/xxz.png +``` + +Pass the path to the dataset directory to the `--train_data_dir` argument, and then you can start training: + +```bash +python train_unconditional.py \ + --train_data_dir \ + +``` + +## Upload your data to the Hub + +!!! 
tip + + 💡 For more details and context about creating and uploading a dataset to the Hub, take a look at the [Image search with 🤗 Datasets](https://huggingface.co/blog/image-search-datasets) post. + +Start by creating a dataset with the [`ImageFolder`](https://huggingface.co/docs/datasets/image_load#imagefolder) feature, which creates an `image` column containing the PIL-encoded images. + +You can use the `data_dir` or `data_files` parameters to specify the location of the dataset. The `data_files` parameter supports mapping specific files to dataset splits like `train` or `test`: + +```python +from datasets import load_dataset + +# example 1: local folder +dataset = load_dataset("imagefolder", data_dir="path_to_your_folder") + +# example 2: local files (supported formats are tar, gzip, zip, xz, rar, zstd) +dataset = load_dataset("imagefolder", data_files="path_to_zip_file") + +# example 3: remote files (supported formats are tar, gzip, zip, xz, rar, zstd) +dataset = load_dataset( + "imagefolder", + data_files="https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip", +) + +# example 4: providing several splits +dataset = load_dataset( + "imagefolder", data_files={"train": ["path/to/file1", "path/to/file2"], "test": ["path/to/file3", "path/to/file4"]} +) +``` + +Then use the [`~datasets.Dataset.push_to_hub`] method to upload the dataset to the Hub: + +```python +# assuming you have ran the huggingface-cli login command in a terminal +dataset.push_to_hub("name_of_your_dataset") + +# if you want to push to a private repo, simply pass private=True: +dataset.push_to_hub("name_of_your_dataset", private=True) +``` + +Now the dataset is available for training by passing the dataset name to the `--dataset_name` argument: + +```bash +python train_text_to_image.py \ + --mixed_precision="fp16" \ + --pretrained_model_name_or_path="stable-diffusion-v1-5/stable-diffusion-v1-5" \ + --dataset_name="name_of_your_dataset" \ + +``` diff --git a/docs/diffusers/training/dreambooth.md b/docs/diffusers/training/dreambooth.md new file mode 100644 index 0000000000..ddee6e2565 --- /dev/null +++ b/docs/diffusers/training/dreambooth.md @@ -0,0 +1,254 @@ + + +# DreamBooth + +[DreamBooth](https://huggingface.co/papers/2208.12242) is a training technique that updates the entire diffusion model by training on just a few images of a subject or style. It works by associating a special word in the prompt with the example images. + +If you want to reduce memory footprint, you should try enabling the `gradient_checkpointing` and `mixed_precision` parameters in the training command. You can also reduce your memory footprint by using memory-efficient attention with [xFormers](../optimization/xformers.md). + +This guide will explore the [train_dreambooth.py](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/dreambooth/train_dreambooth.py) script to help you become more familiar with it, and how you can adapt it for your own use-case. + +Before running the script, make sure you install the library from source: + +```bash +git clone https://github.com/mindspore-lab/mindone.git +cd mindone +pip install . +``` + +Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset.md) guide to learn how to create a dataset that works with the training script. + +!!! 
tip + + The following sections highlight parts of the training script that are important for understanding how to modify it, but it doesn't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/dreambooth/train_dreambooth.py) and let us know if you have any questions or concerns. + +## Script parameters + +!!! warning + + DreamBooth is very sensitive to training hyperparameters, and it is easy to overfit. Read the [Training Stable Diffusion with Dreambooth using 🧨 Diffusers](https://huggingface.co/blog/dreambooth) blog post for recommended settings for different subjects to help you choose the appropriate hyperparameters. + +The training script offers many parameters for customizing your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/dreambooth/train_dreambooth.py#L128) function. The parameters are set with default values that should work pretty well out-of-the-box, but you can also set your own values in the training command if you'd like. + +For example, to train in the bf16 format: + +```bash +python train_dreambooth.py \ + --mixed_precision="bf16" +``` + +Some basic and important parameters to know and specify are: + +- `--pretrained_model_name_or_path`: the name of the model on the Hub or a local path to the pretrained model +- `--instance_data_dir`: path to a folder containing the training dataset (example images) +- `--instance_prompt`: the text prompt that contains the special word for the example images +- `--train_text_encoder`: whether to also train the text encoder +- `--output_dir`: where to save the trained model +- `--push_to_hub`: whether to push the trained model to the Hub +- `--checkpointing_steps`: frequency of saving a checkpoint as the model trains; this is useful if for some reason training is interrupted, you can continue training from that checkpoint by adding `--resume_from_checkpoint` to your training command + +### Prior preservation loss + +Prior preservation loss is a method that uses a model's own generated samples to help it learn how to generate more diverse images. Because these generated sample images belong to the same class as the images you provided, they help the model retain what it has learned about the class and how it can use what it already knows about the class to make new compositions. + +- `--with_prior_preservation`: whether to use prior preservation loss +- `--prior_loss_weight`: controls the influence of the prior preservation loss on the model +- `--class_data_dir`: path to a folder containing the generated class sample images +- `--class_prompt`: the text prompt describing the class of the generated sample images + +```bash +python train_dreambooth.py \ + --with_prior_preservation \ + --prior_loss_weight=1.0 \ + --class_data_dir="path/to/class/images" \ + --class_prompt="text prompt describing class" +``` + +### Train text encoder + +To improve the quality of the generated outputs, you can also train the text encoder in addition to the UNet. This requires additional memory. If you have the necessary hardware, then training the text encoder produces better results, especially when generating images of faces. 
Enable this option by: + +```bash +python train_dreambooth.py \ + --train_text_encoder +``` + +## Training script + +DreamBooth comes with its own dataset classes: + +- [`DreamBoothDataset`](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/dreambooth/train_dreambooth.py#L521): preprocesses the images and class images, and tokenizes the prompts for training +- [`PromptDataset`](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/dreambooth/train_dreambooth.py#L653): generates the prompt embeddings to generate the class images + +If you enabled [prior preservation loss](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/dreambooth/train_dreambooth.py#L734), the class images are generated here: + +```py +sample_dataset = PromptDataset(args.class_prompt, num_new_images) +sample_dataloader = GeneratorDataset( + sample_dataset, column_names=["example"], shard_id=args.rank, num_shards=args.world_size +).batch(batch_size=args.sample_batch_size) + +for (example,) in tqdm( + sample_dataloader_iter, + desc="Generating class images", + total=len(sample_dataloader), + disable=not is_master(args), +): + images = pipeline(example["prompt"].tolist())[0] +``` + +Next is the [`main()`](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/dreambooth/train_dreambooth.py#L715) function which handles setting up the dataset for training and the training loop itself. The script loads the [tokenizer](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/dreambooth/train_dreambooth.py#L794), [scheduler and models](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/dreambooth/train_dreambooth.py#L808): + +```py +# Load the tokenizer +if args.tokenizer_name: + tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, revision=args.revision, use_fast=False) +elif args.pretrained_model_name_or_path: + tokenizer = AutoTokenizer.from_pretrained( + args.pretrained_model_name_or_path, + subfolder="tokenizer", + revision=args.revision, + use_fast=False, + ) + +# Load scheduler and models +noise_scheduler = DDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler") +text_encoder = text_encoder_cls.from_pretrained( + args.pretrained_model_name_or_path, subfolder="text_encoder", revision=args.revision, variant=args.variant +) + +if model_has_vae(args): + vae = AutoencoderKL.from_pretrained( + args.pretrained_model_name_or_path, subfolder="vae", revision=args.revision, variant=args.variant + ) +else: + vae = None + +unet = UNet2DConditionModel.from_pretrained( + args.pretrained_model_name_or_path, subfolder="unet", revision=args.revision, variant=args.variant +) +``` + +Then, it's time to [create the training dataset](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/dreambooth/train_dreambooth.py#L892) and DataLoader from `DreamBoothDataset`: + +```py +train_dataset = DreamBoothDataset( + instance_data_root=args.instance_data_dir, + instance_prompt=args.instance_prompt, + class_data_root=args.class_data_dir if args.with_prior_preservation else None, + class_prompt=args.class_prompt, + class_num=args.num_class_images, + tokenizer=tokenizer, + size=args.resolution, + center_crop=args.center_crop, + encoder_hidden_states=pre_computed_encoder_hidden_states, + class_prompt_encoder_hidden_states=pre_computed_class_prompt_encoder_hidden_states, + tokenizer_max_length=args.tokenizer_max_length, +) + +train_dataloader = GeneratorDataset( + train_dataset, + 
column_names=["example"], + shuffle=True, + shard_id=args.rank, + num_shards=args.world_size, + num_parallel_workers=args.dataloader_num_workers, +).batch( + batch_size=args.train_batch_size, + per_batch_map=lambda examples, batch_info: collate_fn(examples, args.with_prior_preservation), + input_columns=["example"], + output_columns=["c1", "c2"] if args.pre_compute_text_embeddings else ["c1", "c2", "c3"], + num_parallel_workers=args.dataloader_num_workers, +) +``` + +Lastly, the [training loop](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/dreambooth/train_dreambooth.py#L1028) takes care of the remaining steps such as converting images to latent space, adding noise to the input, predicting the noise residual, and calculating the loss. + +If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline.md) tutorial which breaks down the basic pattern of the denoising process. + +## Launch the script + +You're now ready to launch the training script! 🚀 + +For this guide, you'll download some images of a [dog](https://huggingface.co/datasets/diffusers/dog-example) and store them in a directory. But remember, you can create and use your own dataset if you want (see the [Create a dataset for training](create_dataset.md) guide). + +```py +from huggingface_hub import snapshot_download + +local_dir = "./dog" +snapshot_download( + "diffusers/dog-example", + local_dir=local_dir, + repo_type="dataset", + ignore_patterns=".gitattributes", +) +``` + +Set the environment variable `MODEL_NAME` to a model id on the Hub or a path to a local model, `INSTANCE_DIR` to the path where you just downloaded the dog images to, and `OUTPUT_DIR` to where you want to save the model. You'll use `sks` as the special word to tie the training to. + +If you're interested in following along with the training process, you can periodically save generated images as training progresses. Add the following parameters to the training command: + +```bash +--validation_prompt="a photo of a sks dog" +--num_validation_images=4 +--validation_steps=100 +``` + +```bash +export MODEL_NAME="stable-diffusion-v1-5/stable-diffusion-v1-5" +export INSTANCE_DIR="./dog" +export OUTPUT_DIR="path_to_saved_model" + +python train_dreambooth.py \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --instance_data_dir=$INSTANCE_DIR \ + --output_dir=$OUTPUT_DIR \ + --instance_prompt="a photo of sks dog" \ + --resolution=512 \ + --train_batch_size=1 \ + --gradient_accumulation_steps=1 \ + --learning_rate=5e-6 \ + --lr_scheduler="constant" \ + --lr_warmup_steps=0 \ + --max_train_steps=400 \ + --push_to_hub +``` + +Once training is complete, you can use your newly trained model for inference! + +```py +from mindone.diffusers import DiffusionPipeline +import mindspore as ms + +pipeline = DiffusionPipeline.from_pretrained("path_to_saved_model", mindspore_dtype=ms.float16, use_safetensors=True) +image = pipeline("A photo of sks dog in a bucket", num_inference_steps=50, guidance_scale=7.5)[0][0] +image.save("dog-bucket.png") +``` + +## LoRA + +LoRA is a training technique for significantly reducing the number of trainable parameters. As a result, training is faster and it is easier to store the resulting weights because they are a lot smaller (~100MBs). Use the [train_dreambooth_lora.py](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/dreambooth/train_dreambooth_lora.py) script to train with LoRA. 
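As with the full DreamBooth checkpoint above, a LoRA-trained DreamBooth model is used by loading the adapter weights on top of the base pipeline. A minimal inference sketch, assuming the adapter was saved under the `pytorch_lora_weights.safetensors` name used in the LoRA guide (the output directory is hypothetical):

```py
from mindone.diffusers import DiffusionPipeline
import mindspore as ms

pipeline = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", mindspore_dtype=ms.float16, use_safetensors=True
)
# load the trained LoRA adapter on top of the base weights
pipeline.load_lora_weights("path_to_saved_model", weight_name="pytorch_lora_weights.safetensors")
image = pipeline("A photo of sks dog in a bucket", num_inference_steps=50, guidance_scale=7.5)[0][0]
image.save("dog-bucket.png")
```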
+ +The LoRA training script is discussed in more detail in the [LoRA training](lora.md) guide. + +## Stable Diffusion XL + +Stable Diffusion XL (SDXL) is a powerful text-to-image model that generates high-resolution images, and it adds a second text-encoder to its architecture. Use the [train_dreambooth_lora_sdxl.py](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/dreambooth/train_dreambooth_lora_sdxl.py) script to train a SDXL model with LoRA. + +The SDXL training script is discussed in more detail in the [SDXL training](sdxl.md) guide. + +## Next steps + +Congratulations on training your DreamBooth model! To learn more about how to use your new model, the following guide may be helpful: + +- Learn how to [load a DreamBooth](../using-diffusers/loading_adapters.md) model for inference if you trained your model with LoRA. diff --git a/docs/diffusers/training/lora.md b/docs/diffusers/training/lora.md new file mode 100644 index 0000000000..4064259130 --- /dev/null +++ b/docs/diffusers/training/lora.md @@ -0,0 +1,171 @@ + + +# LoRA + +!!! warning + + This is experimental and the API may change in the future. + +[LoRA (Low-Rank Adaptation of Large Language Models)](https://hf.co/papers/2106.09685) is a popular and lightweight training technique that significantly reduces the number of trainable parameters. It works by inserting a smaller number of new weights into the model and only these are trained. This makes training with LoRA much faster, memory-efficient, and produces smaller model weights (a few hundred MBs), which are easier to store and share. LoRA can also be combined with other training techniques like DreamBooth to speedup training. + +!!! tip + + LoRA is very versatile and supported for [DreamBooth](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/dreambooth/train_dreambooth_lora.py), [Stable Diffusion XL](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image_lora_sdxl.py) and [text-to-image](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image_lora.py). + +This guide will explore the [train_text_to_image_lora.py](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image_lora.py) script to help you become more familiar with it, and how you can adapt it for your own use-case. + +Before running the script, make sure you install the library from source: + +```bash +git clone https://github.com/mindspore-lab/mindone.git +cd mindone +pip install . +``` + +Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset.md) guide to learn how to create a dataset that works with the training script. + +!!! tip + + The following sections highlight parts of the training script that are important for understanding how to modify it, but it doesn't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image_lora.py) and let us know if you have any questions or concerns. + +## Script parameters + +The training script has many parameters to help you customize your training run. 
All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image_lora.py#L98) function. Default values are provided for most parameters that work pretty well, but you can also set your own values in the training command if you'd like. + +For example, to increase the number of epochs to train: + +```bash +python train_text_to_image_lora.py \ + --num_train_epochs=150 \ +``` + +Many of the basic and important parameters are described in the [Text-to-image](text2image.md#script-parameters) training guide, so this guide just focuses on the LoRA relevant parameters: + +- `--rank`: the inner dimension of the low-rank matrices to train; a higher rank means more trainable parameters +- `--learning_rate`: the default learning rate is 1e-4, but with LoRA, you can use a higher learning rate + +## Training script + +The dataset preprocessing code and training loop are found in the [`main()`](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image_lora.py#L402) function, and if you need to adapt the training script, this is where you'll make your changes. + +As with the script parameters, a walkthrough of the training script is provided in the [Text-to-image](text2image.md#training-script) training guide. Instead, this guide takes a look at the LoRA relevant parts of the script. + +=== "UNet" + + Diffusers uses [`~peft.LoraConfig`] from the [PEFT](https://hf.co/docs/peft) library to set up the parameters of the LoRA adapter such as the rank, alpha, and which modules to insert the LoRA weights into. The adapter is added to the UNet, and only the LoRA layers are filtered for optimization in `lora_layers`. + + ```py + unet_lora_config = LoraConfig( + r=args.lora_rank, + lora_alpha=args.lora_rank, + init_lora_weights="gaussian", + target_modules=["to_k", "to_q", "to_v", "to_out.0"], + ) + + unet.add_adapter(unet_lora_config) + lora_layers = list(filter(lambda p: p.requires_grad, unet.get_parameters())) + ``` + +=== "text encoder" + + Diffusers also supports finetuning the text encoder with LoRA from the [PEFT](https://hf.co/docs/peft) library when necessary such as finetuning Stable Diffusion XL (SDXL). The [`~peft.LoraConfig`] is used to configure the parameters of the LoRA adapter which are then added to the text encoder, and only the LoRA layers are filtered for training. 
+ + ```py + text_lora_config = LoraConfig( + r=args.lora_rank, + lora_alpha=args.lora_rank, + init_lora_weights="gaussian", + target_modules=["q_proj", "k_proj", "v_proj", "out_proj"], + ) + + text_encoder_one.add_adapter(text_lora_config) + text_encoder_two.add_adapter(text_lora_config) + if args.train_text_encoder: + params_to_optimize = ( + params_to_optimize + + list(filter(lambda p: p.requires_grad, text_encoder_one.get_parameters())) + + list(filter(lambda p: p.requires_grad, text_encoder_two.get_parameters())) + ) + ``` + +The [optimizer](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image_lora.py#L646) is initialized with the `lora_layers` because these are the only weights that'll be optimized: + +```py +optimizer = nn.AdamWeightDecay( + lora_layers, + learning_rate=lr_scheduler, + beta1=args.adam_beta1, + beta2=args.adam_beta2, + weight_decay=args.adam_weight_decay, + eps=args.adam_epsilon, +) +``` + +Aside from setting up the LoRA layers, the training script is more or less the same as train_text_to_image.py! + +## Launch the script + +Once you've made all your changes or you're okay with the default configuration, you're ready to launch the training script! 🚀 + +Let's train on the [Naruto BLIP captions](https://huggingface.co/datasets/lambdalabs/naruto-blip-captions) dataset to generate your own Naruto characters. Set the environment variables `MODEL_NAME` and `DATASET_NAME` to the model and dataset respectively. You should also specify where to save the model in `OUTPUT_DIR`, and the name of the model to save to on the Hub with `HUB_MODEL_ID`. The script creates and saves the following files to your repository: + +- saved model checkpoints +- `pytorch_lora_weights.safetensors` (the trained LoRA weights) + +```bash +export MODEL_NAME="stable-diffusion-v1-5/stable-diffusion-v1-5" +export OUTPUT_DIR="/sddata/finetune/lora/naruto" +export HUB_MODEL_ID="naruto-lora" +export DATASET_NAME="lambdalabs/naruto-blip-captions" + +python train_text_to_image_lora.py \ + --mixed_precision="fp16" \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --dataset_name=$DATASET_NAME \ + --dataloader_num_workers=8 \ + --resolution=512 \ + --center_crop \ + --random_flip \ + --train_batch_size=1 \ + --gradient_accumulation_steps=4 \ + --max_train_steps=15000 \ + --learning_rate=1e-04 \ + --max_grad_norm=1 \ + --lr_scheduler="cosine" \ + --lr_warmup_steps=0 \ + --output_dir=${OUTPUT_DIR} \ + --push_to_hub \ + --hub_model_id=${HUB_MODEL_ID} \ + --report_to=wandb \ + --checkpointing_steps=500 \ + --validation_prompt="A naruto with blue eyes." \ + --seed=1337 +``` + +Once training has been completed, you can use your model for inference: + +```py +from mindone.diffusers import StableDiffusionPipeline +import mindspore as ms + +pipeline = StableDiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", mindspore_dtype=ms.float16) +pipeline.load_lora_weights("path/to/lora/model", weight_name="pytorch_lora_weights.safetensors") +image = pipeline("A naruto with blue eyes")[0][0] +``` + +## Next steps + +Congratulations on training a new model with LoRA! To learn more about how to use your new model, the following guides may be helpful: + +- Learn how to [load different LoRA formats](../using-diffusers/loading_adapters.md#LoRA) trained using community trainers like Kohya and TheLastBen. +- Learn how to use and [combine multiple LoRA's](../tutorials/using_peft_for_inference.md) with PEFT for inference. 
diff --git a/docs/diffusers/training/overview.md b/docs/diffusers/training/overview.md new file mode 100644 index 0000000000..300d74ee92 --- /dev/null +++ b/docs/diffusers/training/overview.md @@ -0,0 +1,51 @@ + + +# Overview + +🤗 Diffusers provides a collection of training scripts for you to train your own diffusion models. You can find all of our training scripts in [diffusers/examples](https://github.com/mindspore-lab/mindone/tree/master/examples/diffusers). + +Each training script is: + +- **Self-contained**: the training script does not depend on any local files, and all packages required to run the script are installed from the `requirements.txt` file. +- **Easy-to-tweak**: the training scripts are an example of how to train a diffusion model for a specific task and won't work out-of-the-box for every training scenario. You'll likely need to adapt the training script for your specific use-case. To help you with that, we've fully exposed the data preprocessing code and the training loop so you can modify it for your own use. +- **Beginner-friendly**: the training scripts are designed to be beginner-friendly and easy to understand, rather than including the latest state-of-the-art methods to get the best and most competitive results. Any training methods we consider too complex are purposefully left out. +- **Single-purpose**: each training script is expressly designed for only one task to keep it readable and understandable. + +Our current collection of training scripts include: + +| Training | SDXL-support | LoRA-support | +|---|---|---| +| [unconditional image generation](https://github.com/mindspore-lab/mindone/tree/master/examples/diffusers/unconditional_image_generation) | | | +| [text-to-image](https://github.com/mindspore-lab/mindone/tree/master/examples/diffusers/text_to_image) | 👍 | 👍 | +| [textual inversion](https://github.com/mindspore-lab/mindone/tree/master/examples/diffusers/textual_inversion) | | | +| [DreamBooth](https://github.com/mindspore-lab/mindone/tree/master/examples/diffusers/dreambooth) | 👍 | 👍 | +| [ControlNet](https://github.com/mindspore-lab/mindone/tree/master/examples/diffusers/controlnet) | 👍 | | + +These examples are **actively** maintained, so please feel free to open an issue if they aren't working as expected. If you feel like another training example should be included, you're more than welcome to start a [Feature Request](https://github.com/mindspore-lab/mindone/issues/new?assignees=zhanghuiyao&labels=rfc&projects=&template=feature_request.md&title=) to discuss your feature idea with us and whether it meets our criteria of being self-contained, easy-to-tweak, beginner-friendly, and single-purpose. + +## Install + +Make sure you can successfully run the latest versions of the example scripts by installing the library from source in a new virtual environment: + +```bash +git clone https://github.com/mindspore-lab/mindone.git +cd mindone +pip install . +``` + +Then navigate to the folder of the training script (for example, [DreamBooth](https://github.com/mindspore-lab/mindone/tree/master/examples/diffusers/dreambooth)) and install the `requirements.txt` file. Some training scripts have a specific requirement file for SDXL, LoRA or Flax. If you're using one of these scripts, make sure you install its corresponding requirements file. 
+ +```bash +cd examples/diffusers/dreambooth +pip install -r requirements_sd3.txt +``` diff --git a/docs/diffusers/training/sdxl.md b/docs/diffusers/training/sdxl.md new file mode 100644 index 0000000000..3ef0c3b8a1 --- /dev/null +++ b/docs/diffusers/training/sdxl.md @@ -0,0 +1,191 @@ + + +# Stable Diffusion XL + +!!! warning + + This script is experimental, and it's easy to overfit and run into issues like catastrophic forgetting. Try exploring different hyperparameters to get the best results on your dataset. + +[Stable Diffusion XL (SDXL)](https://hf.co/papers/2307.01952) is a larger and more powerful iteration of the Stable Diffusion model, capable of producing higher resolution images. + +SDXL's UNet is 3x larger and the model adds a second text encoder to the architecture. Depending on the hardware available to you, this can be very computationally intensive. To help fit this larger model into memory and to speedup training, try enabling `gradient_checkpointing`, `mixed_precision`, and `gradient_accumulation_steps`. You can reduce your memory-usage even more by enabling memory-efficient attention with [xFormers](../optimization/xformers.md). + +This guide will explore the [train_text_to_image_sdxl.py](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image_sdxl.py) training script to help you become more familiar with it, and how you can adapt it for your own use-case. + +Before running the script, make sure you install the library from source: + +```bash +git clone https://github.com/mindspore-lab/mindone.git +cd mindone +pip install . +``` + +Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset.md) guide to learn how to create a dataset that works with the training script. + +## Script parameters + +!!! tip + + The following sections highlight parts of the training script that are important for understanding how to modify it, but it doesn't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image_sdxl.py) and let us know if you have any questions or concerns. + +The training script provides many parameters to help you customize your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image_sdxl.py#L81) function. This function provides default values for each parameter, such as the training batch size and learning rate, but you can also set your own values in the training command if you'd like. + +For example, to speedup training with mixed precision using the bf16 format, add the `--mixed_precision` parameter to the training command: + +```bash +python train_text_to_image_sdxl.py \ + --mixed_precision="bf16" +``` + +Most of the parameters are identical to the parameters in the [Text-to-image](text2image.md#script-parameters) training guide, so you'll focus on the parameters that are relevant to training SDXL in this guide. 
+ +- `--pretrained_vae_model_name_or_path`: path to a pretrained VAE; the SDXL VAE is known to suffer from numerical instability, so this parameter allows you to specify a better [VAE](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix) +- `--proportion_empty_prompts`: the proportion of image prompts to replace with empty strings +- `--timestep_bias_strategy`: where (earlier vs. later) in the timestep to apply a bias, which can encourage the model to either learn low or high frequency details +- `--timestep_bias_multiplier`: the weight of the bias to apply to the timestep +- `--timestep_bias_begin`: the timestep to begin applying the bias +- `--timestep_bias_end`: the timestep to end applying the bias +- `--timestep_bias_portion`: the proportion of timesteps to apply the bias to + +### Min-SNR weighting + +The [Min-SNR](https://huggingface.co/papers/2303.09556) weighting strategy can help with training by rebalancing the loss to achieve faster convergence. The training script supports predicting either `epsilon` (noise) or `v_prediction`, but Min-SNR is compatible with both prediction types. + +Add the `--snr_gamma` parameter and set it to the recommended value of 5.0: + +```bash +python train_text_to_image_sdxl.py \ + --snr_gamma=5.0 +``` + +## Training script + +The training script is also similar to the [Text-to-image](text2image.md#training-script) training guide, but it's been modified to support SDXL training. This guide will focus on the code that is unique to the SDXL training script. + +It starts by creating functions to [tokenize the prompts](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image_sdxl.py#L453) to calculate the prompt embeddings, and to compute the image embeddings with the [VAE](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image_sdxl.py#L493). Next, you'll a function to [generate the timesteps weights](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image_sdxl.py#L504) depending on the number of timesteps and the timestep bias strategy to apply. + +Within the [`main()`](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image_sdxl.py#L548) function, in addition to loading a tokenizer, the script loads a second tokenizer and text encoder because the SDXL architecture uses two of each: + +```py +tokenizer_one = AutoTokenizer.from_pretrained( + args.pretrained_model_name_or_path, + subfolder="tokenizer", + revision=args.revision, + use_fast=False, +) +tokenizer_two = AutoTokenizer.from_pretrained( + args.pretrained_model_name_or_path, + subfolder="tokenizer_2", + revision=args.revision, + use_fast=False, +) + +text_encoder_cls_one = import_model_class_from_model_name_or_path(args.pretrained_model_name_or_path, args.revision) +text_encoder_cls_two = import_model_class_from_model_name_or_path( + args.pretrained_model_name_or_path, args.revision, subfolder="text_encoder_2" +) +``` + +The [prompt and image embeddings](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image_sdxl.py#L764) are computed first and kept in memory, which isn't typically an issue for a smaller dataset, but for larger datasets it can lead to memory problems. 
If this is the case, you should save the pre-computed embeddings to disk separately and load them into memory during the training process (see this [PR](https://github.com/huggingface/diffusers/pull/4505) for more discussion about this topic). + +```py +text_encoders = [text_encoder_one, text_encoder_two] +tokenizers = [tokenizer_one, tokenizer_two] +compute_embeddings_fn = functools.partial( + encode_prompt, + text_encoders=text_encoders, + tokenizers=tokenizers, + proportion_empty_prompts=args.proportion_empty_prompts, + caption_column=args.caption_column, +) + +train_dataset = train_dataset.map(compute_embeddings_fn, batched=True, new_fingerprint=new_fingerprint) +train_dataset = train_dataset.map( + compute_vae_encodings_fn, + batched=True, + batch_size=args.train_batch_size, + new_fingerprint=new_fingerprint_for_vae, +) +``` + +Finally, the [training loop](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image_sdxl.py#L909) takes care of the rest. If you chose to apply a timestep bias strategy, you'll see the timestep weights are calculated and added as noise: + +```py +weights = generate_timestep_weights(self.args, self.noise_scheduler_num_train_timesteps) +timesteps = multinomial(weights, bsz, replacement=True).long() + +noisy_model_input = self.noise_scheduler.add_noise(model_input, noise, timesteps) +``` + +If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline.md) tutorial which breaks down the basic pattern of the denoising process. + +## Launch the script + +Once you’ve made all your changes or you’re okay with the default configuration, you’re ready to launch the training script! 🚀 + +Let’s train on the [Naruto BLIP captions](https://huggingface.co/datasets/lambdalabs/naruto-blip-captions) dataset to generate your own Naruto characters. Set the environment variables `MODEL_NAME` and `DATASET_NAME` to the model and the dataset (either from the Hub or a local path). You should also specify a VAE other than the SDXL VAE (either from the Hub or a local path) with `VAE_NAME` to avoid numerical instabilities. + +!!! tip + + To monitor training progress with Weights & Biases, add the `--report_to=wandb` parameter to the training command. You’ll also need to add the `--validation_prompt` and `--validation_epochs` to the training command to keep track of results. This can be really useful for debugging the model and viewing intermediate results. 
+ +```bash +export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0" +export VAE_NAME="madebyollin/sdxl-vae-fp16-fix" +export DATASET_NAME="lambdalabs/naruto-blip-captions" + +python train_text_to_image_sdxl.py \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --pretrained_vae_model_name_or_path=$VAE_NAME \ + --dataset_name=$DATASET_NAME \ + --enable_xformers_memory_efficient_attention \ + --resolution=512 \ + --center_crop \ + --random_flip \ + --proportion_empty_prompts=0.2 \ + --train_batch_size=1 \ + --gradient_accumulation_steps=4 \ + --gradient_checkpointing \ + --max_train_steps=10000 \ + --use_8bit_adam \ + --learning_rate=1e-06 \ + --lr_scheduler="constant" \ + --lr_warmup_steps=0 \ + --mixed_precision="fp16" \ + --report_to="wandb" \ + --validation_prompt="a cute Sundar Pichai creature" \ + --validation_epochs 5 \ + --checkpointing_steps=5000 \ + --output_dir="sdxl-naruto-model" \ + --push_to_hub +``` + +After you've finished training, you can use your newly trained SDXL model for inference! + +```py +from mindone.diffusers import DiffusionPipeline +import mindspore as ms + +pipeline = DiffusionPipeline.from_pretrained("path/to/your/model", mindspore_dtype=ms.float16) + +prompt = "A naruto with green eyes and red legs." +image = pipeline(prompt, num_inference_steps=30, guidance_scale=7.5)[0][0] +image.save("naruto.png") +``` + +## Next steps + +Congratulations on training a SDXL model! To learn more about how to use your new model, the following guides may be helpful: + +- Read the [Stable Diffusion XL](../using-diffusers/sdxl.md) guide to learn how to use it for a variety of different tasks (text-to-image, image-to-image, inpainting), how to use it's refiner model, and the different types of micro-conditionings. +- Check out the [DreamBooth](dreambooth.md) and [LoRA](lora.md) training guides to learn how to train a personalized SDXL model with just a few example images. These two training techniques can even be combined! diff --git a/docs/diffusers/training/text2image.md b/docs/diffusers/training/text2image.md new file mode 100644 index 0000000000..d570d92e94 --- /dev/null +++ b/docs/diffusers/training/text2image.md @@ -0,0 +1,155 @@ + + +# Text-to-image + +!!! warning + + The text-to-image script is experimental, and it's easy to overfit and run into issues like catastrophic forgetting. Try exploring different hyperparameters to get the best results on your dataset. + +Text-to-image models like Stable Diffusion are conditioned to generate images given a text prompt. + +Training a model can be taxing on your hardware, but if you enable `gradient_checkpointing` and `mixed_precision`, it can reduce video memory usage. You can reduce your memory footprint by enabling memory-efficient attention with [xFormers](../optimization/xformers.md). + +This guide will explore the [train_text_to_image.py](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image.py) training script to help you become familiar with it, and how you can adapt it for your own use-case. + +Before running the script, make sure you install the library from source: + +```bash +git clone https://github.com/mindspore-lab/mindone.git +cd mindone +pip install . +``` + +Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset.md) guide to learn how to create a dataset that works with the training script. + +## Script parameters + +!!! 
tip + + The following sections highlight parts of the training script that are important for understanding how to modify it, but it doesn't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image.py) and let us know if you have any questions or concerns. + +The training script provides many parameters to help you customize your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image.py#L90) function. This function provides default values for each parameter, such as the training batch size and learning rate, but you can also set your own values in the training command if you'd like. + +For example, to speedup training with mixed precision using the fp16 format, add the `--mixed_precision` parameter to the training command: + +```bash +python train_text_to_image.py \ + --mixed_precision="fp16" +``` + +Some basic and important parameters include: + +- `--pretrained_model_name_or_path`: the name of the model on the Hub or a local path to the pretrained model +- `--dataset_name`: the name of the dataset on the Hub or a local path to the dataset to train on +- `--image_column`: the name of the image column in the dataset to train on +- `--caption_column`: the name of the text column in the dataset to train on +- `--output_dir`: where to save the trained model +- `--push_to_hub`: whether to push the trained model to the Hub +- `--checkpointing_steps`: frequency of saving a checkpoint as the model trains; this is useful if for some reason training is interrupted, you can continue training from that checkpoint by adding `--resume_from_checkpoint` to your training command + +### Min-SNR weighting + +The [Min-SNR](https://huggingface.co/papers/2303.09556) weighting strategy can help with training by rebalancing the loss to achieve faster convergence. The training script supports predicting `epsilon` (noise) or `v_prediction`, but Min-SNR is compatible with both prediction types. + +Add the `--snr_gamma` parameter and set it to the recommended value of 5.0: + +```bash +python train_text_to_image.py \ + --snr_gamma=5.0 +``` + +You can compare the loss surfaces for different `snr_gamma` values in this [Weights and Biases](https://wandb.ai/sayakpaul/text2image-finetune-minsnr) report. For smaller datasets, the effects of Min-SNR may not be as obvious compared to larger datasets. + +## Training script + +The dataset preprocessing code and training loop are found in the [`main()`](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image.py#L411) function. If you need to adapt the training script, this is where you'll need to make your changes. + +The `train_text_to_image` script starts by [loading a scheduler](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image.py#L437) and tokenizer. 
You can choose to use a different scheduler here if you want: + +```py +noise_scheduler = DDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler") +tokenizer = CLIPTokenizer.from_pretrained( + args.pretrained_model_name_or_path, subfolder="tokenizer", revision=args.revision +) +``` + +Then the script [loads the UNet](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image.py#L437) model: + +```py +unet = UNet2DConditionModel.from_pretrained( + args.pretrained_model_name_or_path, subfolder="unet", revision=args.non_ema_revision +) +unet.register_to_config(sample_size=args.resolution // (2 ** (len(vae.config.block_out_channels) - 1))) +``` + +Next, the text and image columns of the dataset need to be preprocessed. The [`tokenize_captions`](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image.py#L548) function handles tokenizing the inputs, and the [`train_transforms`](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image.py#L566) function specifies the type of transforms to apply to the image. Both of these functions are bundled into `preprocess_train`: + +```py +def preprocess_train(examples): + images = [image.convert("RGB") for image in examples[image_column]] + examples["pixel_values"] = [train_transforms(image)[0] for image in images] + examples["input_ids"] = tokenize_captions(examples) + return examples +``` + +Lastly, the [training loop](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image.py#L751) handles everything else. It encodes images into latent space, adds noise to the latents, computes the text embeddings to condition on, updates the model parameters, and saves and pushes the model to the Hub. If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline.md) tutorial which breaks down the basic pattern of the denoising process. + +## Launch the script + +Once you've made all your changes or you're okay with the default configuration, you're ready to launch the training script! 🚀 + +Let's train on the [Naruto BLIP captions](https://huggingface.co/datasets/lambdalabs/naruto-blip-captions) dataset to generate your own Naruto characters. Set the environment variables `MODEL_NAME` and `dataset_name` to the model and the dataset (either from the Hub or a local path). + +!!! tip + + To train on a local dataset, set the `TRAIN_DIR` and `OUTPUT_DIR` environment variables to the path of the dataset and where to save the model to. 
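+
+    If you're not sure whether your local folder is laid out in a way 🤗 Datasets can read, you can sanity-check it with the `imagefolder` builder first (a quick, optional check; the folder path below is illustrative):
+
+    ```py
+    from datasets import load_dataset
+
+    # Illustrative path; point this at the folder you plan to pass via TRAIN_DIR / --train_data_dir
+    dataset = load_dataset("imagefolder", data_dir="./my-local-dataset", split="train")
+    print(dataset[0]["image"])  # should print a PIL image
+    ```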
+ +```bash +export MODEL_NAME="stable-diffusion-v1-5/stable-diffusion-v1-5" +export dataset_name="lambdalabs/naruto-blip-captions" + +python train_text_to_image.py \ + --mixed_precision="fp16" \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --dataset_name=$dataset_name \ + --resolution=512 --center_crop --random_flip \ + --train_batch_size=1 \ + --gradient_accumulation_steps=4 \ + --gradient_checkpointing \ + --max_train_steps=15000 \ + --learning_rate=1e-05 \ + --max_grad_norm=1 \ + --enable_xformers_memory_efficient_attention \ + --lr_scheduler="constant" --lr_warmup_steps=0 \ + --output_dir="sd-naruto-model" \ + --push_to_hub +``` + +Once training is complete, you can use your newly trained model for inference: + +```py +from mindone.diffusers import StableDiffusionPipeline +import mindspore as ms + +pipeline = StableDiffusionPipeline.from_pretrained("path/to/saved_model", mindspore_dtype=ms.float16, use_safetensors=True) + +image = pipeline(prompt="yoda")[0][0] +image.save("yoda-naruto.png") +``` + +## Next steps + +Congratulations on training your own text-to-image model! To learn more about how to use your new model, the following guides may be helpful: + +- Learn how to [load LoRA weights](../using-diffusers/loading_adapters.md#LoRA) for inference if you trained your model with LoRA. +- Learn more about how certain parameters like guidance scale or techniques such as prompt weighting can help you control inference in the [Text-to-image](../using-diffusers/conditional_image_generation.md) task guide. diff --git a/docs/diffusers/training/text_inversion.md b/docs/diffusers/training/text_inversion.md new file mode 100644 index 0000000000..c002bbb27d --- /dev/null +++ b/docs/diffusers/training/text_inversion.md @@ -0,0 +1,186 @@ + + +# Textual Inversion + +[Textual Inversion](https://hf.co/papers/2208.01618) is a training technique for personalizing image generation models with just a few example images of what you want it to learn. This technique works by learning and updating the text embeddings (the new embeddings are tied to a special word you must use in the prompt) to match the example images you provide. + +If you want to reduce memory footprint, you should try enabling the `gradient_checkpointing` and `mixed_precision` parameters in the training command. You can also reduce your memory footprint by using memory-efficient attention with [xFormers](../optimization/xformers.md). + +This guide will explore the [textual_inversion.py](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/textual_inversion/textual_inversion.py) script to help you become more familiar with it, and how you can adapt it for your own use-case. + +Before running the script, make sure you install the library from source: + +```bash +git clone https://github.com/mindspore-lab/mindone.git +cd mindone +pip install . +``` + +Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset.md) guide to learn how to create a dataset that works with the training script. + +!!! tip + + The following sections highlight parts of the training script that are important for understanding how to modify it, but it doesn't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/textual_inversion/textual_inversion.py) and let us know if you have any questions or concerns. 
+ +## Script parameters + +The training script has many parameters to help you tailor the training run to your needs. All of the parameters and their descriptions are listed in the [`parse_args()`](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/textual_inversion/textual_inversion.py#L100) function. Where applicable, Diffusers provides default values for each parameter such as the training batch size and learning rate, but feel free to change these values in the training command if you'd like. + +For example, to increase the number of gradient accumulation steps above the default value of 1: + +```bash +python textual_inversion.py \ + --gradient_accumulation_steps=4 +``` + +Some other basic and important parameters to specify include: + +- `--pretrained_model_name_or_path`: the name of the model on the Hub or a local path to the pretrained model +- `--train_data_dir`: path to a folder containing the training dataset (example images) +- `--output_dir`: where to save the trained model +- `--push_to_hub`: whether to push the trained model to the Hub +- `--checkpointing_steps`: frequency of saving a checkpoint as the model trains; this is useful if for some reason training is interrupted, you can continue training from that checkpoint by adding `--resume_from_checkpoint` to your training command +- `--num_vectors`: the number of vectors to learn the embeddings with; increasing this parameter helps the model learn better but it comes with increased training costs +- `--placeholder_token`: the special word to tie the learned embeddings to (you must use the word in your prompt for inference) +- `--initializer_token`: a single-word that roughly describes the object or style you're trying to train on +- `--learnable_property`: whether you're training the model to learn a new "style" (for example, Van Gogh's painting style) or "object" (for example, your dog) + +## Training script + +Unlike some of the other training scripts, textual_inversion.py has a custom dataset class, [`TextualInversionDataset`](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/textual_inversion/textual_inversion.py#L429) for creating a dataset. You can customize the image size, placeholder token, interpolation method, whether to crop the image, and more. If you need to change how the dataset is created, you can modify `TextualInversionDataset`. + +Next, you'll find the dataset preprocessing code and training loop in the [`main()`](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/textual_inversion/textual_inversion.py#L514) function. 
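+
+Before moving on, it helps to know that `TextualInversionDataset` builds its training captions by dropping the placeholder token into a set of simple prompt templates (separate template lists are used for the "object" and "style" properties). The snippet below is only an illustrative sketch of that idea; the template strings are made up and are not the script's actual lists:
+
+```py
+import random
+
+# Illustrative templates; the real script ships longer lists for "object" and "style"
+object_templates = ["a photo of a {}", "a rendering of a {}", "a cropped photo of the {}"]
+placeholder_token = "<cat-toy>"  # whatever you pass via --placeholder_token
+
+prompt = random.choice(object_templates).format(placeholder_token)
+print(prompt)  # e.g. "a photo of a <cat-toy>"
+```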
+ +The script starts by loading the [tokenizer](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/textual_inversion/textual_inversion.py#L539), [scheduler and model](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/textual_inversion/textual_inversion.py#L545): + +```py +# Load tokenizer +if args.tokenizer_name: + tokenizer = CLIPTokenizer.from_pretrained(args.tokenizer_name) +elif args.pretrained_model_name_or_path: + tokenizer = CLIPTokenizer.from_pretrained(args.pretrained_model_name_or_path, subfolder="tokenizer") + +# Load scheduler and models +noise_scheduler = DDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler") +text_encoder = CLIPTextModel.from_pretrained( + args.pretrained_model_name_or_path, subfolder="text_encoder", revision=args.revision +) +vae = AutoencoderKL.from_pretrained( + args.pretrained_model_name_or_path, subfolder="vae", revision=args.revision, variant=args.variant +) +unet = UNet2DConditionModel.from_pretrained( + args.pretrained_model_name_or_path, subfolder="unet", revision=args.revision, variant=args.variant +) +``` + +The special [placeholder token](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/textual_inversion/textual_inversion.py#L559) is added next to the tokenizer, and the embedding is readjusted to account for the new token. + +Then, the script [creates a dataset](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/textual_inversion/textual_inversion.py#L635) from the `TextualInversionDataset`: + +```py +train_dataset = TextualInversionDataset( + data_root=args.train_data_dir, + tokenizer=tokenizer, + size=args.resolution, + placeholder_token=" ".join(tokenizer.convert_ids_to_tokens(placeholder_token_ids)), + repeats=args.repeats, + learnable_property=args.learnable_property, + center_crop=args.center_crop, + set="train", +) +train_dataloader = GeneratorDataset( + train_dataset, + column_names=["pixel_values", "input_ids"], + shuffle=True, + shard_id=args.rank, + num_shards=args.world_size, + num_parallel_workers=args.dataloader_num_workers, +).batch( + batch_size=args.train_batch_size, + num_parallel_workers=args.dataloader_num_workers, +) +``` + +Finally, the [training loop](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/textual_inversion/textual_inversion.py#L758) handles everything else from predicting the noisy residual to updating the embedding weights of the special placeholder token. + +If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline.md) tutorial which breaks down the basic pattern of the denoising process. + +## Launch the script + +Once you've made all your changes or you're okay with the default configuration, you're ready to launch the training script! 🚀 + +For this guide, you'll download some images of a [cat toy](https://huggingface.co/datasets/diffusers/cat_toy_example) and store them in a directory. But remember, you can create and use your own dataset if you want (see the [Create a dataset for training](create_dataset.md) guide). 
+
+```py
+from huggingface_hub import snapshot_download
+
+local_dir = "./cat"
+snapshot_download(
+    "diffusers/cat_toy_example", local_dir=local_dir, repo_type="dataset", ignore_patterns=".gitattributes"
+)
+```
+
+Set the environment variable `MODEL_NAME` to a model id on the Hub or a path to a local model, and `DATA_DIR` to the path where you just downloaded the cat images to. The script creates and saves the following files to your repository:
+
+- `learned_embeds.bin`: the learned embedding vectors corresponding to your example images
+- `token_identifier.txt`: the special placeholder token
+- `type_of_concept.txt`: the type of concept you're training on (either "object" or "style")
+
+One more thing before you launch the script. If you're interested in following along with the training process, you can periodically save generated images as training progresses. Add the following parameters to the training command:
+
+```bash
+--validation_prompt="A <cat-toy> train"
+--num_validation_images=4
+--validation_steps=100
+```
+
+```bash
+export MODEL_NAME="stable-diffusion-v1-5/stable-diffusion-v1-5"
+export DATA_DIR="./cat"
+
+python textual_inversion.py \
+  --pretrained_model_name_or_path=$MODEL_NAME \
+  --train_data_dir=$DATA_DIR \
+  --learnable_property="object" \
+  --placeholder_token="<cat-toy>" \
+  --initializer_token="toy" \
+  --resolution=512 \
+  --train_batch_size=1 \
+  --gradient_accumulation_steps=4 \
+  --max_train_steps=3000 \
+  --learning_rate=5.0e-04 \
+  --scale_lr \
+  --lr_scheduler="constant" \
+  --lr_warmup_steps=0 \
+  --output_dir="textual_inversion_cat" \
+  --push_to_hub
+```
+
+After training is complete, you can use your newly trained model for inference:
+
+```py
+from mindone.diffusers import StableDiffusionPipeline
+import mindspore as ms
+
+pipeline = StableDiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", mindspore_dtype=ms.float16)
+pipeline.load_textual_inversion("sd-concepts-library/cat-toy")
+image = pipeline("A <cat-toy> train", num_inference_steps=50)[0][0]
+image.save("cat-train.png")
+```
+
+## Next steps
+
+Congratulations on training your own Textual Inversion model! 🎉 To learn more about how to use your new model, the following guides may be helpful:
+
+- Learn how to [load Textual Inversion embeddings](../using-diffusers/loading_adapters.md) and also use them as negative embeddings.
+- Learn how to use [Textual Inversion](../using-diffusers/textual_inversion_inference.md) for inference with Stable Diffusion 1/2 and Stable Diffusion XL.
diff --git a/docs/diffusers/training/unconditional_training.md b/docs/diffusers/training/unconditional_training.md
new file mode 100644
index 0000000000..b53c9d3557
--- /dev/null
+++ b/docs/diffusers/training/unconditional_training.md
@@ -0,0 +1,145 @@
+
+
+# Unconditional image generation
+
+Unconditional image generation models are not conditioned on text or images during training. They only generate images that resemble their training data distribution.
+
+This guide will explore the [train_unconditional.py](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/unconditional_image_generation/train_unconditional.py) training script to help you become familiar with it, and how you can adapt it for your own use-case.
+
+Before running the script, make sure you install the library from source:
+
+```bash
+git clone https://github.com/mindspore-lab/mindone.git
+cd mindone
+pip install .
+``` + +Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset.md) guide to learn how to create a dataset that works with the training script. + +## Script parameters + +!!! tip + + The following sections highlight parts of the training script that are important for understanding how to modify it, but it doesn't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/unconditional_image_generation/train_unconditional.py) and let us know if you have any questions or concerns. + +The training script provides many parameters to help you customize your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/unconditional_image_generation/train_unconditional.py#L26) function. It provides default values for each parameter, such as the training batch size and learning rate, but you can also set your own values in the training command if you'd like. + +For example, to speedup training with mixed precision using the bf16 format, add the `--mixed_precision` parameter to the training command: + +```bash +python train_unconditional.py \ + --mixed_precision="bf16" +``` + +Some basic and important parameters to specify include: + +- `--dataset_name`: the name of the dataset on the Hub or a local path to the dataset to train on +- `--output_dir`: where to save the trained model +- `--push_to_hub`: whether to push the trained model to the Hub +- `--checkpointing_steps`: frequency of saving a checkpoint as the model trains; this is useful if training is interrupted, you can continue training from that checkpoint by adding `--resume_from_checkpoint` to your training command + +Bring your dataset, and let the training script handle everything else! + +## Training script + +The code for preprocessing the dataset and the training loop is found in the [`main()`](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/unconditional_image_generation/train_unconditional.py#L250) function. If you need to adapt the training script, this is where you'll need to make your changes. + +The `train_unconditional` script [initializes a `UNet2DModel`](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/unconditional_image_generation/train_unconditional.py#L271) if you don't provide a model configuration. 
You can configure the UNet here if you'd like: + +```py +model = UNet2DModel( + sample_size=args.resolution, + in_channels=3, + out_channels=3, + layers_per_block=2, + block_out_channels=(128, 128, 256, 256, 512, 512), + down_block_types=( + "DownBlock2D", + "DownBlock2D", + "DownBlock2D", + "DownBlock2D", + "AttnDownBlock2D", + "DownBlock2D", + ), + up_block_types=( + "UpBlock2D", + "AttnUpBlock2D", + "UpBlock2D", + "UpBlock2D", + "UpBlock2D", + "UpBlock2D", + ), +) +``` + +Next, the script initializes a [scheduler](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/unconditional_image_generation/train_unconditional.py#L309) and [optimizer](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/unconditional_image_generation/train_unconditional.py#L385): + +```py +# Initialize the scheduler +noise_scheduler = DDPMScheduler( + num_train_timesteps=args.ddpm_num_steps, + beta_schedule=args.ddpm_beta_schedule, + prediction_type=args.prediction_type, +) + + +# Initialize the optimizer +optimizer = nn.AdamWeightDecay( + unet.trainable_params(), + learning_rate=lr_scheduler, + beta1=args.adam_beta1, + beta2=args.adam_beta2, + weight_decay=args.adam_weight_decay, + eps=args.adam_epsilon, +) +``` + +Then it [loads a dataset](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/unconditional_image_generation/train_unconditional.py#L331) and you can specify how to [preprocess](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/unconditional_image_generation/train_unconditional.py#L335) it: + +```py +dataset = load_dataset("imagefolder", data_dir=args.train_data_dir, cache_dir=args.cache_dir, split="train") + +augmentations = transforms.Compose( + [ + vision.Resize(args.resolution, interpolation=vision.Inter.BILINEAR), + vision.CenterCrop(args.resolution) if args.center_crop else vision.RandomCrop(args.resolution), + vision.RandomHorizontalFlip() if args.random_flip else lambda x: x, + vision.ToTensor(), + vision.Normalize([0.5], [0.5], is_hwc=False), + ] +) +``` + +Finally, the [training loop](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/unconditional_image_generation/train_unconditional.py#L471) handles everything else such as adding noise to the images, predicting the noise residual, calculating the loss, saving checkpoints at specified steps, and saving and pushing the model to the Hub. If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline.md) tutorial which breaks down the basic pattern of the denoising process. + +## Launch the script + +Once you've made all your changes or you're okay with the default configuration, you're ready to launch the training script! 🚀 + +```bash +python train_unconditional.py \ + --dataset_name="huggan/flowers-102-categories" \ + --output_dir="ddpm-ema-flowers-64" \ + --mixed_precision="fp16" \ + --push_to_hub +``` + +The training script creates and saves a checkpoint file in your repository. 
Now you can load and use your trained model for inference: + +```py +from mindone.diffusers import DiffusionPipeline +import mindspore as ms + +pipeline = DiffusionPipeline.from_pretrained("anton-l/ddpm-butterflies-128") +image = pipeline()[0][0] +``` diff --git a/docs/diffusers/tutorials/basic_training.md b/docs/diffusers/tutorials/basic_training.md index 1883cb9931..88b994cd60 100644 --- a/docs/diffusers/tutorials/basic_training.md +++ b/docs/diffusers/tutorials/basic_training.md @@ -14,14 +14,14 @@ specific language governing permissions and limitations under the License. Unconditional image generation is a popular application of diffusion models that generates images that look like those in the dataset used for training. Typically, the best results are obtained from finetuning a pretrained model on a specific dataset. You can find many of these checkpoints on the [Hub](https://huggingface.co/search/full-text?q=unconditional-image-generation&type=model), but if you can't find one you like, you can always train your own! -This tutorial will teach you how to train a [`UNet2DModel`] from scratch on a subset of the [Smithsonian Butterflies](https://huggingface.co/datasets/huggan/smithsonian_butterflies_subset) dataset to generate your own 🦋 butterflies 🦋. +This tutorial will teach you how to train a [`UNet2DModel`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/unet2d/#mindone.diffusers.UNet2DModel) from scratch on a subset of the [Smithsonian Butterflies](https://huggingface.co/datasets/huggan/smithsonian_butterflies_subset) dataset to generate your own 🦋 butterflies 🦋. !!! tip 💡 This training tutorial is based on the [Training with 🧨 Diffusers](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/training_example.ipynb) notebook. For additional details and context about diffusion models like how they work, check out the notebook! -Before you begin, make sure you have 🤗 Datasets installed to load and preprocess image datasets, and 🤗 Accelerate, to simplify training on any number of GPUs. The following command will also install [TensorBoard](https://www.tensorflow.org/tensorboard) to visualize training metrics (you can also use [Weights & Biases](https://docs.wandb.ai/) to track your training). +Before you begin, make sure you have 🤗 Datasets installed to load and preprocess image datasets. The following command will also install [TensorBoard](https://www.tensorflow.org/tensorboard) to visualize training metrics (you can also use [Weights & Biases](https://docs.wandb.ai/) to track your training). ```py # uncomment to install the necessary libraries in Colab @@ -30,10 +30,10 @@ Before you begin, make sure you have 🤗 Datasets installed to load and preproc We encourage you to share your model with the community, and in order to do that, you'll need to login to your Hugging Face account (create one [here](https://hf.co/join) if you don't already have one!). You can login from a notebook and enter your token when prompted. Make sure your token has the write role. -```pycon ->>> from huggingface_hub import notebook_login +```python +from huggingface_hub import notebook_login ->>> notebook_login() +notebook_login() ``` Or login in from the terminal: @@ -53,41 +53,41 @@ Since the model checkpoints are quite large, install [Git-LFS](https://git-lfs.c For convenience, create a `TrainingConfig` class containing the training hyperparameters (feel free to adjust them): -```pycon ->>> from dataclasses import dataclass - ->>> @dataclass -... 
class TrainingConfig: -... image_size = 128 # the generated image resolution -... train_batch_size = 16 -... eval_batch_size = 16 # how many images to sample during evaluation -... num_epochs = 50 -... gradient_accumulation_steps = 1 -... learning_rate = 1e-4 -... lr_warmup_steps = 500 -... save_image_epochs = 10 -... save_model_epochs = 30 -... mixed_precision = "fp16" # `no` for float32, `fp16` for automatic mixed precision -... output_dir = "ddpm-butterflies-128" # the model name locally and on the HF Hub -... -... push_to_hub = True # whether to upload the saved model to the HF Hub -... hub_model_id = "/" # the name of the repository to create on the HF Hub -... hub_private_repo = False -... overwrite_output_dir = True # overwrite the old model when re-running the notebook -... seed = 0 - ->>> config = TrainingConfig() +```python +from dataclasses import dataclass + +@dataclass +class TrainingConfig: + image_size = 128 # the generated image resolution + train_batch_size = 16 + eval_batch_size = 16 # how many images to sample during evaluation + num_epochs = 50 + gradient_accumulation_steps = 1 + learning_rate = 1e-4 + lr_warmup_steps = 500 + save_image_epochs = 10 + save_model_epochs = 30 + mixed_precision = "fp16" # `no` for float32, `fp16` for automatic mixed precision + output_dir = "ddpm-butterflies-128" # the model name locally and on the HF Hub + + push_to_hub = False # whether to upload the saved model to the HF Hub + hub_model_id = "/" # the name of the repository to create on the HF Hub + hub_private_repo = False + overwrite_output_dir = True # overwrite the old model when re-running the notebook + seed = 0 + +config = TrainingConfig() ``` ## Load the dataset You can easily load the [Smithsonian Butterflies](https://huggingface.co/datasets/huggan/smithsonian_butterflies_subset) dataset with the 🤗 Datasets library: -```pycon ->>> from datasets import load_dataset +```python +from datasets import load_dataset ->>> config.dataset_name = "huggan/smithsonian_butterflies_subset" ->>> dataset = load_dataset(config.dataset_name, split="train") +config.dataset_name = "huggan/smithsonian_butterflies_subset" +dataset = load_dataset(config.dataset_name, split="train") ``` !!! tip @@ -96,17 +96,17 @@ You can easily load the [Smithsonian Butterflies](https://huggingface.co/dataset 🤗 Datasets uses the [`~datasets.Image`] feature to automatically decode the image data and load it as a [`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html) which we can visualize: -```pycon ->>> import matplotlib.pyplot as plt +```python +import matplotlib.pyplot as plt ->>> fig, axs = plt.subplots(1, 4, figsize=(16, 4)) ->>> for i, image in enumerate(dataset[:4]["image"]): -... axs[i].imshow(image) -... axs[i].set_axis_off() ->>> fig.show() +fig, axs = plt.subplots(1, 4, figsize=(16, 4)) +for i, image in enumerate(dataset[:4]["image"]): + axs[i].imshow(image) + axs[i].set_axis_off() +fig.show() ``` -
@@ -116,90 +116,90 @@ The images are all different sizes though, so you'll need to preprocess them fir * `RandomHorizontalFlip` augments the dataset by randomly mirroring the images. * `Normalize` is important to rescale the pixel values into a [-1, 1] range, which is what the model expects. -```pycon ->>> from mindspore.dataset import transforms, vision - ->>> preprocess = transforms.Compose( -... [ -... vision.Resize((config.image_size, config.image_size)), -... vision.RandomHorizontalFlip(), -... vision.ToTensor(), -... vision.Normalize([0.5], [0.5], is_hwc=False), -... ] -... ) +```python +from mindspore.dataset import transforms, vision + +preprocess = transforms.Compose( + [ + vision.Resize((config.image_size, config.image_size)), + vision.RandomHorizontalFlip(), + vision.ToTensor(), + vision.Normalize([0.5], [0.5], is_hwc=False), + ] +) ``` Use 🤗 Datasets' [`~datasets.Dataset.set_transform`] method to apply the `preprocess` function on the fly during training: -```pycon ->>> def transform(examples): -... images = [preprocess(image.convert("RGB"))[0] for image in examples["image"]] -... return {"images": images} +```python +def transform(examples): + images = [preprocess(image.convert("RGB"))[0] for image in examples["image"]] + return {"images": images} ->>> dataset.set_transform(transform) +dataset.set_transform(transform) ``` Feel free to visualize the images again to confirm that they've been resized. Now you're ready to wrap the dataset in a [DataLoader](https://www.mindspore.cn/docs/zh-CN/master/api_python/dataset/mindspore.dataset.GeneratorDataset.html) for training! -```pycon ->>> from mindspore.dataset import GeneratorDataset - ->>> class DatasetForMindData: -... def __init__(self, data): -... self.data = data -... -... def __getitem__(self, idx): -... idx = idx.item() if isinstance(idx, np.integer) else idx -... return np.array(self.data[idx]["images"], dtype=np.float32) -... -... def __len__(self): -... return len(self.data) - ->>> train_dataloader = GeneratorDataset(DatasetForMindData(dataset), batch_size=config.train_batch_size, shuffle=True) +```python +from mindspore.dataset import GeneratorDataset + +class DatasetForMindData: + def __init__(self, data): + self.data = data + + def __getitem__(self, idx): + idx = idx.item() if isinstance(idx, np.integer) else idx + return np.array(self.data[idx]["images"], dtype=np.float32) + + def __len__(self): + return len(self.data) + +train_dataloader = GeneratorDataset(DatasetForMindData(dataset), column_names=["images"], shuffle=True).batch(batch_size=config.train_batch_size) ``` ## Create a UNet2DModel -Pretrained models in 🧨 Diffusers are easily created from their model class with the parameters you want. For example, to create a [`UNet2DModel`]: - -```pycon ->>> from mindone.diffusers import UNet2DModel - ->>> model = UNet2DModel( -... sample_size=config.image_size, # the target image resolution -... in_channels=3, # the number of input channels, 3 for RGB images -... out_channels=3, # the number of output channels -... layers_per_block=2, # how many ResNet layers to use per UNet block -... block_out_channels=(128, 128, 256, 256, 512, 512), # the number of output channels for each UNet block -... down_block_types=( -... "DownBlock2D", # a regular ResNet downsampling block -... "DownBlock2D", -... "DownBlock2D", -... "DownBlock2D", -... "AttnDownBlock2D", # a ResNet downsampling block with spatial self-attention -... "DownBlock2D", -... ), -... up_block_types=( -... "UpBlock2D", # a regular ResNet upsampling block -... 
"AttnUpBlock2D", # a ResNet upsampling block with spatial self-attention -... "UpBlock2D", -... "UpBlock2D", -... "UpBlock2D", -... "UpBlock2D", -... ), -... ) +Pretrained models in 🧨 Diffusers are easily created from their model class with the parameters you want. For example, to create a [`UNet2DModel`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/unet2d/#mindone.diffusers.UNet2DModel): + +```python +from mindone.diffusers import UNet2DModel + +model = UNet2DModel( + sample_size=config.image_size, # the target image resolution + in_channels=3, # the number of input channels, 3 for RGB images + out_channels=3, # the number of output channels + layers_per_block=2, # how many ResNet layers to use per UNet block + block_out_channels=(128, 128, 256, 256, 512, 512), # the number of output channels for each UNet block + down_block_types=( + "DownBlock2D", # a regular ResNet downsampling block + "DownBlock2D", + "DownBlock2D", + "DownBlock2D", + "AttnDownBlock2D", # a ResNet downsampling block with spatial self-attention + "DownBlock2D", + ), + up_block_types=( + "UpBlock2D", # a regular ResNet upsampling block + "AttnUpBlock2D", # a ResNet upsampling block with spatial self-attention + "UpBlock2D", + "UpBlock2D", + "UpBlock2D", + "UpBlock2D", + ), +) ``` It is often a good idea to quickly check the sample image shape matches the model output shape: -```pycon ->>> sample_image = dataset[0]["images"].unsqueeze(0) ->>> print("Input shape:", sample_image.shape) -Input shape: [1, 3, 128, 128] +```python +sample_image = mindspore.Tensor(dataset[0]["images"]).unsqueeze(0) +print("Input shape:", sample_image.shape) +Input shape: (1, 3, 128, 128) ->>> print("Output shape:", model(mindspore.Tensor(sample_image), timestep=0)[0].shape) -Output shape: [1, 3, 128, 128] +print("Output shape:", model(mindspore.Tensor(sample_image), timestep=0)[0].shape) +Output shape: (1, 3, 128, 128) ``` Great! Next, you'll need a scheduler to add some noise to the image. @@ -208,32 +208,32 @@ Great! Next, you'll need a scheduler to add some noise to the image. The scheduler behaves differently depending on whether you're using the model for training or inference. During inference, the scheduler generates image from the noise. During training, the scheduler takes a model output - or a sample - from a specific point in the diffusion process and applies noise to the image according to a *noise schedule* and an *update rule*. 
-Let's take a look at the [`DDPMScheduler`] and use the `add_noise` method to add some random noise to the `sample_image` from before: +Let's take a look at the [`DDPMScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/ddpm/#mindone.diffusers.DDPMScheduler) and use the `add_noise` method to add some random noise to the `sample_image` from before: -```pycon ->>> import mindspore ->>> from PIL import Image ->>> from mindone.diffusers import DDPMScheduler +```python +import mindspore +from PIL import Image +from mindone.diffusers import DDPMScheduler ->>> noise_scheduler = DDPMScheduler(num_train_timesteps=1000) ->>> noise = mindspore.ops.randn(sample_image.shape) ->>> timesteps = mindspore.Tensor([50]) ->>> noisy_image = noise_scheduler.add_noise(sample_image, noise, timesteps) +noise_scheduler = DDPMScheduler(num_train_timesteps=1000) +noise = mindspore.ops.randn(sample_image.shape) +timesteps = mindspore.Tensor([50]) +noisy_image = noise_scheduler.add_noise(sample_image, noise, timesteps) ->>> Image.fromarray(((noisy_image.permute(0, 2, 3, 1) + 1.0) * 127.5).type(mindspore.uint8).numpy()[0]) +Image.fromarray(((noisy_image.permute(0, 2, 3, 1) + 1.0) * 127.5).type(mindspore.uint8).numpy()[0]) ``` -
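+
+For reference, `add_noise` applies the standard DDPM forward process, `x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise`, where `alpha_bar_t` is the cumulative product of the schedule's alphas. The sketch below redoes that computation by hand; it assumes the scheduler exposes `alphas_cumprod` the same way the upstream Diffusers implementation does:
+
+```python
+from mindspore import ops
+
+# Manual version of what `add_noise` computed above (assumes `alphas_cumprod` is available)
+alpha_bar_t = noise_scheduler.alphas_cumprod[timesteps]  # cumulative alphas at t=50
+manual_noisy = ops.sqrt(alpha_bar_t) * sample_image + ops.sqrt(1.0 - alpha_bar_t) * noise
+# `manual_noisy` should match `noisy_image` from the scheduler call above
+```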
The training objective of the model is to predict the noise added to the image. The loss at this step can be calculated by: -```pycon ->>> from mindspore import ops +```python +from mindspore import ops ->>> noise_pred = model(noisy_image, timesteps)[0] ->>> loss = ops.mse_loss(noise_pred, noise) +noise_pred = model(noisy_image, timesteps)[0] +loss = ops.mse_loss(noise_pred, noise) ``` ## Train the model @@ -242,40 +242,41 @@ By now, you have most of the pieces to start training the model and all that's l First, you'll need an optimizer and a learning rate scheduler: -```pycon ->>> from mindspore import nn ->>> from mindone.diffusers.optimization import get_cosine_schedule_with_warmup +```python +from mindspore import nn +from mindone.diffusers.optimization import get_cosine_schedule_with_warmup ->>> lr_scheduler = get_cosine_schedule_with_warmup( -... config.learning_rate -... num_warmup_steps=config.lr_warmup_steps, -... num_training_steps=(len(train_dataloader) * config.num_epochs), -... ) ->>> optimizer = nn.AdamWeightDecay(model.trainable_params(), learning_rate=lr_scheduler) +lr_scheduler = get_cosine_schedule_with_warmup( + config.learning_rate, + num_warmup_steps=config.lr_warmup_steps, + num_training_steps=(len(train_dataloader) * config.num_epochs), +) +optimizer = nn.AdamWeightDecay(model.trainable_params(), learning_rate=lr_scheduler) ``` -Then, you'll need a way to evaluate the model. For evaluation, you can use the [`DDPMPipeline`] to generate a batch of sample images and save it as a grid: - -```pycon ->>>import numpy as np from mindone.diffusers import DDPMPipeline ->>> from mindone.diffusers.utils import make_image_grid ->>> import os - ->>> def evaluate(config, epoch, pipeline): -... # Sample some images from random noise (this is the backward diffusion process). -... # The default pipeline output type is `List[PIL.Image]` -... images = pipeline( -... batch_size=config.eval_batch_size, -... generator=np.random.Generator(np.random.PCG64(config.seed)), -... )[0] -... -... # Make a grid out of the images -... image_grid = make_image_grid(images, rows=4, cols=4) -... -... # Save the images -... test_dir = os.path.join(config.output_dir, "samples") -... os.makedirs(test_dir, exist_ok=True) -... image_grid.save(f"{test_dir}/{epoch:04d}.png") +Then, you'll need a way to evaluate the model. For evaluation, you can use the [`DDPMPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/unet2d/#mindone.diffusers.DDPMPipeline) to generate a batch of sample images and save it as a grid: + +```python +import numpy as np +from mindone.diffusers import DDPMPipeline +from mindone.diffusers.utils import make_image_grid +import os + +def evaluate(config, epoch, pipeline): + # Sample some images from random noise (this is the backward diffusion process). + # The default pipeline output type is `List[PIL.Image]` + images = pipeline( + batch_size=config.eval_batch_size, + generator=np.random.Generator(np.random.PCG64(config.seed)), + )[0] + + # Make a grid out of the images + image_grid = make_image_grid(images, rows=4, cols=4) + + # Save the images + test_dir = os.path.join(config.output_dir, "samples") + os.makedirs(test_dir, exist_ok=True) + image_grid.save(f"{test_dir}/{epoch:04d}.png") ``` Now you can wrap all these components together in a training loop with TensorBoard logging, gradient accumulation, and mixed precision training. To upload the model to the Hub, write a function to get your repository name and information and then push it to the Hub. 
@@ -284,127 +285,130 @@ Now you can wrap all these components together in a training loop with TensorBoa 💡 The training loop below may look intimidating and long, but it'll be worth it later when you launch your training in just one line of code! If you can't wait and want to start generating images, feel free to copy and run the code below. You can always come back and examine the training loop more closely later, like when you're waiting for your model to finish training. 🤗 -```pycon ->>> from huggingface_hub import create_repo, upload_folder ->>> from tqdm.auto import tqdm ->>> from pathlib import Path ->>> import os ->>> from mindone.diffusers.training_utils import TrainStep - ->>> # Write your train step ->>> class MyTrainStep(TrainStep): -... def __init__( -... self, -... model: nn.Cell, -... optimizer: nn.Optimizer, -... noise_scheduler, -... gradient_accumulation_steps, -... length_of_dataloader, -... ): -... super().__init__( -... model, -... optimizer, -... StaticLossScaler(65536), -... 1.0, -... gradient_accumulation_steps, -... gradient_accumulation_kwargs=dict(length_of_dataloader=length_of_dataloader), -... ) -... self.model = model -... self.noise_scheduler = noise_scheduler -... self.noise_scheduler_num_train_timesteps = noise_scheduler.config.num_train_timesteps -... -... def forward(self, clean_images): -... # Sample noise to add to the images -... noise = ops.randn(clean_images.shape) -... bs = clean_images.shape[0] -... -... # Sample a random timestep for each image -... timesteps = ops.randint( -... 0, noise_scheduler_num_train_timesteps, (bs,), dtype=mindspore.int64 -... ) -... -... # Add noise to the clean images according to the noise magnitude at each timestep -... # (this is the forward diffusion process) -... noisy_images = self.noise_scheduler.add_noise(clean_images, noise, timesteps) -... -... # Predict the noise residual -... noise_pred = self.model(noisy_images, timesteps, return_dict=False)[0] -... loss = ops.mse_loss(noise_pred, noise) -... loss = self.scale_loss(loss) -... return loss, noise_pred - ->>> is_main_process, is_local_main_process = True, True ->>> train_step = MyTrainStep(model, optimizer, noise_scheduler, config.gradient_accumulation_steps, len(train_dataloader)) ->>> pipeline = DDPMPipeline(unet=model, scheduler=noise_scheduler) - ->>> def train_loop(config, model, noise_scheduler, optimizer, train_dataloader, lr_scheduler): -... if is_main_process: -... if config.output_dir is not None: -... os.makedirs(config.output_dir, exist_ok=True) -... if config.push_to_hub: -... repo_id = create_repo( -... repo_id=config.hub_model_id or Path(config.output_dir).name, exist_ok=True -... ).repo_id -... -... global_step = 0 -... -... # Now you train the model -... for epoch in range(config.num_epochs): -... progress_bar = tqdm(total=len(train_dataloader), disable=not is_local_main_process) -... progress_bar.set_description(f"Epoch {epoch}") -... -... for step, batch in enumerate(train_dataloader): -... loss, model_pred = train_step(*batch) -... -... progress_bar.update(1) -... logs = {"loss": loss.detach().item(), "lr": lr_scheduler.get_last_lr()[0], "step": global_step} -... progress_bar.set_postfix(**logs) -... accelerator.log(logs, step=global_step) -... global_step += 1 -... -... # After each epoch you optionally sample some demo images with evaluate() and save the model -... if is_main_process: -... if (epoch + 1) % config.save_image_epochs == 0 or epoch == config.num_epochs - 1: -... evaluate(config, epoch, pipeline) -... -... 
if (epoch + 1) % config.save_model_epochs == 0 or epoch == config.num_epochs - 1: -... if config.push_to_hub: -... upload_folder( -... repo_id=repo_id, -... folder_path=config.output_dir, -... commit_message=f"Epoch {epoch}", -... ignore_patterns=["step_*", "epoch_*"], -... ) -... else: -... pipeline.save_pretrained(config.output_dir) +```python +from huggingface_hub import create_repo, upload_folder +from tqdm.auto import tqdm +from pathlib import Path +import os +from mindone.diffusers.training_utils import TrainStep +from mindspore.amp import StaticLossScaler + +# Write your train step +class MyTrainStep(TrainStep): + def __init__( + self, + model: nn.Cell, + optimizer: nn.Optimizer, + noise_scheduler, + gradient_accumulation_steps, + length_of_dataloader, + ): + super().__init__( + model, + optimizer, + StaticLossScaler(65536), + 1.0, + gradient_accumulation_steps, + gradient_accumulation_kwargs=dict(length_of_dataloader=length_of_dataloader), + ) + self.model = model + self.noise_scheduler = noise_scheduler + self.noise_scheduler_num_train_timesteps = noise_scheduler.config.num_train_timesteps + + def forward(self, clean_images): + # Sample noise to add to the images + noise = ops.randn(clean_images.shape) + bs = clean_images.shape[0] + + # Sample a random timestep for each image + timesteps = ops.randint( + 0, self.noise_scheduler_num_train_timesteps, (bs,), dtype=mindspore.int64 + ) + + # Add noise to the clean images according to the noise magnitude at each timestep + # (this is the forward diffusion process) + noisy_images = self.noise_scheduler.add_noise(clean_images, noise, timesteps) + + # Predict the noise residual + noise_pred = self.model(noisy_images, timesteps, return_dict=False)[0] + loss = ops.mse_loss(noise_pred, noise) + loss = self.scale_loss(loss) + return loss, noise_pred + +is_main_process, is_local_main_process = True, True +train_step = MyTrainStep(model, optimizer, noise_scheduler, config.gradient_accumulation_steps, len(train_dataloader)).set_train() +pipeline = DDPMPipeline(unet=model, scheduler=noise_scheduler) + +def train_loop(config, model, noise_scheduler, optimizer, train_dataloader, lr_scheduler): + if is_main_process: + if config.output_dir is not None: + os.makedirs(config.output_dir, exist_ok=True) + if config.push_to_hub: + repo_id = create_repo( + repo_id=config.hub_model_id or Path(config.output_dir).name, exist_ok=True + ).repo_id + + global_step = 0 + + # Now you train the model + for epoch in range(config.num_epochs): + progress_bar = tqdm(total=len(train_dataloader), disable=not is_local_main_process) + progress_bar.set_description(f"Epoch {epoch}") + + for step, batch in enumerate(train_dataloader): + loss, model_pred = train_step(*batch) + + progress_bar.update(1) + logs = {"loss": loss.item(), "lr": optimizer.get_lr().numpy().item(), "step": global_step} + progress_bar.set_postfix(**logs) + global_step += 1 + + # After each epoch you optionally sample some demo images with evaluate() and save the model + if is_main_process: + if (epoch + 1) % config.save_image_epochs == 0 or epoch == config.num_epochs - 1: + evaluate(config, epoch, pipeline) + + if (epoch + 1) % config.save_model_epochs == 0 or epoch == config.num_epochs - 1: + if config.push_to_hub: + upload_folder( + repo_id=repo_id, + folder_path=config.output_dir, + commit_message=f"Epoch {epoch}", + ignore_patterns=["step_*", "epoch_*"], + ) + else: + pipeline.save_pretrained(config.output_dir) ``` If you want to launch a distributed training, see 
[tutorial](https://www.mindspore.cn/tutorials/experts/zh-CN/master/parallel/overview.html) from mindspore. And you can get the rank of process by: -```pycon ->>> from mindspore.communication import get_local_rank, get_rank ->>> rank, local_rank = get_rank(), get_local_rank() ->>> is_main_process, is_local_main_process = rank == 0, local_rank == 0 +```python +from mindspore.communication import get_local_rank, get_rank +rank, local_rank = get_rank(), get_local_rank() +is_main_process, is_local_main_process = rank == 0, local_rank == 0 + +mindspore.set_context(mode=mindspore.GRAPH_MODE, jit_syntax_level=mindspore.STRICT) +train_loop(config, model, noise_scheduler, optimizer, train_dataloader, lr_scheduler) ``` Once training is complete, take a look at the final 🦋 images 🦋 generated by your diffusion model! -```pycon ->>> import glob +```python +import glob ->>> sample_images = sorted(glob.glob(f"{config.output_dir}/samples/*.png")) ->>> Image.open(sample_images[-1]) +sample_images = sorted(glob.glob(f"{config.output_dir}/samples/*.png")) +Image.open(sample_images[-1]) ``` -
## Next steps -Unconditional image generation is one example of a task that can be trained. You can explore other tasks and training techniques by visiting the [🧨 Diffusers Training Examples](../training/overview) page. Here are some examples of what you can learn: +Unconditional image generation is one example of a task that can be trained. You can explore other tasks and training techniques by visiting the [🧨 Diffusers Training Examples](../training/overview.md) page. Here are some examples of what you can learn: -* [Textual Inversion](../training/text_inversion), an algorithm that teaches a model a specific visual concept and integrates it into the generated image. -* [DreamBooth](../training/dreambooth), a technique for generating personalized images of a subject given several input images of the subject. -* [Guide](../training/text2image) to finetuning a Stable Diffusion model on your own dataset. -* [Guide](../training/lora) to using LoRA, a memory-efficient technique for finetuning really large models faster. +* [Textual Inversion](../training/text_inversion.md), an algorithm that teaches a model a specific visual concept and integrates it into the generated image. +* [DreamBooth](../training/dreambooth.md), a technique for generating personalized images of a subject given several input images of the subject. +* [Guide](../training/text2image.md) to finetuning a Stable Diffusion model on your own dataset. +* [Guide](../training/lora.md) to using LoRA, a memory-efficient technique for finetuning really large models faster. diff --git a/docs/diffusers/tutorials/using_peft_for_inference.md b/docs/diffusers/tutorials/using_peft_for_inference.md index 6b034badc7..05317c069c 100644 --- a/docs/diffusers/tutorials/using_peft_for_inference.md +++ b/docs/diffusers/tutorials/using_peft_for_inference.md @@ -22,7 +22,7 @@ Let's first install all the required libraries. !pip install transformers mindone ``` -Now, load a pipeline with a [Stable Diffusion XL (SDXL)](../api/pipelines/stable_diffusion/stable_diffusion_xl) checkpoint: +Now, load a pipeline with a [Stable Diffusion XL (SDXL)](../api/pipelines/stable_diffusion/stable_diffusion_xl.md) checkpoint: ```python from mindone.diffusers import DiffusionPipeline @@ -33,7 +33,7 @@ pipe_id = "stabilityai/stable-diffusion-xl-base-1.0" pipe = DiffusionPipeline.from_pretrained(pipe_id, mindspore_dtype=mindspore.float16) ``` -Next, load a [CiroN2022/toy-face](https://huggingface.co/CiroN2022/toy-face) adapter with the [`~diffusers.loaders.StableDiffusionXLLoraLoaderMixin.load_lora_weights`] method. With the 🤗 PEFT integration, you can assign a specific `adapter_name` to the checkpoint, which let's you easily switch between different LoRA checkpoints. Let's call this adapter `"toy"`. +Next, load a [CiroN2022/toy-face](https://huggingface.co/CiroN2022/toy-face) adapter with the [`load_lora_weights`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/lora#mindone.diffusers.loaders.lora.StableDiffusionXLLoraLoaderMixin.load_lora_weights) method. With the 🤗 PEFT integration, you can assign a specific `adapter_name` to the checkpoint, which let's you easily switch between different LoRA checkpoints. Let's call this adapter `"toy"`. 
```python pipe.load_lora_weights("CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors", adapter_name="toy") @@ -44,18 +44,18 @@ Make sure to include the token `toy_face` in the prompt and then you can perform ```python prompt = "toy_face of a hacker with a hoodie" -lora_scale= 0.9 +lora_scale = 0.9 image = pipe( prompt, num_inference_steps=30, cross_attention_kwargs={"scale": lora_scale}, generator=np.random.Generator(np.random.PCG64(0)) )[0][0] image ``` -![toy-face](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peft_integration/diffusers_peft_lora_inference_8_1.png) +![toy-face](https://github.com/user-attachments/assets/c1796924-ee98-49c4-829b-887874ed7f3d) With the `adapter_name` parameter, it is really easy to use another adapter for inference! Load the [nerijs/pixel-art-xl](https://huggingface.co/nerijs/pixel-art-xl) adapter that has been fine-tuned to generate pixel art images and call it `"pixel"`. -The pipeline automatically sets the first loaded adapter (`"toy"`) as the active adapter, but you can activate the `"pixel"` adapter with the [`~diffusers.loaders.UNet2DConditionLoadersMixin.set_adapters`] method: +The pipeline automatically sets the first loaded adapter (`"toy"`) as the active adapter, but you can activate the `"pixel"` adapter with the [`set_adapters`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/unet/#mindone.diffusers.loaders.unet.UNet2DConditionLoadersMixin.set_adapters) method: ```python pipe.load_lora_weights("nerijs/pixel-art-xl", weight_name="pixel-art-xl.safetensors", adapter_name="pixel") @@ -72,13 +72,13 @@ image = pipe( image ``` -![pixel-art](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peft_integration/diffusers_peft_lora_inference_12_1.png) +![pixel-art](https://github.com/user-attachments/assets/fa0e31c8-787e-42dd-8027-a8be89884863) ## Merge adapters You can also merge different adapter checkpoints for inference to blend their styles together. -Once again, use the [`~diffusers.loaders.UNet2DConditionLoadersMixin.set_adapters`] method to activate the `pixel` and `toy` adapters and specify the weights for how they should be merged. +Once again, use the [`set_adapters`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/unet/#mindone.diffusers.loaders.unet.UNet2DConditionLoadersMixin.set_adapters) method to activate the `pixel` and `toy` adapters and specify the weights for how they should be merged. ```python pipe.set_adapters(["pixel", "toy"], adapter_weights=[0.5, 1.0]) @@ -86,58 +86,115 @@ pipe.set_adapters(["pixel", "toy"], adapter_weights=[0.5, 1.0]) !!! tip - LoRA checkpoints in the diffusion community are almost always obtained with [DreamBooth](https://huggingface.co/docs/diffusers/main/en/training/dreambooth). DreamBooth training often relies on "trigger" words in the input text prompts in order for the generation results to look as expected. When you combine multiple LoRA checkpoints, it's important to ensure the trigger words for the corresponding LoRA checkpoints are present in the input text prompts. + LoRA checkpoints in the diffusion community are almost always obtained with [DreamBooth](https://mindspore-lab.github.io/mindone/latest/diffusers/training/dreambooth/). DreamBooth training often relies on "trigger" words in the input text prompts in order for the generation results to look as expected. 
When you combine multiple LoRA checkpoints, it's important to ensure the trigger words for the corresponding LoRA checkpoints are present in the input text prompts. Remember to use the trigger words for [CiroN2022/toy-face](https://hf.co/CiroN2022/toy-face) and [nerijs/pixel-art-xl](https://hf.co/nerijs/pixel-art-xl) (these are found in their repositories) in the prompt to generate an image. ```python prompt = "toy_face of a hacker with a hoodie, pixel art" image = pipe( - prompt, num_inference_steps=30, cross_attention_kwargs={"scale": 1.0}, generator=torch.manual_seed(0) -).images[0] + prompt, num_inference_steps=30, cross_attention_kwargs={"scale": 1.0}, generator=np.random.Generator(np.random.PCG64(0)) +)[0][0] image ``` -![toy-face-pixel-art](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peft_integration/diffusers_peft_lora_inference_16_1.png) +![toy-face-pixel-art](https://github.com/user-attachments/assets/ee327669-3c18-4293-8eaa-0bbd93afbe02) Impressive! As you can see, the model generated an image that mixed the characteristics of both adapters. !!! tip - Through its PEFT integration, Diffusers also offers more efficient merging methods which you can learn about in the [Merge LoRAs](../using-diffusers/merge_loras) guide! + Through its PEFT integration, Diffusers also offers more efficient merging methods which you can learn about in the [Merge LoRAs](../using-diffusers/merge_loras.md) guide! -To return to only using one adapter, use the [`~diffusers.loaders.UNet2DConditionLoadersMixin.set_adapters`] method to activate the `"toy"` adapter: +To return to only using one adapter, use the [`set_adapters`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/unet/#mindone.diffusers.loaders.unet.UNet2DConditionLoadersMixin.set_adapters) method to activate the `"toy"` adapter: ```python pipe.set_adapters("toy") prompt = "toy_face of a hacker with a hoodie" -lora_scale= 0.9 +lora_scale = 0.9 image = pipe( prompt, num_inference_steps=30, cross_attention_kwargs={"scale": lora_scale}, generator=np.random.Generator(np.random.PCG64(0)) )[0][0] image ``` -Or to disable all adapters entirely, use the [`~diffusers.loaders.UNet2DConditionLoadersMixin.disable_lora`] method to return the base model. +Or to disable all adapters entirely, use the [`disable_lora`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/unet/#mindone.diffusers.loaders.unet.UNet2DConditionLoadersMixin.disable_lora) method to return the base model. ```python pipe.disable_lora() prompt = "toy_face of a hacker with a hoodie" -lora_scale= 0.9 image = pipe(prompt, num_inference_steps=30, generator=np.random.Generator(np.random.PCG64(0)))[0][0] image ``` +![no-lora](https://github.com/user-attachments/assets/c17dc29e-4a5f-4243-b5f6-18b3dc05e570) + +### Customize adapters strength +For even more customization, you can control how strongly the adapter affects each part of the pipeline. For this, pass a dictionary with the control strengths (called "scales") to [`set_adapters`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/unet/#mindone.diffusers.loaders.unet.UNet2DConditionLoadersMixin.set_adapters). 
+ +For example, here's how you can turn on the adapter for the `down` parts, but turn it off for the `mid` and `up` parts: +```python +pipe.enable_lora() # enable lora again, after we disabled it above +prompt = "toy_face of a hacker with a hoodie, pixel art" +adapter_weight_scales = { "unet": { "down": 1, "mid": 0, "up": 0} } +pipe.set_adapters("pixel", adapter_weight_scales) +image = pipe(prompt, num_inference_steps=30, generator=np.random.Generator(np.random.PCG64(0)))[0][0] +image +``` + +![block-lora-text-and-down](https://github.com/user-attachments/assets/97822bc2-643b-44bd-837d-94b3f309cf20) + +Let's see how turning off the `down` part and turning on the `mid` and `up` part respectively changes the image. +```python +adapter_weight_scales = { "unet": { "down": 0, "mid": 1, "up": 0} } +pipe.set_adapters("pixel", adapter_weight_scales) +image = pipe(prompt, num_inference_steps=30, generator=np.random.Generator(np.random.PCG64(0)))[0][0] +image +``` + +![block-lora-text-and-mid](https://github.com/user-attachments/assets/86469036-8492-4cd3-bed7-493cf0c28da2) + +```python +adapter_weight_scales = { "unet": { "down": 0, "mid": 0, "up": 1} } +pipe.set_adapters("pixel", adapter_weight_scales) +image = pipe(prompt, num_inference_steps=30, generator=np.random.Generator(np.random.PCG64(0)))[0][0] +image +``` + +![block-lora-text-and-up](https://github.com/user-attachments/assets/b5d80d23-e463-41f3-a9b6-6a5f8f55a7b8) + +Looks cool! + +This is a really powerful feature. You can use it to control the adapter strengths down to per-transformer level. And you can even use it for multiple adapters. +```python +adapter_weight_scales_toy = 0.5 +adapter_weight_scales_pixel = { + "unet": { + "down": 0.9, # all transformers in the down-part will use scale 0.9 + # "mid" # because, in this example, "mid" is not given, all transformers in the mid part will use the default scale 1.0 + "up": { + "block_0": 0.6, # all 3 transformers in the 0th block in the up-part will use scale 0.6 + "block_1": [0.4, 0.8, 1.0], # the 3 transformers in the 1st block in the up-part will use scales 0.4, 0.8 and 1.0 respectively + } + } +} +pipe.set_adapters(["toy", "pixel"], [adapter_weight_scales_toy, adapter_weight_scales_pixel]) +image = pipe(prompt, num_inference_steps=30, generator=np.random.Generator(np.random.PCG64(0)))[0][0] +image +``` + +![block-lora-mixed](https://github.com/user-attachments/assets/c4ffe4dc-6bf9-48e1-9a9e-4ca35c24a7a1) + ## Manage active adapters -You have attached multiple adapters in this tutorial, and if you're feeling a bit lost on what adapters have been attached to the pipeline's components, use the [`~diffusers.loaders.LoraLoaderMixin.get_active_adapters`] method to check the list of active adapters: +You have attached multiple adapters in this tutorial, and if you're feeling a bit lost on what adapters have been attached to the pipeline's components, use the [`get_active_adapters`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/lora/#mindone.diffusers.loaders.lora.LoraLoaderMixin.get_active_adapters) method to check the list of active adapters: ```py active_adapters = pipe.get_active_adapters() active_adapters -["toy", "pixel"] +['toy', 'pixel'] ``` You can also get the active adapters of each pipeline component with [`~diffusers.loaders.LoraLoaderMixin.get_list_adapters`]: @@ -145,5 +202,5 @@ You can also get the active adapters of each pipeline component with [`~diffuser ```py list_adapters_component_wise = pipe.get_list_adapters() list_adapters_component_wise 
-{"text_encoder": ["toy", "pixel"], "unet": ["toy", "pixel"], "text_encoder_2": ["toy", "pixel"]} +{"unet": ['toy', 'pixel']} ``` diff --git a/docs/diffusers/using-diffusers/callback.md b/docs/diffusers/using-diffusers/callback.md new file mode 100644 index 0000000000..1ef67d8baa --- /dev/null +++ b/docs/diffusers/using-diffusers/callback.md @@ -0,0 +1,184 @@ + + +# Pipeline callbacks + +The denoising loop of a pipeline can be modified with custom defined functions using the `callback_on_step_end` parameter. The callback function is executed at the end of each step, and modifies the pipeline attributes and variables for the next step. This is really useful for *dynamically* adjusting certain pipeline attributes or modifying tensor variables. This versatility allows for interesting use-cases such as changing the prompt embeddings at each timestep, assigning different weights to the prompt embeddings, and editing the guidance scale. With callbacks, you can implement new features without modifying the underlying code! + +This guide will demonstrate how callbacks work by a few features you can implement with them. + +## Dynamic classifier-free guidance + +Dynamic classifier-free guidance (CFG) is a feature that allows you to disable CFG after a certain number of inference steps which can help you save compute with minimal cost to performance. The callback function for this should have the following arguments: + +- `pipeline` (or the pipeline instance) provides access to important properties such as `num_timesteps` and `guidance_scale`. You can modify these properties by updating the underlying attributes. For this example, you'll disable CFG by setting `pipeline._guidance_scale=0.0`. +- `step_index` and `timestep` tell you where you are in the denoising loop. Use `step_index` to turn off CFG after reaching 40% of `num_timesteps`. +- `callback_kwargs` is a dict that contains tensor variables you can modify during the denoising loop. It only includes variables specified in the `callback_on_step_end_tensor_inputs` argument, which is passed to the pipeline's `__call__` method. Different pipelines may use different sets of variables, so please check a pipeline's `_callback_tensor_inputs` attribute for the list of variables you can modify. Some common variables include `latents` and `prompt_embeds`. For this function, change the batch size of `prompt_embeds` after setting `guidance_scale=0.0` in order for it to work properly. + +Your callback function should look something like this: + +```python +def callback_dynamic_cfg(pipeline, step_index, timestep, callback_kwargs): + # adjust the batch_size of prompt_embeds according to guidance_scale + if step_index == int(pipeline.num_timesteps * 0.4): + prompt_embeds = callback_kwargs["prompt_embeds"] + prompt_embeds = prompt_embeds.chunk(2)[-1] + + # update guidance_scale and prompt_embeds + pipeline._guidance_scale = 0.0 + callback_kwargs["prompt_embeds"] = prompt_embeds + return callback_kwargs +``` + +Now, you can pass the callback function to the `callback_on_step_end` parameter and the `prompt_embeds` to `callback_on_step_end_tensor_inputs`. 
+ +```py +import mindspore as ms +from mindone.diffusers import StableDiffusionPipeline +import numpy as np + +pipeline = StableDiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", mindspore_dtype=ms.float16) + +prompt = "a photo of an astronaut riding a horse on mars" + +generator = np.random.Generator(np.random.PCG64(1)) +out = pipeline( + prompt, + generator=generator, + callback_on_step_end=callback_dynamic_cfg, + callback_on_step_end_tensor_inputs=['prompt_embeds'] +) + +out[0][0].save("out_custom_cfg.png") +``` + +## Interrupt the diffusion process + +!!! tip + + The interruption callback is supported for text-to-image, image-to-image, and inpainting for the [StableDiffusionPipeline](../api/pipelines/stable_diffusion/overview.md) and [StableDiffusionXLPipeline](../api/pipelines/stable_diffusion/stable_diffusion_xl.md). + +Stopping the diffusion process early is useful when building UIs that work with Diffusers because it allows users to stop the generation process if they're unhappy with the intermediate results. You can incorporate this into your pipeline with a callback. + +This callback function should take the following arguments: `pipeline`, `i`, `t`, and `callback_kwargs` (this must be returned). Set the pipeline's `_interrupt` attribute to `True` to stop the diffusion process after a certain number of steps. You are also free to implement your own custom stopping logic inside the callback. + +In this example, the diffusion process is stopped after 10 steps even though `num_inference_steps` is set to 50. + +```python +from mindone.diffusers import StableDiffusionPipeline + +pipeline = StableDiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5") +num_inference_steps = 50 + +def interrupt_callback(pipeline, i, t, callback_kwargs): + stop_idx = 10 + if i == stop_idx: + pipeline._interrupt = True + + return callback_kwargs + +pipeline( + "A photo of a cat", + num_inference_steps=num_inference_steps, + callback_on_step_end=interrupt_callback, +) +``` + +## Display image after each generation step + +!!! tip + + This tip was contributed by [asomoza](https://github.com/asomoza). + +Display an image after each generation step by accessing and converting the latents after each step into an image. The latent space is compressed to 128x128, so the images are also 128x128 which is useful for a quick preview. + +1. Use the function below to convert the SDXL latents (4 channels) to RGB tensors (3 channels) as explained in the [Explaining the SDXL latent space](https://huggingface.co/blog/TimothyAlexisVass/explaining-the-sdxl-latent-space) blog post. + +```py +def latents_to_rgb(latents): + weights = ( + (60, -60, 25, -70), + (60, -5, 15, -50), + (60, 10, -5, -35) + ) + + def einsum(tensor1, tensor2): + l, x, y = tensor1.shape[-3:] + l, r = tensor2.shape + res = ops.matmul(tensor2.transpose(1, 0), tensor1.view(*tensor1.shape[: -2], -1)).view(-1, r, x, y) + return res + + weights_tensor = ops.t(ms.Tensor(weights, dtype=latents.dtype)) + biases_tensor = ms.Tensor((150, 140, 130), dtype=latents.dtype) + rgb_tensor = einsum(latents, weights_tensor) + biases_tensor.unsqueeze(-1).unsqueeze(-1) + image_array = rgb_tensor.clamp(0, 255)[0].to(ms.uint8).asnumpy() + image_array = image_array.transpose(1, 2, 0) + + return Image.fromarray(image_array) +``` + +2. Create a function to decode and save the latents into an image. 
+ +```py +def decode_tensors(pipe, step, timestep, callback_kwargs): + latents = callback_kwargs["latents"] + + image = latents_to_rgb(latents) + image.save(f"{step}.png") + + return callback_kwargs +``` + +3. Pass the `decode_tensors` function to the `callback_on_step_end` parameter to decode the tensors after each step. You also need to specify what you want to modify in the `callback_on_step_end_tensor_inputs` parameter, which in this case are the latents. + +```py +from mindone.diffusers import DiffusionPipeline +import mindspore as ms +from PIL import Image + +pipeline = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + mindspore_dtype=ms.float16, + use_safetensors=True +) + +image = pipeline( + prompt="A croissant shaped like a cute bear.", + negative_prompt="Deformed, ugly, bad anatomy", + callback_on_step_end=decode_tensors, + callback_on_step_end_tensor_inputs=["latents"], +)[0][0] +``` + +
+
+ +
step 0
+
+
+ +
step 19 +
+
+
+ +
step 29
+
+
+ +
step 39
+
+
+ +
step 49
+
+
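+
+## Schedule the guidance scale
+
+Because the callback runs at the end of every step, you can also re-assign `pipeline._guidance_scale` continuously instead of switching it off once as in the [dynamic CFG](#dynamic-classifier-free-guidance) example. The snippet below is a minimal sketch (the linear schedule, the seed, and the variable names are illustrative, not part of the pipeline API); keeping the scale above 1.0 avoids the `prompt_embeds` batch handling from the earlier example:
+
+```python
+import mindspore as ms
+import numpy as np
+from mindone.diffusers import StableDiffusionPipeline
+
+pipeline = StableDiffusionPipeline.from_pretrained(
+    "stable-diffusion-v1-5/stable-diffusion-v1-5", mindspore_dtype=ms.float16
+)
+
+start_scale, end_scale = 7.5, 2.0
+
+def decay_cfg(pipe, step_index, timestep, callback_kwargs):
+    # linearly lower the guidance scale from `start_scale` to `end_scale` over the denoising loop
+    progress = (step_index + 1) / pipe.num_timesteps
+    pipe._guidance_scale = start_scale + (end_scale - start_scale) * progress
+    return callback_kwargs
+
+image = pipeline(
+    "a photo of an astronaut riding a horse on mars",
+    guidance_scale=start_scale,
+    generator=np.random.Generator(np.random.PCG64(1)),
+    callback_on_step_end=decay_cfg,
+)[0][0]
+image.save("decayed_cfg.png")
+```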
diff --git a/docs/diffusers/using-diffusers/conditional_image_generation.md b/docs/diffusers/using-diffusers/conditional_image_generation.md new file mode 100644 index 0000000000..668cb9110f --- /dev/null +++ b/docs/diffusers/using-diffusers/conditional_image_generation.md @@ -0,0 +1,272 @@ + + +# Text-to-image + +When you think of diffusion models, text-to-image is usually one of the first things that come to mind. Text-to-image generates an image from a text description (for example, "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k") which is also known as a *prompt*. + +From a very high level, a diffusion model takes a prompt and some random initial noise, and iteratively removes the noise to construct an image. The *denoising* process is guided by the prompt, and once the denoising process ends after a predetermined number of time steps, the image representation is decoded into an image. + +!!! tip + + Read the [How does Stable Diffusion work?](https://huggingface.co/blog/stable_diffusion#how-does-stable-diffusion-work) blog post to learn more about how a latent diffusion model works. + +1. Load a checkpoint into the [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline) class, which automatically detects the appropriate pipeline class to use based on the checkpoint: + +```py +from mindone.diffusers import DiffusionPipeline +import mindspore as ms + +pipeline = DiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", mindspore_dtype=ms.float16, variant="fp16" +) +``` + +2. Pass a prompt to the pipeline to generate an image: + +```py +image = pipeline( + "stained glass of darth vader, backlight, centered composition, masterpiece, photorealistic, 8k" +)[0][0] +image +``` + +
+ +
+ +## Popular models + +The most common text-to-image models are [Stable Diffusion v1.5](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5), [Stable Diffusion XL (SDXL)](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0), and [Kandinsky 2.2](https://huggingface.co/kandinsky-community/kandinsky-2-2-decoder). There are also ControlNet models or adapters that can be used with text-to-image models for more direct control in generating images. The results from each model are slightly different because of their architecture and training process, but no matter which model you choose, their usage is more or less the same. Let's use the same prompt for each model and compare their results. + +### Stable Diffusion v1.5 + +[Stable Diffusion v1.5](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) is a latent diffusion model initialized from [Stable Diffusion v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4), and finetuned for 595K steps on 512x512 images from the LAION-Aesthetics V2 dataset. You can use this model like: + +```py +from mindone.diffusers import DiffusionPipeline +import mindspore as ms +import numpy as np + +pipeline = DiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", mindspore_dtype=ms.float16, variant="fp16" +) +generator = np.random.Generator(np.random.PCG64(31)) +image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", generator=generator)[0][0] +image +``` + +### Stable Diffusion XL + +SDXL is a much larger version of the previous Stable Diffusion models, and involves a two-stage model process that adds even more details to an image. It also includes some additional *micro-conditionings* to generate high-quality images centered subjects. Take a look at the more comprehensive [SDXL](sdxl.md) guide to learn more about how to use it. In general, you can use SDXL like: + +```py +from mindone.diffusers import DiffusionPipeline +import mindspore as ms +import numpy as np + +pipeline = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16, variant="fp16" +) +generator = np.random.Generator(np.random.PCG64(31)) +image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", generator=generator)[0][0] +image +``` + +### Kandinsky 2.2 + +The Kandinsky model is a bit different from the Stable Diffusion models because it also uses an image prior model to create embeddings that are used to better align text and images in the diffusion model. + +The easiest way to use Kandinsky 2.2 is: + +```py +from mindone.diffusers import KandinskyV22CombinedPipeline +import mindspore as ms +import numpy as np + +pipeline = KandinskyV22CombinedPipeline.from_pretrained( + "kandinsky-community/kandinsky-2-2-decoder", mindspore_dtype=ms.float16 +) +generator = np.random.Generator(np.random.PCG64(31)) +image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", generator=generator)[0][0] +image +``` + +### ControlNet + +ControlNet models are auxiliary models or adapters that are finetuned on top of text-to-image models, such as [Stable Diffusion v1.5](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5). Using ControlNet models in combination with text-to-image models offers diverse options for more explicit control over how to generate an image. With ControlNet, you add an additional conditioning input image to the model. 
For example, if you provide an image of a human pose (usually represented as multiple keypoints that are connected into a skeleton) as a conditioning input, the model generates an image that follows the pose of the image. Check out the more in-depth [ControlNet](controlnet.md) guide to learn more about other conditioning inputs and how to use them. + +In this example, let's condition the ControlNet with a human pose estimation image. Load the ControlNet model pretrained on human pose estimations: + +```py +from mindone.diffusers import ControlNetModel, StableDiffusionControlNetPipeline +from mindone.diffusers.utils import load_image +import mindspore as ms +import numpy as np + +controlnet = ControlNetModel.from_pretrained( + "lllyasviel/control_v11p_sd15_openpose", mindspore_dtype=ms.float16, variant="fp16" +) +pose_image = load_image("https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/control.png") +``` + +Pass the `controlnet` to the [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline), and provide the prompt and pose estimation image: + +```py +pipeline = StableDiffusionControlNetPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", controlnet=controlnet, mindspore_dtype=ms.float16, variant="fp16" +) +generator = np.random.Generator(np.random.PCG64(31)) +image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", image=pose_image, generator=generator)[0][0] +image +``` + +
+
+ +
Stable Diffusion v1.5
+
+
+ +
Stable Diffusion XL
+
+
+ +
Kandinsky 2.2
+
+
+ +
ControlNet (pose conditioning)
+
+
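+
+If you want to reproduce a side-by-side comparison like the one above, reuse the same prompt and seed across checkpoints. The loop below is only a sketch covering the two Stable Diffusion checkpoints (Kandinsky 2.2 and ControlNet use their own pipeline classes, so they are left out), and loading several pipelines back-to-back can be memory-intensive:
+
+```py
+from mindone.diffusers import DiffusionPipeline
+import mindspore as ms
+import numpy as np
+
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+
+for repo_id in [
+    "stable-diffusion-v1-5/stable-diffusion-v1-5",
+    "stabilityai/stable-diffusion-xl-base-1.0",
+]:
+    pipeline = DiffusionPipeline.from_pretrained(repo_id, mindspore_dtype=ms.float16, variant="fp16")
+    # recreate the generator for each checkpoint so every run starts from the same seed
+    generator = np.random.Generator(np.random.PCG64(31))
+    image = pipeline(prompt, generator=generator)[0][0]
+    image.save(f"{repo_id.split('/')[-1]}.png")
+```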
+ +## Configure pipeline parameters + +There are a number of parameters that can be configured in the pipeline that affect how an image is generated. You can change the image's output size, specify a negative prompt to improve image quality, and more. This section dives deeper into how to use these parameters. + +### Height and width + +The `height` and `width` parameters control the height and width (in pixels) of the generated image. By default, the Stable Diffusion v1.5 model outputs 512x512 images, but you can change this to any size that is a multiple of 8. For example, to create a rectangular image: + +```py +from mindone.diffusers import DiffusionPipeline +import mindspore as ms + +pipeline = DiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", mindspore_dtype=ms.float16, variant="fp16" +) +image = pipeline( + "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", height=768, width=512 +)[0][0] +image +``` + +
+ +
+ +!!! warning + + Other models may have different default image sizes depending on the image sizes in the training dataset. For example, SDXL's default image size is 1024x1024 and using lower `height` and `width` values may result in lower quality images. Make sure you check the model's API reference first! + +### Guidance scale + +The `guidance_scale` parameter affects how much the prompt influences image generation. A lower value gives the model "creativity" to generate images that are more loosely related to the prompt. Higher `guidance_scale` values push the model to follow the prompt more closely, and if this value is too high, you may observe some artifacts in the generated image. + +```py +from mindone.diffusers import DiffusionPipeline +import mindspore as ms + +pipeline = DiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", mindspore_dtype=ms.float16 +) +image = pipeline( + "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", guidance_scale=3.5 +)[0][0] +image +``` + +
+
+ +
guidance_scale = 2.5
+
+
+ +
guidance_scale = 7.5
+
+
+ +
guidance_scale = 10.5
+
+
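+
+A quick way to get a feel for this parameter is to sweep `guidance_scale` with a fixed seed, so the guidance strength is the only thing that changes between runs. This is a sketch that reuses the `pipeline` from the snippet above; the scale values match the comparison images, while the seed is illustrative:
+
+```py
+import numpy as np
+
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+
+for scale in [2.5, 7.5, 10.5]:
+    generator = np.random.Generator(np.random.PCG64(31))  # same seed for every run
+    image = pipeline(prompt, guidance_scale=scale, generator=generator)[0][0]
+    image.save(f"guidance_{scale}.png")
+```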
+ +### Negative prompt + +Just like how a prompt guides generation, a *negative prompt* steers the model away from things you don't want the model to generate. This is commonly used to improve overall image quality by removing poor or bad image features such as "low resolution" or "bad details". You can also use a negative prompt to remove or modify the content and style of an image. + +```py +from mindone.diffusers import DiffusionPipeline +import mindspore as ms + +pipeline = DiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", mindspore_dtype=ms.float16 +) +image = pipeline( + prompt="Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", + negative_prompt="ugly, deformed, disfigured, poor details, bad anatomy", +)[0][0] +image +``` + +
+
+ +
negative_prompt = "ugly, deformed, disfigured, poor details, bad anatomy"
+
+
+ +
negative_prompt = "astronaut"
+
+
+ +### Generator + +A [`numpy.random.Generator`](https://numpy.org/doc/stable/reference/random/generator.html) object enables reproducibility in a pipeline by setting a manual seed. You can use a `Generator` to generate batches of images and iteratively improve on an image generated from a seed as detailed in the [Improve image quality with deterministic generation](reusing_seeds.md) guide. + +You can set a seed and `Generator` as shown below. Creating an image with a `Generator` should return the same result each time instead of randomly generating a new image. + +```py +from mindone.diffusers import DiffusionPipeline +import mindspore as ms +import numpy as np + +pipeline = DiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", mindspore_dtype=ms.float16 +) +generator = np.random.Generator(np.random.PCG64(30)) +image = pipeline( + "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", + generator=generator, +)[0][0] +image +``` + +## Control image generation + +There are several ways to exert more control over how an image is generated outside of configuring a pipeline's parameters, such ControlNet models. + +### ControlNet + +As you saw in the [ControlNet](#controlnet) section, these models offer a more flexible and accurate way to generate images by incorporating an additional conditioning image input. Each ControlNet model is pretrained on a particular type of conditioning image to generate new images that resemble it. For example, if you take a ControlNet model pretrained on depth maps, you can give the model a depth map as a conditioning input and it'll generate an image that preserves the spatial information in it. This is quicker and easier than specifying the depth information in a prompt. You can even combine multiple conditioning inputs with a [MultiControlNet](controlnet.md#multicontrolnet)! + +There are many types of conditioning inputs you can use, and 🤗 Diffusers supports ControlNet for Stable Diffusion and SDXL models. Take a look at the more comprehensive [ControlNet](controlnet.md) guide to learn how you can use these models. diff --git a/docs/diffusers/using-diffusers/controlling_generation.md b/docs/diffusers/using-diffusers/controlling_generation.md new file mode 100644 index 0000000000..7ea6439919 --- /dev/null +++ b/docs/diffusers/using-diffusers/controlling_generation.md @@ -0,0 +1,97 @@ + + +# Controlled generation + +Controlling outputs generated by diffusion models has been long pursued by the community and is now an active research topic. In many popular diffusion models, subtle changes in inputs, both images and text prompts, can drastically change outputs. In an ideal world we want to be able to control how semantics are preserved and changed. + +Most examples of preserving semantics reduce to being able to accurately map a change in input to a change in output. I.e. adding an adjective to a subject in a prompt preserves the entire image, only modifying the changed subject. Or, image variation of a particular subject preserves the subject's pose. + +Additionally, there are qualities of generated images that we would like to influence beyond semantic preservation. I.e. in general, we would like our outputs to be of good quality, adhere to a particular style, or be realistic. + +We will document some of the techniques `diffusers` supports to control generation of diffusion models. Much is cutting edge research and can be quite nuanced. 
+ +We provide a high level explanation of how the generation can be controlled as well as a snippet of the technicals. For more in depth explanations on the technicals, the original papers which are linked from the pipelines are always the best resources. + +Depending on the use case, one should choose a technique accordingly. In many cases, these techniques can be combined. + +Unless otherwise mentioned, these are techniques that work with existing models and don't require their own weights. + +1. [InstructPix2Pix](#instructpix2pix) +2. [Depth2Image](#depth2image) +3. [DreamBooth](#dreambooth) +4. [Textual Inversion](#textual-inversion) +5. [ControlNet](#controlnet) +6. [DiffEdit](#diffedit) +7. [T2I-Adapter](#t2i-adapter) + +For convenience, we provide a table to denote which methods are inference-only and which require fine-tuning/training. + +| **Method** | **Inference only** | **Requires training /
fine-tuning** | **Comments** | +|:--------------------------------------------------:| :----------------: | :-------------------------------------: | :---------------------------------------------------------------------------------------------: | +| [InstructPix2Pix](#instructpix2pix) | ✅ | ❌ | Can additionally be
fine-tuned for better
performance on specific
edit instructions. | +| [Depth2Image](#depth2image) | ✅ | ❌ | | +| [DreamBooth](#dreambooth) | ❌ | ✅ | | +| [Textual Inversion](#textual-inversion) | ❌ | ✅ | | +| [ControlNet](#controlnet) | ✅ | ❌ | A ControlNet can be
trained/fine-tuned on
a custom conditioning. | +| [DiffEdit](#diffedit) | ✅ | ❌ | | +| [T2I-Adapter](#t2i-adapter) | ✅ | ❌ | | + +## InstructPix2Pix + +[Paper](https://arxiv.org/abs/2211.09800) + +[InstructPix2Pix](../api/pipelines/pix2pix.md) is fine-tuned from Stable Diffusion to support editing input images. It takes as inputs an image and a prompt describing an edit, and it outputs the edited image. +InstructPix2Pix has been explicitly trained to work well with [InstructGPT](https://openai.com/blog/instruction-following/)-like prompts. + +## Depth2Image + +[Project](https://huggingface.co/stabilityai/stable-diffusion-2-depth) + +[Depth2Image](../api/pipelines/stable_diffusion/depth2img.md) is fine-tuned from Stable Diffusion to better preserve semantics for text guided image variation. + +It conditions on a monocular depth estimate of the original image. + +## DreamBooth + +[Project](https://dreambooth.github.io/) + +[DreamBooth](../training/dreambooth.md) fine-tunes a model to teach it about a new subject. I.e. a few pictures of a person can be used to generate images of that person in different styles. + +## Textual Inversion + +[Paper](https://arxiv.org/abs/2208.01618) + +[Textual Inversion](../training/text_inversion.md) fine-tunes a model to teach it about a new concept. I.e. a few pictures of a style of artwork can be used to generate images in that style. + +## ControlNet + +[Paper](https://arxiv.org/abs/2302.05543) + +[ControlNet](../api/pipelines/controlnet.md) is an auxiliary network which adds an extra condition. +There are 8 canonical pre-trained ControlNets trained on different conditionings such as edge detection, scribbles, +depth maps, and semantic segmentations. + +## DiffEdit + +[Paper](https://arxiv.org/abs/2210.11427) + +[DiffEdit](../api/pipelines/diffedit.md) allows for semantic editing of input images along with +input prompts while preserving the original input images as much as possible. + +## T2I-Adapter + +[Paper](https://arxiv.org/abs/2302.08453) + +[T2I-Adapter](../api/pipelines/stable_diffusion/adapter.md) is an auxiliary network which adds an extra condition. +There are 8 canonical pre-trained adapters trained on different conditionings such as edge detection, sketch, +depth maps, and semantic segmentations. diff --git a/docs/diffusers/using-diffusers/controlnet.md b/docs/diffusers/using-diffusers/controlnet.md new file mode 100644 index 0000000000..876f3b292d --- /dev/null +++ b/docs/diffusers/using-diffusers/controlnet.md @@ -0,0 +1,566 @@ + + +# ControlNet + +ControlNet is a type of model for controlling image diffusion models by conditioning the model with an additional input image. There are many types of conditioning inputs (canny edge, user sketching, human pose, depth, and more) you can use to control a diffusion model. This is hugely useful because it affords you greater control over image generation, making it easier to generate specific images without experimenting with different text prompts or denoising values as much. + +!!! tip + + Check out Section 3.5 of the [ControlNet](https://huggingface.co/papers/2302.05543) paper v1 for a list of ControlNet implementations on various conditioning inputs. You can find the official Stable Diffusion ControlNet conditioned models on [lllyasviel](https://huggingface.co/lllyasviel)'s Hub profile, and more [community-trained](https://huggingface.co/models?other=stable-diffusion&other=controlnet) ones on the Hub. 
+ + For Stable Diffusion XL (SDXL) ControlNet models, you can find them on the 🤗 [Diffusers](https://huggingface.co/diffusers) Hub organization, or you can browse [community-trained](https://huggingface.co/models?other=stable-diffusion-xl&other=controlnet) ones on the Hub. + +A ControlNet model has two sets of weights (or blocks) connected by a zero-convolution layer: + +- a *locked copy* keeps everything a large pretrained diffusion model has learned +- a *trainable copy* is trained on the additional conditioning input + +Since the locked copy preserves the pretrained model, training and implementing a ControlNet on a new conditioning input is as fast as finetuning any other model because you aren't training the model from scratch. + +This guide will show you how to use ControlNet for text-to-image, image-to-image, inpainting, and more! There are many types of ControlNet conditioning inputs to choose from, but in this guide we'll only focus on several of them. Feel free to experiment with other conditioning inputs! + +Before you begin, make sure you have the following libraries installed: + +```py +# uncomment to install the necessary libraries +#!pip install mindone transformers opencv-python +``` + +## Text-to-image + +For text-to-image, you normally pass a text prompt to the model. But with ControlNet, you can specify an additional conditioning input. Let's condition the model with a canny image, a white outline of an image on a black background. This way, the ControlNet can use the canny image as a control to guide the model to generate an image with the same outline. + +Load an image and use the [opencv-python](https://github.com/opencv/opencv-python) library to extract the canny image: + +```py +from mindone.diffusers.utils import load_image, make_image_grid +from PIL import Image +import cv2 +import numpy as np + +original_image = load_image( + "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png" +) + +image = np.array(original_image) + +low_threshold = 100 +high_threshold = 200 + +image = cv2.Canny(image, low_threshold, high_threshold) +image = image[:, :, None] +image = np.concatenate([image, image, image], axis=2) +canny_image = Image.fromarray(image) +``` + +
+
+ +
original image
+
+
+ +
canny image
+
+
+ +Next, load a ControlNet model conditioned on canny edge detection and pass it to the [`StableDiffusionControlNetPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/controlnet#mindone.diffusers.StableDiffusionControlNetPipeline). Use the faster [`UniPCMultistepScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/unipc/#mindone.diffusers.UniPCMultistepScheduler) and enable model offloading to speed up inference and reduce memory usage. + +```py +from mindone.diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler +import mindspore as ms + +controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", mindspore_dtype=ms.float16, use_safetensors=True) +pipe = StableDiffusionControlNetPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", controlnet=controlnet, mindspore_dtype=ms.float16, use_safetensors=True +) + +pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config) +``` + +Now pass your prompt and canny image to the pipeline: + +```py +output = pipe( + "the mona lisa", image=canny_image +)[0][0] +make_image_grid([original_image, canny_image, output], rows=1, cols=3) +``` + +
+ +
+ +## Image-to-image + +!!! warning + + ⚠️ MindONE currently does not support the full process for extracting the depth map, as MindONE does not yet support depth-estimation [~transformers.Pipeline] from mindone.transformers. Therefore, you need to prepare the depth map in advance to continue the process. + +For image-to-image, you'd typically pass an initial image and a prompt to the pipeline to generate a new image. With ControlNet, you can pass an additional conditioning input to guide the model. Let's condition the model with a depth map, an image which contains spatial information. This way, the ControlNet can use the depth map as a control to guide the model to generate an image that preserves spatial information. + +You'll use the [`StableDiffusionControlNetImg2ImgPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/controlnet#mindone.diffusers.StableDiffusionControlNetImg2ImgPipeline) for this task, which is different from the [`StableDiffusionControlNetPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/controlnet#mindone.diffusers.StableDiffusionControlNetPipeline) because it allows you to pass an initial image as the starting point for the image generation process. + +You can process and retrieve the depth map you prepared in advance: + +```py +import mindspore as ms +import numpy as np + +from mindone.diffusers.utils import load_image, make_image_grid + +image = load_image( + "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-img2img.jpg" +) + +def make_hint(depth_image): + image = depth_image[:, :, None] + image = np.concatenate([image, image, image], axis=2) + detected_map = ms.Tensor.from_numpy(image).float() / 255.0 + hint = detected_map.permute(2, 0, 1) + return hint + +hint = make_hint(depth_image).unsqueeze(0).half() +``` + +Next, load a ControlNet model conditioned on depth maps and pass it to the [`StableDiffusionControlNetImg2ImgPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/controlnet#mindone.diffusers.StableDiffusionControlNetImg2ImgPipeline). Use the faster [`UniPCMultistepScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/unipc/#mindone.diffusers.UniPCMultistepScheduler) and enable model offloading to speed up inference and reduce memory usage. + +```py +from mindone.diffusers import StableDiffusionControlNetImg2ImgPipeline, ControlNetModel, UniPCMultistepScheduler +import mindspore as ms + +controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11f1p_sd15_depth", mindspore_dtype=ms.float16, use_safetensors=True) +pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", controlnet=controlnet, mindspore_dtype=ms.float16, use_safetensors=True +) + +pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config) +``` + +Now pass your prompt, initial image, and depth map to the pipeline: + +```py +output = pipe( + "lego batman and robin", image=image, control_image=depth_map, +)[0][0] +make_image_grid([image, output], rows=1, cols=2) +``` + +
+
+ +
original image
+
+
+ +
generated image
+
+
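+
+Because the depth map has to be prepared in advance here, the variable names in the snippets above are placeholders: the tensor returned by `make_hint` is what gets passed as `control_image`. A minimal sketch tying the pieces together, assuming `depth_image` is the single-channel depth map you prepared:
+
+```py
+control_image = make_hint(depth_image).unsqueeze(0).half()
+
+output = pipe(
+    "lego batman and robin", image=image, control_image=control_image,
+)[0][0]
+```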
+ +## Inpainting + +For inpainting, you need an initial image, a mask image, and a prompt describing what to replace the mask with. ControlNet models allow you to add another control image to condition a model with. Let’s condition the model with an inpainting mask. This way, the ControlNet can use the inpainting mask as a control to guide the model to generate an image within the mask area. + +Load an initial image and a mask image: + +```py +from mindone.diffusers.utils import load_image, make_image_grid + +init_image = load_image( + "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-inpaint.jpg" +) +init_image = init_image.resize((512, 512)) + +mask_image = load_image( + "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-inpaint-mask.jpg" +) +mask_image = mask_image.resize((512, 512)) +make_image_grid([init_image, mask_image], rows=1, cols=2) +``` + +Create a function to prepare the control image from the initial and mask images. This'll create a tensor to mark the pixels in `init_image` as masked if the corresponding pixel in `mask_image` is over a certain threshold. + +```py +import numpy as np +import mindspore as ms + +def make_inpaint_condition(image, image_mask): + image = np.array(image.convert("RGB")).astype(np.float32) / 255.0 + image_mask = np.array(image_mask.convert("L")).astype(np.float32) / 255.0 + + assert image.shape[0:1] == image_mask.shape[0:1] + image[image_mask > 0.5] = -1.0 # set as masked pixel + image = np.expand_dims(image, 0).transpose(0, 3, 1, 2) + image = ms.Tensor.from_numpy(image) + return image + +control_image = make_inpaint_condition(init_image, mask_image) +``` + +
+
+ +
original image
+
+
+ +
mask image
+
+
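+
+If you want to double-check the conditioning tensor before running the pipeline, a quick sanity check could look like this (the expected values assume the 512x512 resize above; masked pixels are set to -1.0 by `make_inpaint_condition`):
+
+```py
+print(control_image.shape)            # (1, 3, 512, 512)
+print(control_image.asnumpy().min())  # -1.0 marks the masked region
+print(control_image.asnumpy().max())  # unmasked pixels stay in [0, 1]
+```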
+ +Load a ControlNet model conditioned on inpainting and pass it to the [`StableDiffusionControlNetInpaintPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/controlnet/#mindone.diffusers.StableDiffusionControlNetInpaintPipeline). Use the faster [`UniPCMultistepScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/unipc/#mindone.diffusers.UniPCMultistepScheduler) and enable model offloading to speed up inference and reduce memory usage. + +```py +from mindone.diffusers import StableDiffusionControlNetInpaintPipeline, ControlNetModel, UniPCMultistepScheduler + +controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_inpaint", mindspore_dtype=ms.float16, use_safetensors=True) +pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", controlnet=controlnet, mindspore_dtype=ms.float16, use_safetensors=True +) + +pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config) +``` + +Now pass your prompt, initial image, mask image, and control image to the pipeline: + +```py +output = pipe( + "corgi face with large ears, detailed, pixar, animated, disney", + num_inference_steps=20, + eta=1.0, + image=init_image, + mask_image=mask_image, + control_image=control_image, +)[0][0] +make_image_grid([init_image, mask_image, output], rows=1, cols=3) +``` + +
+ +
+ +## Guess mode + +[Guess mode](https://github.com/lllyasviel/ControlNet/discussions/188) does not require supplying a prompt to a ControlNet at all! This forces the ControlNet encoder to do it's best to "guess" the contents of the input control map (depth map, pose estimation, canny edge, etc.). + +Guess mode adjusts the scale of the output residuals from a ControlNet by a fixed ratio depending on the block depth. The shallowest `DownBlock` corresponds to 0.1, and as the blocks get deeper, the scale increases exponentially such that the scale of the `MidBlock` output becomes 1.0. + +!!! tip + + Guess mode does not have any impact on prompt conditioning and you can still provide a prompt if you want. + +Set `guess_mode=True` in the pipeline, and it is [recommended](https://github.com/lllyasviel/ControlNet#guess-mode--non-prompt-mode) to set the `guidance_scale` value between 3.0 and 5.0. + +```py +from mindone.diffusers import StableDiffusionControlNetPipeline, ControlNetModel +from mindone.diffusers.utils import load_image, make_image_grid +import numpy as np +import mindspore as ms +from PIL import Image +import cv2 + +controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", use_safetensors=True) +pipe = StableDiffusionControlNetPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", controlnet=controlnet, use_safetensors=True) + +original_image = load_image("https://huggingface.co/takuma104/controlnet_dev/resolve/main/bird_512x512.png") + +image = np.array(original_image) + +low_threshold = 100 +high_threshold = 200 + +image = cv2.Canny(image, low_threshold, high_threshold) +image = image[:, :, None] +image = np.concatenate([image, image, image], axis=2) +canny_image = Image.fromarray(image) + +image = pipe("", image=canny_image, guess_mode=True, guidance_scale=3.0)[0][0] +make_image_grid([original_image, canny_image, image], rows=1, cols=3) +``` + +
+
+ +
regular mode with prompt
+
+
+ +
guess mode without prompt
+
+
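+
+The block-depth scaling described at the start of this section is a logarithmic ramp from 0.1 up to 1.0. The snippet below only illustrates that ramp; it mirrors how the upstream Diffusers ControlNet computes the scales in guess mode, and the exact number of residuals depends on the model:
+
+```py
+import numpy as np
+
+# 12 down-block residuals + 1 mid-block residual -> 13 scales
+scales = np.logspace(-1, 0, 13)
+print(scales.round(3))
+# [0.1   0.121 0.147 0.178 0.215 0.261 0.316 0.383 0.464 0.562 0.681 0.825 1.   ]
+```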
+ +## ControlNet with Stable Diffusion XL + +There aren't too many ControlNet models compatible with Stable Diffusion XL (SDXL) at the moment, but diffusers have trained two full-sized ControlNet models for SDXL conditioned on canny edge detection and depth maps. We're also experimenting with creating smaller versions of these SDXL-compatible ControlNet models so it is easier to run on resource-constrained hardware. You can find these checkpoints on the [🤗 Diffusers Hub organization](https://huggingface.co/diffusers)! + +Let's use a SDXL ControlNet conditioned on canny images to generate an image. Start by loading an image and prepare the canny image: + +```py +from mindone.diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL +from mindone.diffusers.utils import load_image, make_image_grid +from PIL import Image +import cv2 +import numpy as np +import mindspore as ms + +original_image = load_image( + "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png" +) + +image = np.array(original_image) + +low_threshold = 100 +high_threshold = 200 + +image = cv2.Canny(image, low_threshold, high_threshold) +image = image[:, :, None] +image = np.concatenate([image, image, image], axis=2) +canny_image = Image.fromarray(image) +make_image_grid([original_image, canny_image], rows=1, cols=2) +``` + +
+
+ +
original image
+
+
+ +
canny image
+
+
+ +Load a SDXL ControlNet model conditioned on canny edge detection and pass it to the [`StableDiffusionXLControlNetPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/controlnet_sdxl/#mindone.diffusers.StableDiffusionXLControlNetPipeline). You can also enable model offloading to reduce memory usage. + +```py +controlnet = ControlNetModel.from_pretrained( + "diffusers/controlnet-canny-sdxl-1.0", + mindspore_dtype=ms.float16, + use_safetensors=True +) +vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", mindspore_dtype=ms.float16, use_safetensors=True) +pipe = StableDiffusionXLControlNetPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + controlnet=controlnet, + vae=vae, + mindspore_dtype=ms.float16, + use_safetensors=True +) +``` + +Now pass your prompt (and optionally a negative prompt if you're using one) and canny image to the pipeline: + +!!! tip + + The [`controlnet_conditioning_scale`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/controlnet/#mindone.diffusers.StableDiffusionControlNetPipeline.__call__.controlnet_conditioning_scale) parameter determines how much weight to assign to the conditioning inputs. A value of 0.5 is recommended for good generalization, but feel free to experiment with this number! + +```py +prompt = "aerial view, a futuristic research complex in a bright foggy jungle, hard lighting" +negative_prompt = 'low quality, bad quality, sketches' + +image = pipe( + prompt, + negative_prompt=negative_prompt, + image=canny_image, + controlnet_conditioning_scale=0.5, +)[0][0] +make_image_grid([original_image, canny_image, image], rows=1, cols=3) +``` + +
+ +
+ +You can use [`StableDiffusionXLControlNetPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/controlnet_sdxl/#mindone.diffusers.StableDiffusionXLControlNetPipeline) in guess mode as well by setting the parameter to `True`: + +```py +from mindone.diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL +from mindone.diffusers.utils import load_image, make_image_grid +import numpy as np +import mindspore as ms +import cv2 +from PIL import Image + +prompt = "aerial view, a futuristic research complex in a bright foggy jungle, hard lighting" +negative_prompt = "low quality, bad quality, sketches" + +original_image = load_image( + "https://hf.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png" +) + +controlnet = ControlNetModel.from_pretrained( + "diffusers/controlnet-canny-sdxl-1.0", mindspore_dtype=ms.float16, use_safetensors=True +) +vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", mindspore_dtype=ms.float16, use_safetensors=True) +pipe = StableDiffusionXLControlNetPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnet, vae=vae, mindspore_dtype=ms.float16, use_safetensors=True +) + +image = np.array(original_image) +image = cv2.Canny(image, 100, 200) +image = image[:, :, None] +image = np.concatenate([image, image, image], axis=2) +canny_image = Image.fromarray(image) + +image = pipe( + prompt, negative_prompt=negative_prompt, controlnet_conditioning_scale=0.5, image=canny_image, guess_mode=True, +)[0][0] +make_image_grid([original_image, canny_image, image], rows=1, cols=3) +``` + +!!! tip + + You can use a refiner model with `StableDiffusionXLControlNetPipeline` to improve image quality, just like you can with a regular `StableDiffusionXLPipeline`. + See the [Refine image quality](./sdxl.md#refine-image-quality) section to learn how to use the refiner model. + Make sure to use `StableDiffusionXLControlNetPipeline` and pass `image` and `controlnet_conditioning_scale`. + + ```py + base = StableDiffusionXLControlNetPipeline(...) + image = base( + prompt=prompt, + controlnet_conditioning_scale=0.5, + image=canny_image, + num_inference_steps=40, + denoising_end=0.8, + output_type="latent", + )[0] + # rest exactly as with StableDiffusionXLPipeline + ``` + +## MultiControlNet + +!!! warning + + ⚠️ MindONE currently does not support the full process for human pose estimation, as MindONE does not yet support `OpenposeDetector` from controlnet_aux. Therefore, you need to prepare the `human pose image` in advance to continue the process. + +!!! tip + + Replace the SDXL model with a model like [stable-diffusion-v1-5/stable-diffusion-v1-5](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) to use multiple conditioning inputs with Stable Diffusion models. + +You can compose multiple ControlNet conditionings from different image inputs to create a *MultiControlNet*. To get better results, it is often helpful to: + +1. mask conditionings such that they don't overlap (for example, mask the area of a canny image where the pose conditioning is located) +2. experiment with the [`controlnet_conditioning_scale`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/controlnet/#mindone.diffusers.StableDiffusionControlNetPipeline) parameter to determine how much weight to assign to each conditioning input + +In this example, you'll combine a canny image and a human pose estimation image to generate a new image. 
+ +Prepare the canny image conditioning: + +```py +from mindone.diffusers.utils import load_image, make_image_grid +from PIL import Image +import numpy as np +import cv2 + +original_image = load_image( + "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/landscape.png" +) +image = np.array(original_image) + +low_threshold = 100 +high_threshold = 200 + +image = cv2.Canny(image, low_threshold, high_threshold) + +# zero out middle columns of image where pose will be overlaid +zero_start = image.shape[1] // 4 +zero_end = zero_start + image.shape[1] // 2 +image[:, zero_start:zero_end] = 0 + +image = image[:, :, None] +image = np.concatenate([image, image, image], axis=2) +canny_image = Image.fromarray(image) +make_image_grid([original_image, canny_image], rows=1, cols=2) +``` + +
+
+ +
original image
+
+
+ +
canny image
+
+
+ +For human pose estimation, prepare the human pose estimation conditioning: + +```py +original_image = load_image( + "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/person.png" +) +openpose_image = load_image("path/to/openpose_image") +make_image_grid([original_image, openpose_image], rows=1, cols=2) +``` + +
+
+ +
original image
+
+
+ +
human pose image
+
+
+ +Load a list of ControlNet models that correspond to each conditioning, and pass them to the [`StableDiffusionXLControlNetPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/controlnet_sdxl/#mindone.diffusers.StableDiffusionXLControlNetPipeline). Use the faster [`UniPCMultistepScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/unipc/#mindone.diffusers.UniPCMultistepScheduler) and enable model offloading to reduce memory usage. + +```py +from mindone.diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL, UniPCMultistepScheduler +import mindspore as ms +import numpy as np + +controlnets = [ + ControlNetModel.from_pretrained( + "thibaud/controlnet-openpose-sdxl-1.0", mindspore_dtype=ms.float16 + ), + ControlNetModel.from_pretrained( + "diffusers/controlnet-canny-sdxl-1.0", mindspore_dtype=ms.float16, use_safetensors=True + ), +] + +vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", mindspore_dtype=ms.float16, use_safetensors=True) +pipe = StableDiffusionXLControlNetPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnets, vae=vae, mindspore_dtype=ms.float16, use_safetensors=True +) +pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config) +``` + +Now you can pass your prompt (an optional negative prompt if you're using one), canny image, and pose image to the pipeline: + +```py +prompt = "a giant standing in a fantasy landscape, best quality" +negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality" + +generator = generator = np.random.Generator(np.random.PCG64(1)) + +images = [openpose_image.resize((1024, 1024)), canny_image.resize((1024, 1024))] + +images = pipe( + prompt, + image=images, + num_inference_steps=25, + generator=generator, + negative_prompt=negative_prompt, + num_images_per_prompt=3, + controlnet_conditioning_scale=[1.0, 0.8], +)[0] +make_image_grid([original_image, canny_image, openpose_image, + images[0].resize((512, 512)), images[1].resize((512, 512)), images[2].resize((512, 512))], rows=2, cols=3) +``` + +
+ +
diff --git a/docs/diffusers/using-diffusers/depth2img.md b/docs/diffusers/using-diffusers/depth2img.md new file mode 100644 index 0000000000..58592baade --- /dev/null +++ b/docs/diffusers/using-diffusers/depth2img.md @@ -0,0 +1,44 @@ + + +# Text-guided depth-to-image generation + +The [`StableDiffusionDepth2ImgPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/depth2img/#mindone.diffusers.StableDiffusionDepth2ImgPipeline) lets you pass a text prompt and an initial image to condition the generation of new images. In addition, you can also pass a `depth_map` to preserve the image structure. If no `depth_map` is provided, the pipeline automatically predicts the depth via an integrated [depth-estimation model](https://github.com/isl-org/MiDaS). + +Start by creating an instance of the [`StableDiffusionDepth2ImgPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/depth2img/#mindone.diffusers.StableDiffusionDepth2ImgPipeline): + +```python +import mindspore as ms +from mindone.diffusers import StableDiffusionDepth2ImgPipeline +from mindone.diffusers.utils import load_image, make_image_grid + +pipeline = StableDiffusionDepth2ImgPipeline.from_pretrained( + "stabilityai/stable-diffusion-2-depth", + mindspore_dtype=ms.float16, + use_safetensors=True, +) +``` + +Now pass your prompt to the pipeline. You can also pass a `negative_prompt` to prevent certain words from guiding how an image is generated: + +```python +url = "http://images.cocodataset.org/val2017/000000039769.jpg" +init_image = load_image(url) +prompt = "two tigers" +negative_prompt = "bad, deformed, ugly, bad anatomy" +image = pipeline(prompt=prompt, image=init_image, negative_prompt=negative_prompt, strength=0.7)[0][0] +make_image_grid([init_image, image], rows=1, cols=2) +``` + +| Input | Output | +|:-------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------:| +| | | diff --git a/docs/diffusers/using-diffusers/diffedit.md b/docs/diffusers/using-diffusers/diffedit.md new file mode 100644 index 0000000000..65873d1ec7 --- /dev/null +++ b/docs/diffusers/using-diffusers/diffedit.md @@ -0,0 +1,108 @@ + + +# DiffEdit + +Image editing typically requires providing a mask of the area to be edited. DiffEdit automatically generates the mask for you based on a text query, making it easier overall to create a mask without image editing software. The DiffEdit algorithm works in three steps: + +1. the diffusion model denoises an image conditioned on some query text and reference text which produces different noise estimates for different areas of the image; the difference is used to infer a mask to identify which area of the image needs to be changed to match the query text +2. the input image is encoded into latent space with DDIM +3. the latents are decoded with the diffusion model conditioned on the text query, using the mask as a guide such that pixels outside the mask remain the same as in the input image + +This guide will show you how to use DiffEdit to edit images without manually creating a mask. 
+ +Before you begin, make sure you have the following libraries installed: + +```py +# uncomment to install the necessary libraries +#!pip install mindone transformers +``` + +The [`StableDiffusionDiffEditPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/diffedit/#mindone.diffusers.StableDiffusionDiffEditPipeline) requires an image mask and a set of partially inverted latents. The image mask is generated from the [`generate_mask`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/diffedit/#mindone.diffusers.StableDiffusionDiffEditPipeline.generate_mask) function, and includes two parameters, `source_prompt` and `target_prompt`. These parameters determine what to edit in the image. For example, if you want to change a bowl of *fruits* to a bowl of *pears*, then: + +```py +source_prompt = "a bowl of fruits" +target_prompt = "a bowl of pears" +``` + +The partially inverted latents are generated from the [`invert`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/diffedit/#mindone.diffusers.StableDiffusionDiffEditPipeline.invert) function, and it is generally a good idea to include a `prompt` or *caption* describing the image to help guide the inverse latent sampling process. The caption can often be your `source_prompt`, but feel free to experiment with other text descriptions! + +Let's load the pipeline, scheduler, inverse scheduler, and enable some optimizations to reduce memory usage: + +```py +import mindspore as ms +from mindone.diffusers import DDIMScheduler, DDIMInverseScheduler, StableDiffusionDiffEditPipeline + +pipeline = StableDiffusionDiffEditPipeline.from_pretrained( + "stabilityai/stable-diffusion-2-1", + mindspore_dtype=ms.float16, + safety_checker=None, + use_safetensors=True, +) +pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config) +pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config) +``` + +Load the image to edit: + +```py +from mindone.diffusers.utils import load_image, make_image_grid + +img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png" +raw_image = load_image(img_url).resize((768, 768)) +raw_image +``` + +Use the [`generate_mask`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/diffedit/#mindone.diffusers.StableDiffusionDiffEditPipeline.generate_mask) function to generate the image mask. You'll need to pass it the `source_prompt` and `target_prompt` to specify what to edit in the image: + +```py +from PIL import Image + +source_prompt = "a bowl of fruits" +target_prompt = "a basket of pears" +mask_image = pipeline.generate_mask( + image=raw_image, + source_prompt=source_prompt, + target_prompt=target_prompt, +) +Image.fromarray((mask_image.squeeze()*255).astype("uint8"), "L").resize((768, 768)) +``` + +Next, create the inverted latents and pass it a caption describing the image: + +```py +inv_latents = pipeline.invert(prompt=source_prompt, image=raw_image)[0] +``` + +Finally, pass the image mask and inverted latents to the pipeline. The `target_prompt` becomes the `prompt` now, and the `source_prompt` is used as the `negative_prompt`: + +```py +output_image = pipeline( + prompt=target_prompt, + mask_image=mask_image, + image_latents=inv_latents, + negative_prompt=source_prompt, +)[0][0] +mask_image = Image.fromarray((mask_image.squeeze()*255).astype("uint8"), "L").resize((768, 768)) +make_image_grid([raw_image, mask_image, output_image], rows=1, cols=3) +``` + +
+ *Figure: original image (left), edited image (right)*
diff --git a/docs/diffusers/using-diffusers/img2img.md b/docs/diffusers/using-diffusers/img2img.md new file mode 100644 index 0000000000..929b2038e6 --- /dev/null +++ b/docs/diffusers/using-diffusers/img2img.md @@ -0,0 +1,376 @@ + + +# Image-to-image + +Image-to-image is similar to [text-to-image](conditional_image_generation.md), but in addition to a prompt, you can also pass an initial image as a starting point for the diffusion process. The initial image is encoded to latent space and noise is added to it. Then the latent diffusion model takes a prompt and the noisy latent image, predicts the added noise, and removes the predicted noise from the initial latent image to get the new latent image. Lastly, a decoder decodes the new latent image back into an image. + +With 🤗 Diffusers, this is as easy as 1-2-3: + +1. Load a checkpoint into the [`KandinskyV22Img2ImgCombinedPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/kandinsky_v22/#mindone.diffusers.KandinskyV22Img2ImgCombinedPipeline) class: + +```py +import mindspore as ms +from mindone.diffusers import KandinskyV22Img2ImgCombinedPipeline +from mindone.diffusers.utils import load_image, make_image_grid + +pipeline = KandinskyV22Img2ImgCombinedPipeline.from_pretrained( + "kandinsky-community/kandinsky-2-2-decoder", mindspore_dtype=ms.float16, use_safetensors=True +) +``` + +2. Load an image to pass to the pipeline: + +```py +init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png") +``` + +3. Pass a prompt and image to the pipeline to generate an image: + +```py +prompt = "cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k" +image = pipeline(prompt, image=init_image)[0][0] +make_image_grid([init_image, image], rows=1, cols=2) +``` + +
+ *Figure: initial image (left), generated image (right)*
+ +## Popular models + +The most popular image-to-image models are [Stable Diffusion v1.5](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5), [Stable Diffusion XL (SDXL)](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0), and [Kandinsky 2.2](https://huggingface.co/kandinsky-community/kandinsky-2-2-decoder). The results from the Stable Diffusion and Kandinsky models vary due to their architecture differences and training process; you can generally expect SDXL to produce higher quality images than Stable Diffusion v1.5. Let's take a quick look at how to use each of these models and compare their results. + +### Stable Diffusion v1.5 + +Stable Diffusion v1.5 is a latent diffusion model initialized from an earlier checkpoint, and further finetuned for 595K steps on 512x512 images. To use this pipeline for image-to-image, you'll need to prepare an initial image to pass to the pipeline. Then you can pass a prompt and the image to the pipeline to generate a new image: + +```py +import mindspore as ms +from mindone.diffusers import StableDiffusionImg2ImgPipeline +from mindone.diffusers.utils import make_image_grid, load_image + +pipeline = StableDiffusionImg2ImgPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", mindspore_dtype=ms.float16, variant="fp16", use_safetensors=True +) + +# prepare image +url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png" +init_image = load_image(url) + +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" + +# pass prompt and image to pipeline +image = pipeline(prompt, image=init_image)[0][0] +make_image_grid([init_image, image], rows=1, cols=2) +``` + +
+ *Figure: initial image (left), generated image (right)*
+ +### Stable Diffusion XL (SDXL) + +SDXL is a more powerful version of the Stable Diffusion model. It uses a larger base model, and an additional refiner model to increase the quality of the base model's output. Read the [SDXL](sdxl.md) guide for a more detailed walkthrough of how to use this model, and other techniques it uses to produce high quality images. + +```py +import mindspore as ms +from mindone.diffusers import DiffusionPipeline +from mindone.diffusers.utils import make_image_grid, load_image + +pipeline = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-refiner-1.0", mindspore_dtype=ms.float16, variant="fp16", use_safetensors=True +) + +# prepare image +url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-sdxl-init.png" +init_image = load_image(url) + +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" + +# pass prompt and image to pipeline +image = pipeline(prompt, image=init_image, strength=0.5)[0][0] +make_image_grid([init_image, image], rows=1, cols=2) +``` + +
+ *Figure: initial image (left), generated image (right)*
+ +### Kandinsky 2.2 + +The Kandinsky model is different from the Stable Diffusion models because it uses an image prior model to create image embeddings. The embeddings help create a better alignment between text and images, allowing the latent diffusion model to generate better images. + +The simplest way to use Kandinsky 2.2 is: + +```py +import mindspore as ms +from mindone.diffusers import KandinskyV22Img2ImgCombinedPipeline +from mindone.diffusers.utils import make_image_grid, load_image + +pipeline = KandinskyV22Img2ImgCombinedPipeline.from_pretrained( + "kandinsky-community/kandinsky-2-2-decoder", mindspore_dtype=ms.float16, use_safetensors=True +) + +# prepare image +url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png" +init_image = load_image(url) + +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" + +# pass prompt and image to pipeline +image = pipeline(prompt, image=init_image)[0][0] +make_image_grid([init_image, image], rows=1, cols=2) +``` + +
+ *Figure: initial image (left), generated image (right)*
+ +## Configure pipeline parameters + +There are several important parameters you can configure in the pipeline that'll affect the image generation process and image quality. Let's take a closer look at what these parameters do and how changing them affects the output. + +### Strength + +`strength` is one of the most important parameters to consider and it'll have a huge impact on your generated image. It determines how much the generated image resembles the initial image. In other words: + +- 📈 a higher `strength` value gives the model more "creativity" to generate an image that's different from the initial image; a `strength` value of 1.0 means the initial image is more or less ignored +- 📉 a lower `strength` value means the generated image is more similar to the initial image + +The `strength` and `num_inference_steps` parameters are related because `strength` determines the number of noise steps to add. For example, if the `num_inference_steps` is 50 and `strength` is 0.8, then this means adding 40 (50 * 0.8) steps of noise to the initial image and then denoising for 40 steps to get the newly generated image. + +```py +import mindspore as ms +from mindone.diffusers import StableDiffusionImg2ImgPipeline +from mindone.diffusers.utils import make_image_grid, load_image + +pipeline = StableDiffusionImg2ImgPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", mindspore_dtype=ms.float16, variant="fp16", use_safetensors=True +) + +# prepare image +url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png" +init_image = load_image(url) + +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" + +# pass prompt and image to pipeline +image = pipeline(prompt, image=init_image, strength=0.8)[0][0] +make_image_grid([init_image, image], rows=1, cols=2) +``` + +
+ *Figure: results with `strength = 0.4`, `strength = 0.6`, and `strength = 1.0`*
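+
+To reproduce a comparison like the one above, you can sweep over a few `strength` values, reusing the `pipeline`, `prompt`, and `init_image` from the previous snippet (a quick sketch):
+
+```py
+# compare several strength values side by side (reuses pipeline, prompt, init_image from above)
+strengths = [0.4, 0.6, 1.0]
+images = [pipeline(prompt, image=init_image, strength=s)[0][0] for s in strengths]
+make_image_grid([init_image, *images], rows=1, cols=len(images) + 1)
+```
+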
+ +### Guidance scale + +The `guidance_scale` parameter is used to control how closely aligned the generated image and text prompt are. A higher `guidance_scale` value means your generated image is more aligned with the prompt, while a lower `guidance_scale` value means your generated image has more space to deviate from the prompt. + +You can combine `guidance_scale` with `strength` for even more precise control over how expressive the model is. For example, combine a high `strength + guidance_scale` for maximum creativity or use a combination of low `strength` and low `guidance_scale` to generate an image that resembles the initial image but is not as strictly bound to the prompt. + +```py +import mindspore as ms +from mindone.diffusers import StableDiffusionImg2ImgPipeline +from mindone.diffusers.utils import make_image_grid, load_image + +pipeline = StableDiffusionImg2ImgPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", mindspore_dtype=ms.float16, variant="fp16", use_safetensors=True +) + +# prepare image +url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png" +init_image = load_image(url) + +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" + +# pass prompt and image to pipeline +image = pipeline(prompt, image=init_image, guidance_scale=8.0)[0][0] +make_image_grid([init_image, image], rows=1, cols=2) +``` + +
+ *Figure: results with `guidance_scale = 0.1`, `guidance_scale = 5.0`, and `guidance_scale = 10.0`*
+ +### Negative prompt + +A negative prompt conditions the model to *not* include things in an image, and it can be used to improve image quality or modify an image. For example, you can improve image quality by including negative prompts like "poor details" or "blurry" to encourage the model to generate a higher quality image. Or you can modify an image by specifying things to exclude from an image. + +```py +import mindspore as ms +from mindone.diffusers import DiffusionPipeline +from mindone.diffusers.utils import make_image_grid, load_image + +pipeline = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-refiner-1.0", mindspore_dtype=ms.float16, variant="fp16", use_safetensors=True +) + +# prepare image +url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png" +init_image = load_image(url) + +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" +negative_prompt = "ugly, deformed, disfigured, poor details, bad anatomy" + +# pass prompt and image to pipeline +image = pipeline(prompt, negative_prompt=negative_prompt, image=init_image)[0][0] +make_image_grid([init_image, image], rows=1, cols=2) +``` + +
+ *Figure: results with `negative_prompt = "ugly, deformed, disfigured, poor details, bad anatomy"` (left) and `negative_prompt = "jungle"` (right)*
+ +## Chained image-to-image pipelines + +There are some other interesting ways you can use an image-to-image pipeline aside from just generating an image (although that is pretty cool too). You can take it a step further and chain it with other pipelines. + +### Text-to-image-to-image + +Chaining a text-to-image and image-to-image pipeline allows you to generate an image from text and use the generated image as the initial image for the image-to-image pipeline. This is useful if you want to generate an image entirely from scratch. For example, let's chain a Stable Diffusion and a Kandinsky model. + +Start by generating an image with the text-to-image pipeline: + +```py +from mindone.diffusers import DiffusionPipeline, KandinskyV22Img2ImgCombinedPipeline +import mindspore as ms +from mindone.diffusers.utils import make_image_grid + +pipeline = DiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", mindspore_dtype=ms.float16, variant="fp16", use_safetensors=True +) + +text2image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k")[0][0] +text2image +``` + +Now you can pass this generated image to the image-to-image pipeline: + +```py +pipeline = KandinskyV22Img2ImgCombinedPipeline.from_pretrained( + "kandinsky-community/kandinsky-2-2-decoder", mindspore_dtype=ms.float16, use_safetensors=True +) + +image2image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", image=text2image)[0][0] +make_image_grid([text2image, image2image], rows=1, cols=2) +``` + +### Image-to-image-to-image + +You can also chain multiple image-to-image pipelines together to create more interesting images. This can be useful for iteratively performing style transfer on an image, generating short GIFs, restoring color to an image, or restoring missing areas of an image. + +Start by generating an image: + +```py +import mindspore as ms +from mindone.diffusers import DiffusionPipeline +from mindone.diffusers.utils import make_image_grid, load_image + +pipeline = DiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", mindspore_dtype=ms.float16, variant="fp16", use_safetensors=True +) + +# prepare image +url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png" +init_image = load_image(url) + +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" + +# pass prompt and image to pipeline +image = pipeline(prompt, image=init_image, output_type="latent")[0][0] +``` + +!!! tip + + It is important to specify `output_type="latent"` in the pipeline to keep all the outputs in latent space to avoid an unnecessary decode-encode step. This only works if the chained pipelines are using the same VAE. 
+ +Pass the latent output from this pipeline to the next pipeline to generate an image in a [comic book art style](https://huggingface.co/ogkalu/Comic-Diffusion): + +```py +pipeline = DiffusionPipeline.from_pretrained( + "ogkalu/Comic-Diffusion", mindspore_dtype=ms.float16, use_safetensors=True +) + +# need to include the token "charliebo artstyle" in the prompt to use this checkpoint +image = pipeline("Astronaut in a jungle, charliebo artstyle", image=image, output_type="latent")[0][0] +``` + +Repeat one more time to generate the final image in a [pixel art style](https://huggingface.co/kohbanye/pixel-art-style): + +```py +pipeline = DiffusionPipeline.from_pretrained( + "kohbanye/pixel-art-style", mindspore_dtype=ms.float16, use_safetensors=True +) + +# need to include the token "pixelartstyle" in the prompt to use this checkpoint +image = pipeline("Astronaut in a jungle, pixelartstyle", image=image)[0][0] +make_image_grid([init_image, image], rows=1, cols=2) +``` diff --git a/docs/diffusers/using-diffusers/inference_with_lcm.md b/docs/diffusers/using-diffusers/inference_with_lcm.md new file mode 100644 index 0000000000..51c5f2dffc --- /dev/null +++ b/docs/diffusers/using-diffusers/inference_with_lcm.md @@ -0,0 +1,612 @@ + + +# Latent Consistency Model + +[Latent Consistency Models (LCMs)](https://hf.co/papers/2310.04378) enable fast high-quality image generation by directly predicting the reverse diffusion process in the latent rather than pixel space. In other words, LCMs try to predict the noiseless image from the noisy image in contrast to typical diffusion models that iteratively remove noise from the noisy image. By avoiding the iterative sampling process, LCMs are able to generate high-quality images in 2-4 steps instead of 20-30 steps. + +LCMs are distilled from pretrained models which requires ~32 hours of A100 compute. To speed this up, [LCM-LoRAs](https://hf.co/papers/2311.05556) train a [LoRA adapter](https://huggingface.co/docs/peft/conceptual_guides/adapter#low-rank-adaptation-lora) which have much fewer parameters to train compared to the full model. The LCM-LoRA can be plugged into a diffusion model once it has been trained. + +This guide will show you how to use LCMs and LCM-LoRAs for fast inference on tasks and how to use them with other adapters like ControlNet or T2I-Adapter. + +!!! tip + + LCMs and LCM-LoRAs are available for Stable Diffusion v1.5, Stable Diffusion XL, and the SSD-1B model. You can find their checkpoints on the [Latent Consistency](https://hf.co/collections/latent-consistency/latent-consistency-models-weights-654ce61a95edd6dffccef6a8) Collections. + +## Text-to-image + +=== "LCM" + + To use LCMs, you need to load the LCM checkpoint for your supported model into [`UNet2DConditionModel`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/unet2d-cond/#mindone.diffusers.UNet2DConditionModel) and replace the scheduler with the [`LCMScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/lcm/#mindone.diffusers.LCMScheduler). Then you can use the pipeline as usual, and pass a text prompt to generate an image in just 4 steps. + + A couple of notes to keep in mind when using LCMs are: + + * Typically, batch size is doubled inside the pipeline for classifier-free guidance. But LCM applies guidance with guidance embeddings and doesn't need to double the batch size, which leads to faster inference. The downside is that negative prompts don't work with LCM because they don't have any effect on the denoising process. 
+ * The ideal range for `guidance_scale` is [3., 13.] because that is what the UNet was trained with. However, disabling `guidance_scale` with a value of 1.0 is also effective in most cases. + + ```python + from mindone.diffusers import StableDiffusionXLPipeline, UNet2DConditionModel, LCMScheduler + import mindspore as ms + import numpy as np + + unet = UNet2DConditionModel.from_pretrained( + "latent-consistency/lcm-sdxl", + mindspore_dtype=ms.float16, + variant="fp16", + ) + pipe = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", unet=unet, mindspore_dtype=ms.float16, + ) + pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + + prompt = "Self-portrait oil painting, a beautiful cyborg with golden hair, 8k" + generator = np.random.Generator(np.random.PCG64(0)) + image = pipe( + prompt=prompt, num_inference_steps=4, generator=generator, guidance_scale=8.0 + )[0][0] + image + ``` + +
+ +
+ +=== "LCM-LoRA" + + To use LCM-LoRAs, you need to replace the scheduler with the [`LCMScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/lcm/#mindone.diffusers.LCMScheduler) and load the LCM-LoRA weights with the [`load_lora_weights`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/lora/#mindone.diffusers.loaders.lora.LoraLoaderMixin.load_lora_weights) method. Then you can use the pipeline as usual, and pass a text prompt to generate an image in just 4 steps. + + A couple of notes to keep in mind when using LCM-LoRAs are: + + * Typically, batch size is doubled inside the pipeline for classifier-free guidance. But LCM applies guidance with guidance embeddings and doesn't need to double the batch size, which leads to faster inference. The downside is that negative prompts don't work with LCM because they don't have any effect on the denoising process. + * You could use guidance with LCM-LoRAs, but it is very sensitive to high `guidance_scale` values and can lead to artifacts in the generated image. The best values we've found are between [1.0, 2.0]. + * Replace [stabilityai/stable-diffusion-xl-base-1.0](https://hf.co/stabilityai/stable-diffusion-xl-base-1.0) with any finetuned model. For example, try using the [animagine-xl](https://huggingface.co/Linaqruf/animagine-xl) checkpoint to generate anime images with SDXL. + + ```py + import mindspore as ms + from mindone.diffusers import DiffusionPipeline, LCMScheduler + import numpy as np + + pipe = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + mindspore_dtype=ms.float16 + ) + pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl") + + prompt = "Self-portrait oil painting, a beautiful cyborg with golden hair, 8k" + generator = np.random.Generator(np.random.PCG64(42)) + image = pipe( + prompt=prompt, num_inference_steps=4, generator=generator, guidance_scale=1.0 + )[0][0] + image + ``` + +
+ +
+ +## Image-to-image + +=== "LCM" + + To use LCMs for image-to-image, you need to load the LCM checkpoint for your supported model into [`UNet2DConditionModel`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/unet2d-cond/#mindone.diffusers.UNet2DConditionModel) and replace the scheduler with the [`LCMScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/lcm/#mindone.diffusers.LCMScheduler). Then you can use the pipeline as usual, and pass a text prompt and initial image to generate an image in just 4 steps. + + !!! tip + + Experiment with different values for `num_inference_steps`, `strength`, and `guidance_scale` to get the best results. + + ```python + import mindspore as ms + from mindone.diffusers import StableDiffusionImg2ImgPipeline, UNet2DConditionModel, LCMScheduler + from mindone.diffusers.utils import load_image + import numpy as np + + unet = UNet2DConditionModel.from_pretrained( + "SimianLuo/LCM_Dreamshaper_v7", + subfolder="unet", + mindspore_dtype=ms.float16, + ) + + pipe = StableDiffusionImg2ImgPipeline.from_pretrained( + "Lykon/dreamshaper-7", + unet=unet, + mindspore_dtype=ms.float16, + variant="fp16", + ) + pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + + init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png") + prompt = "Astronauts in a jungle, cold color palette, muted colors, detailed, 8k" + generator = np.random.Generator(np.random.PCG64(42)) + image = pipe( + prompt, + image=init_image, + num_inference_steps=4, + guidance_scale=7.5, + strength=0.5, + generator=generator + )[0][0] + image + ``` + +
+     *Figure: initial image (left), generated image (right)*
+ +=== "LCM-LoRA" + + To use LCM-LoRAs for image-to-image, you need to replace the scheduler with the [`LCMScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/lcm/#mindone.diffusers.LCMScheduler) and load the LCM-LoRA weights with the [`load_lora_weights`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/lora/#mindone.diffusers.loaders.lora.LoraLoaderMixin.load_lora_weights) method. Then you can use the pipeline as usual, and pass a text prompt and initial image to generate an image in just 4 steps. + + !!! tip + + Experiment with different values for `num_inference_steps`, `strength`, and `guidance_scale` to get the best results. + + ```py + import mindspore as ms + from mindone.diffusers import StableDiffusionImg2ImgPipeline, LCMScheduler + from mindone.diffusers.utils import make_image_grid, load_image + import numpy as np + + pipe = StableDiffusionImg2ImgPipeline.from_pretrained( + "Lykon/dreamshaper-7", + mindspore_dtype=ms.float16, + variant="fp16", + ) + + pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + + pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5") + + init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png") + prompt = "Astronauts in a jungle, cold color palette, muted colors, detailed, 8k" + + generator = np.random.Generator(np.random.PCG64(0)) + image = pipe( + prompt, + image=init_image, + num_inference_steps=4, + guidance_scale=1, + strength=0.6, + generator=generator + )[0][0] + image + ``` + +
+     *Figure: initial image (left), generated image (right)*
+ +## Inpainting + +To use LCM-LoRAs for inpainting, you need to replace the scheduler with the [`LCMScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/lcm/#mindone.diffusers.LCMScheduler) and load the LCM-LoRA weights with the [`load_lora_weights`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/lora/#mindone.diffusers.loaders.lora.LoraLoaderMixin.load_lora_weights) method. Then you can use the pipeline as usual, and pass a text prompt, initial image, and mask image to generate an image in just 4 steps. + +```py +import mindspore as ms +from mindone.diffusers import StableDiffusionInpaintPipeline, LCMScheduler +from mindone.diffusers.utils import load_image, make_image_grid +import numpy as np + +pipe = StableDiffusionInpaintPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-inpainting", + mindspore_dtype=ms.float16, + variant="fp16", +) + +pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + +pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5") + +init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png") +mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png") + +prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k" +generator = np.random.Generator(np.random.PCG64(42)) +image = pipe( + prompt=prompt, + image=init_image, + mask_image=mask_image, + generator=generator, + num_inference_steps=4, + guidance_scale=4, +)[0][0] +image +``` + +
+ *Figure: initial image (left), generated image (right)*
+ +## Adapters + +LCMs are compatible with adapters like LoRA, ControlNet, T2I-Adapter, and AnimateDiff. You can bring the speed of LCMs to these adapters to generate images in a certain style or condition the model on another input like a canny image. + +### LoRA + +[LoRA](../using-diffusers/loading_adapters.md#lora) adapters can be rapidly finetuned to learn a new style from just a few images and plugged into a pretrained model to generate images in that style. + +=== "LCM" + + Load the LCM checkpoint for your supported model into [`UNet2DConditionModel`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/unet2d-cond/#mindone.diffusers.UNet2DConditionModel) and replace the scheduler with the [`LCMScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/lcm/#mindone.diffusers.LCMScheduler). Then you can use the [`load_lora_weights`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/lora/#mindone.diffusers.loaders.lora.LoraLoaderMixin.load_lora_weights) method to load the LoRA weights into the LCM and generate a styled image in a few steps. + + ```python + from mindone.diffusers import StableDiffusionXLPipeline, UNet2DConditionModel, LCMScheduler + import mindspore as ms + import numpy as np + + unet = UNet2DConditionModel.from_pretrained( + "latent-consistency/lcm-sdxl", + mindspore_dtype=ms.float16, + variant="fp16", + ) + pipe = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", unet=unet, mindspore_dtype=ms.float16 + ) + pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + pipe.load_lora_weights("TheLastBen/Papercut_SDXL", weight_name="papercut.safetensors", adapter_name="papercut") + + prompt = "papercut, a cute fox" + generator = np.random.Generator(np.random.PCG64(0)) + image = pipe( + prompt=prompt, num_inference_steps=4, generator=generator, guidance_scale=8.0 + )[0][0] + image + ``` + +
+ +
+ +=== "LCM-LoRA" + + Replace the scheduler with the [`LCMScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/lcm/#mindone.diffusers.LCMScheduler). Then you can use the [`load_lora_weights`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/lora/#mindone.diffusers.loaders.lora.LoraLoaderMixin.load_lora_weights) method to load the LCM-LoRA weights and the style LoRA you want to use. Combine both LoRA adapters with the [`set_adapters`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/unet/#mindone.diffusers.loaders.unet.UNet2DConditionLoadersMixin.set_adapters) method and generate a styled image in a few steps. + + ```py + import mindspore as ms + from mindone.diffusers import DiffusionPipeline, LCMScheduler + import numpy as np + + pipe = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + mindspore_dtype=ms.float16 + ) + + pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + + pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl", adapter_name="lcm") + pipe.load_lora_weights("TheLastBen/Papercut_SDXL", weight_name="papercut.safetensors", adapter_name="papercut") + + pipe.set_adapters(["lcm", "papercut"], adapter_weights=[1.0, 0.8]) + + prompt = "papercut, a cute fox" + generator = np.random.Generator(np.random.PCG64(0)) + image = pipe(prompt, num_inference_steps=4, guidance_scale=1, generator=generator)[0][0] + image + ``` + +
+ +
+ +### ControlNet + +[ControlNet](./controlnet.md) are adapters that can be trained on a variety of inputs like canny edge, pose estimation, or depth. The ControlNet can be inserted into the pipeline to provide additional conditioning and control to the model for more accurate generation. + +You can find additional ControlNet models trained on other inputs in [lllyasviel's](https://hf.co/lllyasviel) repository. + +=== "LCM" + + Load a ControlNet model trained on canny images and pass it to the [`ControlNetModel`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/controlnet/#mindone.diffusers.ControlNetModel). Then you can load a LCM model into [`StableDiffusionControlNetPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/controlnet#mindone.diffusers.StableDiffusionControlNetPipeline) and replace the scheduler with the [`LCMScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/lcm/#mindone.diffusers.LCMScheduler). Now pass the canny image to the pipeline and generate an image. + + !!! tip + + Experiment with different values for `num_inference_steps`, `controlnet_conditioning_scale`, `cross_attention_kwargs`, and `guidance_scale` to get the best results. + + ```python + import mindspore as ms + import cv2 + import numpy as np + from PIL import Image + + from mindone.diffusers import StableDiffusionControlNetPipeline, ControlNetModel, LCMScheduler + from mindone.diffusers.utils import load_image, make_image_grid + + image = load_image( + "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png" + ).resize((512, 512)) + + image = np.array(image) + + low_threshold = 100 + high_threshold = 200 + + image = cv2.Canny(image, low_threshold, high_threshold) + image = image[:, :, None] + image = np.concatenate([image, image, image], axis=2) + canny_image = Image.fromarray(image) + + controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", mindspore_dtype=ms.float16) + pipe = StableDiffusionControlNetPipeline.from_pretrained( + "SimianLuo/LCM_Dreamshaper_v7", + controlnet=controlnet, + mindspore_dtype=ms.float16, + safety_checker=None, + ) + pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + + generator = np.random.Generator(np.random.PCG64(0)) + image = pipe( + "the mona lisa", + image=canny_image, + num_inference_steps=4, + generator=generator, + )[0][0] + make_image_grid([canny_image, image], rows=1, cols=2) + ``` + +
+ +
+ +=== "LCM-LoRA" + + Load a ControlNet model trained on canny images and pass it to the [`ControlNetModel`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/controlnet/#mindone.diffusers.ControlNetModel). Then you can load a Stable Diffusion v1.5 model into [`StableDiffusionControlNetPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/controlnet#mindone.diffusers.StableDiffusionControlNetPipeline) and replace the scheduler with the [`LCMScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/lcm/#mindone.diffusers.LCMScheduler). Use the [`load_lora_weights`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/lora/#mindone.diffusers.loaders.lora.LoraLoaderMixin.load_lora_weights) method to load the LCM-LoRA weights, and pass the canny image to the pipeline and generate an image. + + !!! tip + + Experiment with different values for `num_inference_steps`, `controlnet_conditioning_scale`, `cross_attention_kwargs`, and `guidance_scale` to get the best results. + + ```py + import mindspore as ms + import cv2 + import numpy as np + from PIL import Image + + from mindone.diffusers import StableDiffusionControlNetPipeline, ControlNetModel, LCMScheduler + from mindone.diffusers.utils import load_image + + image = load_image( + "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png" + ).resize((512, 512)) + + image = np.array(image) + + low_threshold = 100 + high_threshold = 200 + + image = cv2.Canny(image, low_threshold, high_threshold) + image = image[:, :, None] + image = np.concatenate([image, image, image], axis=2) + canny_image = Image.fromarray(image) + + controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", mindspore_dtype=ms.float16) + pipe = StableDiffusionControlNetPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", + controlnet=controlnet, + mindspore_dtype=ms.float16, + safety_checker=None, + variant="fp16" + ) + + pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + + pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5") + + generator = np.random.Generator(np.random.PCG64(0)) + image = pipe( + "the mona lisa", + image=canny_image, + num_inference_steps=4, + guidance_scale=1.5, + controlnet_conditioning_scale=0.8, + generator=generator, + )[0][0] + image + ``` + +
+ +
+ +### T2I-Adapter + +[T2I-Adapter](./t2i_adapter.md) is an even more lightweight adapter than ControlNet, that provides an additional input to condition a pretrained model with. It is faster than ControlNet but the results may be slightly worse. + +You can find additional T2I-Adapter checkpoints trained on other inputs in [TencentArc's](https://hf.co/TencentARC) repository. + +=== "LCM" + + Load a T2IAdapter trained on canny images and pass it to the [`StableDiffusionXLAdapterPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/adapter/#mindone.diffusers.StableDiffusionXLAdapterPipeline). Then load a LCM checkpoint into [`UNet2DConditionModel`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/unet2d-cond/#mindone.diffusers.UNet2DConditionModel) and replace the scheduler with the [`LCMScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/lcm/#mindone.diffusers.LCMScheduler). Now pass the canny image to the pipeline and generate an image. + + ```python + import mindspore as ms + import cv2 + import numpy as np + from PIL import Image + + from mindone.diffusers import StableDiffusionXLAdapterPipeline, UNet2DConditionModel, T2IAdapter, LCMScheduler + from mindone.diffusers.utils import load_image, make_image_grid + + # detect the canny map in low resolution to avoid high-frequency details + image = load_image( + "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png" + ).resize((384, 384)) + + image = np.array(image) + + low_threshold = 100 + high_threshold = 200 + + image = cv2.Canny(image, low_threshold, high_threshold) + image = image[:, :, None] + image = np.concatenate([image, image, image], axis=2) + canny_image = Image.fromarray(image).resize((1024, 1216)) + + adapter = T2IAdapter.from_pretrained("TencentARC/t2i-adapter-canny-sdxl-1.0", mindspore_dtype=ms.float16, varient="fp16") + + unet = UNet2DConditionModel.from_pretrained( + "latent-consistency/lcm-sdxl", + mindspore_dtype=ms.float16, + variant="fp16", + ) + pipe = StableDiffusionXLAdapterPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + unet=unet, + adapter=adapter, + mindspore_dtype=ms.float16, + ) + + pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + + prompt = "the mona lisa, 4k picture, high quality" + negative_prompt = "extra digit, fewer digits, cropped, worst quality, low quality, glitch, deformed, mutated, ugly, disfigured" + + generator = np.random.Generator(np.random.PCG64(0)) + image = pipe( + prompt=prompt, + negative_prompt=negative_prompt, + image=canny_image, + num_inference_steps=4, + guidance_scale=5, + adapter_conditioning_scale=0.8, + adapter_conditioning_factor=1, + generator=generator, + )[0][0] + ``` + +
+ +
+ +=== "LCM-LoRA" + + Load a T2IAdapter trained on canny images and pass it to the [`StableDiffusionXLAdapterPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/adapter/#mindone.diffusers.StableDiffusionXLAdapterPipeline). Replace the scheduler with the [`LCMScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/lcm/#mindone.diffusers.LCMScheduler), and use the [`load_lora_weights`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/lora/#mindone.diffusers.loaders.lora.LoraLoaderMixin.load_lora_weights) method to load the LCM-LoRA weights. Pass the canny image to the pipeline and generate an image. + + ```py + import mindspore as ms + import cv2 + import numpy as np + from PIL import Image + + from mindone.diffusers import StableDiffusionXLAdapterPipeline, UNet2DConditionModel, T2IAdapter, LCMScheduler + from mindone.diffusers.utils import load_image, make_image_grid + + # detect the canny map in low resolution to avoid high-frequency details + image = load_image( + "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png" + ).resize((384, 384)) + + image = np.array(image) + + low_threshold = 100 + high_threshold = 200 + + image = cv2.Canny(image, low_threshold, high_threshold) + image = image[:, :, None] + image = np.concatenate([image, image, image], axis=2) + canny_image = Image.fromarray(image).resize((1024, 1024)) + + adapter = T2IAdapter.from_pretrained("TencentARC/t2i-adapter-canny-sdxl-1.0", mindspore_dtype=ms.float16, varient="fp16") + + pipe = StableDiffusionXLAdapterPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + adapter=adapter, + mindspore_dtype=ms.float16, + ) + + pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + + pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl") + + prompt = "the mona lisa, 4k picture, high quality" + negative_prompt = "extra digit, fewer digits, cropped, worst quality, low quality, glitch, deformed, mutated, ugly, disfigured" + + generator = np.random.Generator(np.random.PCG64(0)) + image = pipe( + prompt=prompt, + negative_prompt=negative_prompt, + image=canny_image, + num_inference_steps=4, + guidance_scale=1.5, + adapter_conditioning_scale=0.8, + adapter_conditioning_factor=1, + generator=generator, + )[0][0] + ``` + +
+ +
+ +### AnimateDiff + +[AnimateDiff](../api/pipelines/animatediff.md) is an adapter that adds motion to an image. It can be used with most Stable Diffusion models, effectively turning them into "video generation" models. Generating good results with a video model usually requires generating multiple frames (16-24), which can be very slow with a regular Stable Diffusion model. LCM-LoRA can speed up this process by only taking 4-8 steps for each frame. + +Load a [`AnimateDiffPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/animatediff/#mindone.diffusers.AnimateDiffPipeline) and pass a [`MotionAdapter`] to it. Then replace the scheduler with the [`LCMScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/lcm/#mindone.diffusers.LCMScheduler), and combine both LoRA adapters with the [`set_adapters`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/unet/#mindone.diffusers.loaders.unet.UNet2DConditionLoadersMixin.set_adapters) method. Now you can pass a prompt to the pipeline and generate an animated image. + +```py +import mindspore as ms +from mindone.diffusers import MotionAdapter, AnimateDiffPipeline, DDIMScheduler, LCMScheduler +from mindone.diffusers.utils import export_to_gif +import numpy as np + +adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5", mindspore_dtype=ms.float16) +pipe = AnimateDiffPipeline.from_pretrained( + "frankjoshua/toonyou_beta6", + motion_adapter=adapter, + mindspore_dtype=ms.float16, +) + +# set scheduler +pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + +# load LCM-LoRA +pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5", adapter_name="lcm") +pipe.load_lora_weights("guoyww/animatediff-motion-lora-zoom-in", weight_name="diffusion_pytorch_model.safetensors", adapter_name="motion-lora") + +pipe.set_adapters(["lcm", "motion-lora"], adapter_weights=[0.55, 1.2]) + +prompt = "best quality, masterpiece, 1girl, looking at viewer, blurry background, upper body, contemporary, dress" +generator = np.random.Generator(np.random.PCG64(0)) +frames = pipe( + prompt=prompt, + num_inference_steps=5, + guidance_scale=1.25, + num_frames=24, + generator=generator +)[0][0] +export_to_gif(frames, "animation.gif") +``` + +
+ +
diff --git a/docs/diffusers/using-diffusers/inference_with_tcd_lora.md b/docs/diffusers/using-diffusers/inference_with_tcd_lora.md new file mode 100644 index 0000000000..4edecbea73 --- /dev/null +++ b/docs/diffusers/using-diffusers/inference_with_tcd_lora.md @@ -0,0 +1,420 @@ + + +# Trajectory Consistency Distillation-LoRA + +Trajectory Consistency Distillation (TCD) enables a model to generate higher quality and more detailed images with fewer steps. Moreover, owing to the effective error mitigation during the distillation process, TCD demonstrates superior performance even under conditions of large inference steps. + +The major advantages of TCD are: + +- Better than Teacher: TCD demonstrates superior generative quality at both small and large inference steps and exceeds the performance of [DPM-Solver++(2S)](../../api/schedulers/multistep_dpm_solver) with Stable Diffusion XL (SDXL). There is no additional discriminator or LPIPS supervision included during TCD training. + +- Flexible Inference Steps: The inference steps for TCD sampling can be freely adjusted without adversely affecting the image quality. + +- Freely change detail level: During inference, the level of detail in the image can be adjusted with a single hyperparameter, *gamma*. + +!!! tip + + For more technical details of TCD, please refer to the [paper](https://arxiv.org/abs/2402.19159) or official [project page](https://mhh0318.github.io/tcd/)). + +For large models like SDXL, TCD is trained with [LoRA](https://huggingface.co/docs/peft/conceptual_guides/adapter#low-rank-adaptation-lora) to reduce memory usage. This is also useful because you can reuse LoRAs between different finetuned models, as long as they share the same base model, without further training. + + + +This guide will show you how to perform inference with TCD-LoRAs for a variety of tasks like text-to-image and inpainting, as well as how you can easily combine TCD-LoRAs with other adapters. Choose one of the supported base model and it's corresponding TCD-LoRA checkpoint from the table below to get started. + +| Base model | TCD-LoRA checkpoint | +|-------------------------------------------------------------------------------------------------|----------------------------------------------------------------| +| [stable-diffusion-v1-5](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) | [TCD-SD15](https://huggingface.co/h1t/TCD-SD15-LoRA) | +| [stable-diffusion-2-1-base](https://huggingface.co/stabilityai/stable-diffusion-2-1-base) | [TCD-SD21-base](https://huggingface.co/h1t/TCD-SD21-base-LoRA) | +| [stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) | [TCD-SDXL](https://huggingface.co/h1t/TCD-SDXL-LoRA) | + +## General tasks + +In this guide, let's use the [`StableDiffusionXLPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl/#mindone.diffusers.StableDiffusionXLPipeline) and the [`TCDScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/tcd/#mindone.diffusers.TCDScheduler). Use the [`load_lora_weights`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/text2img/#mindone.diffusers.StableDiffusionPipeline) method to load the SDXL-compatible TCD-LoRA weights. + +A few tips to keep in mind for TCD-LoRA inference are to: + +- Keep the `num_inference_steps` between 4 and 50 +- Set `eta` (used to control stochasticity at each step) between 0 and 1. 
You should use a higher `eta` when increasing the number of inference steps, but the downside is that a larger `eta` in [`TCDScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/tcd/#mindone.diffusers.TCDScheduler) leads to blurrier images. A value of 0.3 is recommended to produce good results. + +=== "text-to-image" + + ```python + import mindspore as ms + from mindone.diffusers import StableDiffusionXLPipeline, TCDScheduler + import numpy as np + + base_model_id = "stabilityai/stable-diffusion-xl-base-1.0" + tcd_lora_id = "h1t/TCD-SDXL-LoRA" + + pipe = StableDiffusionXLPipeline.from_pretrained(base_model_id, mindspore_dtype=ms.float16) + pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config) + + pipe.load_lora_weights(tcd_lora_id) + pipe.fuse_lora() + + prompt = "Painting of the orange cat Otto von Garfield, Count of Bismarck-Schönhausen, Duke of Lauenburg, Minister-President of Prussia. Depicted wearing a Prussian Pickelhaube and eating his favorite meal - lasagna." + + image = pipe( + prompt=prompt, + num_inference_steps=4, + guidance_scale=0, + eta=0.3, + generator=np.random.Generator(np.random.PCG64(0)), + )[0][0] + ``` + +
+ +
+ +=== "inpainting" + + ```python + import mindspore as ms + from mindone.diffusers import StableDiffusionXLInpaintPipeline, TCDScheduler + from mindone.diffusers.utils import load_image, make_image_grid + import numpy as np + + base_model_id = "diffusers/stable-diffusion-xl-1.0-inpainting-0.1" + tcd_lora_id = "h1t/TCD-SDXL-LoRA" + + pipe = StableDiffusionXLInpaintPipeline.from_pretrained(base_model_id, mindspore_dtype=ms.float16, variant="fp16") + pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config) + + pipe.load_lora_weights(tcd_lora_id) + pipe.fuse_lora() + + img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" + mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" + + init_image = load_image(img_url).resize((1024, 1024)) + mask_image = load_image(mask_url).resize((1024, 1024)) + + prompt = "a tiger sitting on a park bench" + + image = pipe( + prompt=prompt, + image=init_image, + mask_image=mask_image, + num_inference_steps=8, + guidance_scale=0, + eta=0.3, + strength=0.99, # make sure to use `strength` below 1.0 + generator=np.random.Generator(np.random.PCG64(0)), + )[0][0] + + grid_image = make_image_grid([init_image, mask_image, image], rows=1, cols=3) + ``` + +
+ +
+ +## Community models + +TCD-LoRA also works with many community finetuned models and plugins. For example, load the [animagine-xl-3.0](https://huggingface.co/cagliostrolab/animagine-xl-3.0) checkpoint which is a community finetuned version of SDXL for generating anime images. + +```python +import mindspore as ms +from mindone.diffusers import StableDiffusionXLPipeline, TCDScheduler +import numpy as np + +base_model_id = "cagliostrolab/animagine-xl-3.0" +tcd_lora_id = "h1t/TCD-SDXL-LoRA" + +pipe = StableDiffusionXLPipeline.from_pretrained(base_model_id, mindspore_dtype=ms.float16, variant="fp16") +pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config) + +pipe.load_lora_weights(tcd_lora_id) +pipe.fuse_lora() + +prompt = "A man, clad in a meticulously tailored military uniform, stands with unwavering resolve. The uniform boasts intricate details, and his eyes gleam with determination. Strands of vibrant, windswept hair peek out from beneath the brim of his cap." + +image = pipe( + prompt=prompt, + num_inference_steps=8, + guidance_scale=0, + eta=0.3, + generator=np.random.Generator(np.random.PCG64(0)), +)[0][0] +``` + +
+ +
+ +TCD-LoRA also supports other LoRAs trained on different styles. For example, let's load the [TheLastBen/Papercut_SDXL](https://huggingface.co/TheLastBen/Papercut_SDXL) LoRA and fuse it with the TCD-LoRA with the [`set_adapters`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/unet/#mindone.diffusers.loaders.unet.UNet2DConditionLoadersMixin.set_adapters) method. + +!!! tip + + Check out the [Merge LoRAs](merge_loras.md) guide to learn more about efficient merging methods. + +```python +import mindspore as ms +from mindone.diffusers import StableDiffusionXLPipeline, TCDScheduler +import numpy as np + +base_model_id = "stabilityai/stable-diffusion-xl-base-1.0" +tcd_lora_id = "h1t/TCD-SDXL-LoRA" +styled_lora_id = "TheLastBen/Papercut_SDXL" + +pipe = StableDiffusionXLPipeline.from_pretrained(base_model_id, mindspore_dtype=ms.float16) +pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config) + +pipe.load_lora_weights(tcd_lora_id, adapter_name="tcd") +pipe.load_lora_weights(styled_lora_id, adapter_name="style") +pipe.set_adapters(["tcd", "style"], adapter_weights=[1.0, 1.0]) + +prompt = "papercut of a winter mountain, snow" + +image = pipe( + prompt=prompt, + num_inference_steps=4, + guidance_scale=0, + eta=0.3, + generator=np.random.Generator(np.random.PCG64(0)), +)[0][0] +``` + +
+ +
+ +## Adapters + +TCD-LoRA is very versatile, and it can be combined with other adapter types like ControlNets, IP-Adapter, and AnimateDiff. + +=== "ControlNet" + + ### Depth ControlNet + + ```python + import mindspore as ms + from mindspore import ops + import numpy as np + from PIL import Image + from transformers import DPTFeatureExtractor + from mindone.transformers import DPTForDepthEstimation + from mindone.diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline, TCDScheduler + from mindone.diffusers.utils import load_image, make_image_grid + + depth_estimator = DPTForDepthEstimation.from_pretrained("Intel/dpt-hybrid-midas") + feature_extractor = DPTFeatureExtractor.from_pretrained("Intel/dpt-hybrid-midas") + + def get_depth_map(image): + image = feature_extractor(images=image, return_tensors="np").pixel_values + image = ms.Tensor(image) + + depth_map = depth_estimator(image)[0] + + depth_map = ops.interpolate( + depth_map.unsqueeze(1), + size=(1024, 1024), + mode="bicubic", + align_corners=False, + ) + depth_min = ops.amin(depth_map, axis=[1, 2, 3], keepdims=True) + depth_max = ops.amax(depth_map, axis=[1, 2, 3], keepdims=True) + depth_map = (depth_map - depth_min) / (depth_max - depth_min) + image = ops.cat([depth_map] * 3, axis=1) + + image = image.permute(0, 2, 3, 1).asnumpy()[0] + image = Image.fromarray((image * 255.0).clip(0, 255).astype(np.uint8)) + return image + + base_model_id = "stabilityai/stable-diffusion-xl-base-1.0" + controlnet_id = "diffusers/controlnet-depth-sdxl-1.0" + tcd_lora_id = "h1t/TCD-SDXL-LoRA" + + controlnet = ControlNetModel.from_pretrained( + controlnet_id, + mindspore_dtype=ms.float16, + variant="fp16", + ) + pipe = StableDiffusionXLControlNetPipeline.from_pretrained( + base_model_id, + controlnet=controlnet, + mindspore_dtype=ms.float16, + ) + + pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config) + + pipe.load_lora_weights(tcd_lora_id) + pipe.fuse_lora() + + prompt = "stormtrooper lecture, photorealistic" + + image = load_image("https://huggingface.co/lllyasviel/sd-controlnet-depth/resolve/main/images/stormtrooper.png") + depth_image = get_depth_map(image) + + controlnet_conditioning_scale = 0.5 # recommended for good generalization + + image = pipe( + prompt, + image=depth_image, + num_inference_steps=4, + guidance_scale=0, + eta=0.3, + controlnet_conditioning_scale=controlnet_conditioning_scale, + generator=np.random.Generator(np.random.PCG64(42)), + )[0][0] + + grid_image = make_image_grid([depth_image, image], rows=1, cols=2) + ``` + +
+ + ### Canny ControlNet + ```python + import mindspore as ms + from mindone.diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline, TCDScheduler + from mindone.diffusers.utils import load_image, make_image_grid + import numpy as np + + base_model_id = "stabilityai/stable-diffusion-xl-base-1.0" + controlnet_id = "diffusers/controlnet-canny-sdxl-1.0" + tcd_lora_id = "h1t/TCD-SDXL-LoRA" + + controlnet = ControlNetModel.from_pretrained( + controlnet_id, + mindspore_dtype=ms.float16, + ) + pipe = StableDiffusionXLControlNetPipeline.from_pretrained( + base_model_id, + controlnet=controlnet, + mindspore_dtype=ms.float16, + ) + + pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config) + + pipe.load_lora_weights(tcd_lora_id) + pipe.fuse_lora() + + prompt = "ultrarealistic shot of a furry blue bird" + + canny_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/bird_canny.png") + + controlnet_conditioning_scale = 0.5 # recommended for good generalization + + image = pipe( + prompt, + image=canny_image, + num_inference_steps=4, + guidance_scale=0, + eta=0.3, + controlnet_conditioning_scale=controlnet_conditioning_scale, + generator=np.random.Generator(np.random.PCG64(0)), + )[0][0] + + grid_image = make_image_grid([canny_image, image], rows=1, cols=2) + ``` + +
+
+    !!! tip
+
+        The inference parameters in this example might not work for all examples, so we recommend trying different values for the `num_inference_steps`, `guidance_scale`, `controlnet_conditioning_scale`, and `cross_attention_kwargs` parameters and choosing the combination that works best.
+
+=== "IP-Adapter"
+
+    This example shows how to use TCD-LoRA with IP-Adapter and SDXL.
+
+    ```python
+    import mindspore as ms
+    from mindone.diffusers import StableDiffusionXLPipeline, TCDScheduler
+    from mindone.diffusers.utils import load_image, make_image_grid
+    import numpy as np
+
+    base_model_path = "stabilityai/stable-diffusion-xl-base-1.0"
+    tcd_lora_id = "h1t/TCD-SDXL-LoRA"
+
+    pipe = StableDiffusionXLPipeline.from_pretrained(
+        base_model_path,
+        mindspore_dtype=ms.float16,
+    )
+    pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config)
+
+    pipe.load_lora_weights(tcd_lora_id)
+    pipe.fuse_lora()
+
+    pipe.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.safetensors")
+    pipe.set_ip_adapter_scale(0.5)
+
+    ref_image = load_image("https://raw.githubusercontent.com/tencent-ailab/IP-Adapter/main/assets/images/woman.png").resize((512, 512))
+
+    prompt = "best quality, high quality, wearing sunglasses"
+
+    generator = np.random.Generator(np.random.PCG64(0))
+
+    image = pipe(
+        prompt=prompt,
+        ip_adapter_image=ref_image,
+        guidance_scale=0,
+        num_inference_steps=50,
+        generator=generator,
+        eta=0.3,
+    )[0][0]
+
+    grid_image = make_image_grid([ref_image, image], rows=1, cols=2)
+    ```
+
+
+=== "AnimateDiff"
+
+    [`AnimateDiff`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/animatediff/#mindone.diffusers.AnimateDiffPipeline) allows animating images using Stable Diffusion models. TCD-LoRA can substantially accelerate the process without degrading image quality, and the animations it produces together with AnimateDiff also tend to look more lucid.
+
+    ```python
+    import mindspore as ms
+    from mindone.diffusers import MotionAdapter, AnimateDiffPipeline, DDIMScheduler, TCDScheduler
+    from mindone.diffusers.utils import export_to_gif
+    import numpy as np
+
+    adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5")
+    pipe = AnimateDiffPipeline.from_pretrained(
+        "frankjoshua/toonyou_beta6",
+        motion_adapter=adapter,
+    )
+
+    # set TCDScheduler
+    pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config)
+
+    # load TCD LoRA
+    pipe.load_lora_weights("h1t/TCD-SD15-LoRA", adapter_name="tcd")
+    pipe.load_lora_weights("guoyww/animatediff-motion-lora-zoom-in", weight_name="diffusion_pytorch_model.safetensors", adapter_name="motion-lora")
+
+    pipe.set_adapters(["tcd", "motion-lora"], adapter_weights=[1.0, 1.2])
+
+    prompt = "best quality, masterpiece, 1girl, looking at viewer, blurry background, upper body, contemporary, dress"
+    generator = np.random.Generator(np.random.PCG64(0))
+    frames = pipe(
+        prompt=prompt,
+        num_inference_steps=5,
+        guidance_scale=0,
+        num_frames=24,
+        eta=0.3,
+        generator=generator
+    )[0][0]
+    export_to_gif(frames, "animation.gif")
+    ```
+
diff --git a/docs/diffusers/using-diffusers/inpaint.md b/docs/diffusers/using-diffusers/inpaint.md new file mode 100644 index 0000000000..acfb4624c3 --- /dev/null +++ b/docs/diffusers/using-diffusers/inpaint.md @@ -0,0 +1,570 @@ + + +# Inpainting + +Inpainting replaces or edits specific areas of an image. This makes it a useful tool for image restoration like removing defects and artifacts, or even replacing an image area with something entirely new. Inpainting relies on a mask to determine which regions of an image to fill in; the area to inpaint is represented by white pixels and the area to keep is represented by black pixels. The white pixels are filled in by the prompt. + +With 🤗 Diffusers, here is how you can do inpainting: + +1. Load an inpainting checkpoint with the [`KandinskyV22InpaintCombinedPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/kandinsky_v22/#mindone.diffusers.KandinskyV22InpaintPipeline) class: + +```py +import mindspore as ms +from mindone.diffusers import KandinskyV22InpaintCombinedPipeline +from mindone.diffusers.utils import load_image, make_image_grid + +pipeline = KandinskyV22InpaintCombinedPipeline.from_pretrained( + "kandinsky-community/kandinsky-2-2-decoder-inpaint", mindspore_dtype=ms.float16 +) +``` + +2. Load the base and mask images: + +```py +init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png") +mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png") +``` + +3. Create a prompt to inpaint the image with and pass it to the pipeline with the base and mask images: + +```py +prompt = "a black cat with glowing eyes, cute, adorable, disney, pixar, highly detailed, 8k" +negative_prompt = "bad anatomy, deformed, ugly, disfigured" +image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=init_image, mask_image=mask_image)[0][0] +make_image_grid([init_image, mask_image, image], rows=1, cols=3) +``` + +
+
+*(figure: base image | mask image | generated image)*
+
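+
+If you don't already have a mask for your image, you can draw one yourself. The sketch below is one minimal way to do it with Pillow: white pixels mark the area to inpaint and black pixels mark the area to keep. The 512x512 canvas and the rectangle coordinates are placeholder values you would adapt to your own base image.
+
+```py
+from PIL import Image, ImageDraw
+
+# Start from an all-black canvas the same size as the base image (black = keep).
+mask = Image.new("L", (512, 512), 0)
+draw = ImageDraw.Draw(mask)
+
+# Paint the region you want the model to repaint in white (white = inpaint).
+draw.rectangle((100, 100, 400, 400), fill=255)
+
+mask.save("mask.png")
+```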
+ +## Popular models + +[Stable Diffusion Inpainting](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-inpainting), [Stable Diffusion XL (SDXL) Inpainting](https://huggingface.co/diffusers/stable-diffusion-xl-1.0-inpainting-0.1), and [Kandinsky 2.2 Inpainting](https://huggingface.co/kandinsky-community/kandinsky-2-2-decoder-inpaint) are among the most popular models for inpainting. SDXL typically produces higher resolution images than Stable Diffusion v1.5, and Kandinsky 2.2 is also capable of generating high-quality images. + +### Stable Diffusion Inpainting + +Stable Diffusion Inpainting is a latent diffusion model finetuned on 512x512 images on inpainting. It is a good starting point because it is relatively fast and generates good quality images. To use this model for inpainting, you'll need to pass a prompt, base and mask image to the pipeline: + +```py +import mindspore as ms +from mindone.diffusers import DiffusionPipeline +from mindone.diffusers.utils import load_image, make_image_grid +import numpy as np + +pipeline = DiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-inpainting", mindspore_dtype=ms.float16, variant="fp16" +) + +# load base and mask image +init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png") +mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png") + +generator = np.random.Generator(np.random.PCG64(92)) +prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k" +image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, generator=generator)[0][0] +make_image_grid([init_image, mask_image, image], rows=1, cols=3) +``` + +### Stable Diffusion XL (SDXL) Inpainting + +SDXL is a larger and more powerful version of Stable Diffusion v1.5. This model can follow a two-stage model process (though each model can also be used alone); the base model generates an image, and a refiner model takes that image and further enhances its details and quality. Take a look at the [SDXL](sdxl.md) guide for a more comprehensive guide on how to use SDXL and configure it's parameters. + +```py +import mindspore as ms +from mindone.diffusers import DiffusionPipeline +from mindone.diffusers.utils import load_image, make_image_grid +import numpy as np + +pipeline = DiffusionPipeline.from_pretrained( + "diffusers/stable-diffusion-xl-1.0-inpainting-0.1", mindspore_dtype=ms.float16, variant="fp16" +) + +# load base and mask image +init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png") +mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png") + +generator = np.random.Generator(np.random.PCG64(92)) +prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k" +image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, generator=generator)[0][0] +make_image_grid([init_image, mask_image, image], rows=1, cols=3) +``` + +### Kandinsky 2.2 Inpainting + +The Kandinsky model family is similar to SDXL because it uses two models as well; the image prior model creates image embeddings, and the diffusion model generates images from them. 
You can load the image prior and diffusion model separately, but the easiest way to use Kandinsky 2.2 is to load it into the [`KandinskyV22InpaintCombinedPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/kandinsky_v22/#mindone.diffusers.KandinskyV22InpaintCombinedPipeline) class. + +```py +import mindspore as ms +from mindone.diffusers import KandinskyV22InpaintCombinedPipeline +from mindone.diffusers.utils import load_image, make_image_grid +import numpy as np + +pipeline = KandinskyV22InpaintCombinedPipeline.from_pretrained( + "kandinsky-community/kandinsky-2-2-decoder-inpaint", mindspore_dtype=ms.float16 +) + +# load base and mask image +init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png") +mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png") + +generator = np.random.Generator(np.random.PCG64(92)) +prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k" +image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, generator=generator)[0][0] +make_image_grid([init_image, mask_image, image], rows=1, cols=3) +``` + +
+
+*(figure: base image | Stable Diffusion Inpainting | Stable Diffusion XL Inpainting | Kandinsky 2.2 Inpainting)*
+
+ +## Non-inpaint specific checkpoints + +So far, this guide has used inpaint specific checkpoints such as [stable-diffusion-v1-5/stable-diffusion-inpainting](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-inpainting). But you can also use regular checkpoints like [stable-diffusion-v1-5/stable-diffusion-v1-5](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5). Let's compare the results of the two checkpoints. + +The image on the left is generated from a regular checkpoint, and the image on the right is from an inpaint checkpoint. You'll immediately notice the image on the left is not as clean, and you can still see the outline of the area the model is supposed to inpaint. The image on the right is much cleaner and the inpainted area appears more natural. + +=== "stable-diffusion-v1-5/stable-diffusion-v1-5" + + ```py + import mindspore as ms + from mindone.diffusers import StableDiffusionInpaintPipeline + from mindone.diffusers.utils import load_image, make_image_grid + import numpy as np + + pipeline = StableDiffusionInpaintPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", mindspore_dtype=ms.float16, variant="fp16" + ) + + # load base and mask image + init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png") + mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png") + + generator = np.random.Generator(np.random.PCG64(92)) + prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k" + image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, generator=generator)[0][0] + make_image_grid([init_image, image], rows=1, cols=2) + ``` + +=== "stable-diffusion-v1-5/stable-diffusion-inpainting" + + ```py + import mindspore as ms + from mindone.diffusers import DiffusionPipeline + from mindone.diffusers.utils import load_image, make_image_grid + import numpy as np + + pipeline = DiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-inpainting", mindspore_dtype=ms.float16, variant="fp16" + ) + + # load base and mask image + init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png") + mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png") + + generator = np.random.Generator(np.random.PCG64(92)) + prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k" + image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, generator=generator)[0][0] + make_image_grid([init_image, image], rows=1, cols=2) + ``` + +
+
+*(figure: stable-diffusion-v1-5/stable-diffusion-v1-5 | stable-diffusion-v1-5/stable-diffusion-inpainting)*
+
+ +However, for more basic tasks like erasing an object from an image (like the rocks in the road for example), a regular checkpoint yields pretty good results. There isn't as noticeable of difference between the regular and inpaint checkpoint. + +=== "stable-diffusion-v1-5/stable-diffusion-v1-5" + + ```py + import mindspore as ms + from mindone.diffusers import StableDiffusionInpaintPipeline + from mindone.diffusers.utils import load_image, make_image_grid + import numpy as np + + pipeline = StableDiffusionInpaintPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", mindspore_dtype=ms.float16, variant="fp16" + ) + + # load base and mask image + init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png") + mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/road-mask.png") + + image = pipeline(prompt="road", image=init_image, mask_image=mask_image)[0][0] + make_image_grid([init_image, image], rows=1, cols=2) + ``` + +=== "stable-diffusion-v1-5/stable-diffusion-inpainting" + + ```py + import mindspore as ms + from mindone.diffusers import DiffusionPipeline + from mindone.diffusers.utils import load_image, make_image_grid + import numpy as np + + pipeline = DiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-inpainting", mindspore_dtype=ms.float16, variant="fp16" + ) + + # load base and mask image + init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png") + mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/road-mask.png") + + image = pipeline(prompt="road", image=init_image, mask_image=mask_image)[0][0] + make_image_grid([init_image, image], rows=1, cols=2) + ``` + +
+
+*(figure: stable-diffusion-v1-5/stable-diffusion-v1-5 | stable-diffusion-v1-5/stable-diffusion-inpainting)*
+
+ +The trade-off of using a non-inpaint specific checkpoint is the overall image quality may be lower, but it generally tends to preserve the mask area (that is why you can see the mask outline). The inpaint specific checkpoints are intentionally trained to generate higher quality inpainted images, and that includes creating a more natural transition between the masked and unmasked areas. As a result, these checkpoints are more likely to change your unmasked area. + +If preserving the unmasked area is important for your task, you can use the [`VaeImageProcessor.apply_overlay`] method to force the unmasked area of an image to remain the same at the expense of some more unnatural transitions between the masked and unmasked areas. + +```py +import PIL +import numpy as np +import mindspore as ms + +from mindone.diffusers import DiffusionPipeline +from mindone.diffusers.utils import load_image, make_image_grid + +pipeline = DiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-inpainting", + mindspore_dtype=ms.float16, +) + +img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" +mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" + +init_image = load_image(img_url).resize((512, 512)) +mask_image = load_image(mask_url).resize((512, 512)) + +prompt = "Face of a yellow cat, high resolution, sitting on a park bench" +repainted_image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image)[0][0] +repainted_image.save("repainted_image.png") + +unmasked_unchanged_image = pipeline.image_processor.apply_overlay(mask_image, init_image, repainted_image) +unmasked_unchanged_image.save("force_unmasked_unchanged.png") +make_image_grid([init_image, mask_image, repainted_image, unmasked_unchanged_image], rows=2, cols=2) +``` + +## Configure pipeline parameters + +Image features - like quality and "creativity" - are dependent on pipeline parameters. Knowing what these parameters do is important for getting the results you want. Let's take a look at the most important parameters and see how changing them affects the output. + +### Strength + +`strength` is a measure of how much noise is added to the base image, which influences how similar the output is to the base image. 
+ +* 📈 a high `strength` value means more noise is added to an image and the denoising process takes longer, but you'll get higher quality images that are more different from the base image +* 📉 a low `strength` value means less noise is added to an image and the denoising process is faster, but the image quality may not be as great and the generated image resembles the base image more + +```py +import mindspore as ms +from mindone.diffusers import DiffusionPipeline +from mindone.diffusers.utils import load_image, make_image_grid + +pipeline = DiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-inpainting", mindspore_dtype=ms.float16, variant="fp16" +) + +# load base and mask image +init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png") +mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png") + +prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k" +image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.6)[0][0] +make_image_grid([init_image, mask_image, image], rows=1, cols=3) +``` + +
+
+*(figure: strength = 0.6 | strength = 0.8 | strength = 1.0)*
+
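+
+To produce a comparison like the one above, you can sweep `strength` in a loop and collect the outputs into a grid. This is a minimal sketch that reuses the same checkpoint, images, and prompt as the example above; the three values are simply the ones shown in the figure.
+
+```py
+import mindspore as ms
+from mindone.diffusers import DiffusionPipeline
+from mindone.diffusers.utils import load_image, make_image_grid
+
+pipeline = DiffusionPipeline.from_pretrained(
+    "stable-diffusion-v1-5/stable-diffusion-inpainting", mindspore_dtype=ms.float16, variant="fp16"
+)
+
+# load base and mask image
+init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png")
+mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png")
+
+prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k"
+
+# Generate one image per strength value and compare them side by side.
+images = [
+    pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strength=s)[0][0]
+    for s in (0.6, 0.8, 1.0)
+]
+make_image_grid(images, rows=1, cols=3)
+```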
+ +### Guidance scale + +`guidance_scale` affects how aligned the text prompt and generated image are. + +* 📈 a high `guidance_scale` value means the prompt and generated image are closely aligned, so the output is a stricter interpretation of the prompt +* 📉 a low `guidance_scale` value means the prompt and generated image are more loosely aligned, so the output may be more varied from the prompt + +You can use `strength` and `guidance_scale` together for more control over how expressive the model is. For example, a combination high `strength` and `guidance_scale` values gives the model the most creative freedom. + +```py +import mindspore as ms +from mindone.diffusers import DiffusionPipeline +from mindone.diffusers.utils import load_image, make_image_grid + +pipeline = DiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-inpainting", mindspore_dtype=ms.float16, variant="fp16" +) + +# load base and mask image +init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png") +mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png") + +prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k" +image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, guidance_scale=2.5)[0][0] +make_image_grid([init_image, mask_image, image], rows=1, cols=3) +``` + +
+
+*(figure: guidance_scale = 2.5 | guidance_scale = 7.5 | guidance_scale = 12.5)*
+
+ +### Negative prompt + +A negative prompt assumes the opposite role of a prompt; it guides the model away from generating certain things in an image. This is useful for quickly improving image quality and preventing the model from generating things you don't want. + +```py +import mindspore as ms +from mindone.diffusers import DiffusionPipeline +from mindone.diffusers.utils import load_image, make_image_grid + +pipeline = DiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-inpainting", mindspore_dtype=ms.float16, variant="fp16" +) + +# load base and mask image +init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png") +mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png") + +prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k" +negative_prompt = "bad architecture, unstable, poor details, blurry" +image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=init_image, mask_image=mask_image)[0][0] +make_image_grid([init_image, mask_image, image], rows=1, cols=3) +``` + +
+
+*(figure: negative_prompt = "bad architecture, unstable, poor details, blurry")*
+
+ +### Padding mask crop + +A method for increasing the inpainting image quality is to use the [`padding_mask_crop`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/inpaint/#mindone.diffusers.StableDiffusionInpaintPipeline) parameter. When enabled, this option crops the masked area with some user-specified padding and it'll also crop the same area from the original image. Both the image and mask are upscaled to a higher resolution for inpainting, and then overlaid on the original image. This is a quick and easy way to improve image quality without using a separate pipeline like [`StableDiffusionUpscalePipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/upscale/#mindone.diffusers.StableDiffusionUpscalePipeline). + +Add the `padding_mask_crop` parameter to the pipeline call and set it to the desired padding value. + +```py +import mindspore as ms +from mindone.diffusers import StableDiffusionInpaintPipeline +from mindone.diffusers.utils import load_image +import numpy as np +from PIL import Image + +generator = np.random.Generator(np.random.PCG64(0)) +pipeline = StableDiffusionInpaintPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", mindspore_dtype=ms.float16) + +base = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png") +mask = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore_mask.png") + +image = pipeline("boat", image=base, mask_image=mask, strength=0.75, generator=generator, padding_mask_crop=32)[0][0] +image +``` + +
+
+*(figure: default inpaint image | inpaint image with `padding_mask_crop` enabled)*
+
+ +## Chained inpainting pipelines + +[`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline) can be chained with other 🤗 Diffusers pipelines to edit their outputs. This is often useful for improving the output quality from your other diffusion pipelines, and if you're using multiple pipelines, it can be more memory-efficient to chain them together to keep the outputs in latent space and reuse the same pipeline components. + +### Text-to-image-to-inpaint + +Chaining a text-to-image and inpainting pipeline allows you to inpaint the generated image, and you don't have to provide a base image to begin with. This makes it convenient to edit your favorite text-to-image outputs without having to generate an entirely new image. + +Start with the text-to-image pipeline to create a castle: + +```py +import mindspore as ms +from mindone.diffusers import DiffusionPipeline, KandinskyV22InpaintCombinedPipeline +from mindone.diffusers.utils import load_image, make_image_grid + +pipeline = DiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", mindspore_dtype=ms.float16, variant="fp16", use_safetensors=True +) + +text2image = pipeline("concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k")[0][0] +``` + +Load the mask image of the output from above: + +```py +mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_text-chain-mask.png") +``` + +And let's inpaint the masked area with a waterfall: + +```py +pipeline = KandinskyV22InpaintCombinedPipeline.from_pretrained( + "kandinsky-community/kandinsky-2-2-decoder-inpaint", mindspore_dtype=ms.float16 +) + +prompt = "digital painting of a fantasy waterfall, cloudy" +image = pipeline(prompt=prompt, image=text2image, mask_image=mask_image)[0][0] +make_image_grid([text2image, mask_image, image], rows=1, cols=3) +``` + +
+
+*(figure: text-to-image | inpaint)*
+
+ +### Inpaint-to-image-to-image + +You can also chain an inpainting pipeline before another pipeline like image-to-image or an upscaler to improve the quality. + +Begin by inpainting an image: + +```py +import mindspore as ms +from mindone.diffusers import DiffusionPipeline, KandinskyV22InpaintCombinedPipeline, StableDiffusionInpaintPipeline +from mindone.diffusers.utils import load_image, make_image_grid + +pipeline = DiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-inpainting", mindspore_dtype=ms.float16, variant="fp16" +) + +# load base and mask image +init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png") +mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png") + +prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k" +image_inpainting = pipeline(prompt=prompt, image=init_image, mask_image=mask_image)[0][0] + +# resize image to 1024x1024 for SDXL +image_inpainting = image_inpainting.resize((1024, 1024)) +``` + +Now let's pass the image to another inpainting pipeline with SDXL's refiner model to enhance the image details and quality: + +```py +pipeline = StableDiffusionInpaintPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-refiner-1.0", mindspore_dtype=ms.float16, variant="fp16" +) + +image = pipeline(prompt=prompt, image=image_inpainting, mask_image=mask_image, output_type="latent")[0][0] +``` + +!!! tip + + It is important to specify `output_type="latent"` in the pipeline to keep all the outputs in latent space to avoid an unnecessary decode-encode step. This only works if the chained pipelines are using the same VAE. For example, in the [Text-to-image-to-inpaint](#text-to-image-to-inpaint.md) section, Kandinsky 2.2 uses a different VAE class than the Stable Diffusion model so it won't work. But if you use Stable Diffusion v1.5 for both pipelines, then you can keep everything in latent space because they both use [`AutoencoderKL`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/autoencoderkl/#mindone.diffusers.AutoencoderKL). + +Finally, you can pass this image to an image-to-image pipeline to put the finishing touches on it. + +```py +pipeline = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-refiner-1.0", mindspore_dtype=ms.float16, variant="fp16" +) + +image = pipeline(prompt=prompt, image=image)[0][0] +make_image_grid([init_image, mask_image, image_inpainting, image], rows=2, cols=2) +``` + +
+
+*(figure: initial image | inpaint | image-to-image)*
+
+ +Image-to-image and inpainting are actually very similar tasks. Image-to-image generates a new image that resembles the existing provided image. Inpainting does the same thing, but it only transforms the image area defined by the mask and the rest of the image is unchanged. You can think of inpainting as a more precise tool for making specific changes and image-to-image has a broader scope for making more sweeping changes. diff --git a/docs/diffusers/using-diffusers/ip_adapter.md b/docs/diffusers/using-diffusers/ip_adapter.md new file mode 100644 index 0000000000..4878613314 --- /dev/null +++ b/docs/diffusers/using-diffusers/ip_adapter.md @@ -0,0 +1,582 @@ + + +# IP-Adapter + +[IP-Adapter](https://hf.co/papers/2308.06721) is an image prompt adapter that can be plugged into diffusion models to enable image prompting without any changes to the underlying model. Furthermore, this adapter can be reused with other models finetuned from the same base model and it can be combined with other adapters like [ControlNet](../using-diffusers/controlnet.md). The key idea behind IP-Adapter is the *decoupled cross-attention* mechanism which adds a separate cross-attention layer just for image features instead of using the same cross-attention layer for both text and image features. This allows the model to learn more image-specific features. + +!!! tip + + Learn how to load an IP-Adapter in the [Load adapters](../using-diffusers/loading_adapters.md#ip-adapter) guide, and make sure you check out the [IP-Adapter Plus](../using-diffusers/loading_adapters.md#ip-adapter-plus) section which requires manually loading the image encoder. + +This guide will walk you through using IP-Adapter for various tasks and use cases. + +## General tasks + +Let's take a look at how to use IP-Adapter's image prompting capabilities with the [`StableDiffusionXLPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl/#mindone.diffusers.StableDiffusionXLPipeline) for tasks like text-to-image, image-to-image, and inpainting. We also encourage you to try out other pipelines such as Stable Diffusion, LCM-LoRA, ControlNet, T2I-Adapter, or AnimateDiff! + +In all the following examples, you'll see the [`set_ip_adapter_scale`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/ip_adapter/#mindone.diffusers.loaders.ip_adapter.IPAdapterMixin.set_ip_adapter_scale) method. This method controls the amount of text or image conditioning to apply to the model. A value of `1.0` means the model is only conditioned on the image prompt. Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt. Typically, a value of `0.5` achieves a good balance between the two prompt types and produces good results. + +!!! tip + + In the examples below, try adding `low_cpu_mem_usage=True` to the [`load_ip_adapter`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/ip_adapter/#mindone.diffusers.loaders.ip_adapter.IPAdapterMixin.load_ip_adapter) method to speed up the loading time. + +=== "Text-to-image" + + Crafting the precise text prompt to generate the image you want can be difficult because it may not always capture what you'd like to express. Adding an image alongside the text prompt helps the model better understand what it should generate and can lead to more accurate results. 
+ + Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`load_ip_adapter`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/ip_adapter/#mindone.diffusers.loaders.ip_adapter.IPAdapterMixin.load_ip_adapter) method. Use the `subfolder` parameter to load the SDXL model weights. + + ```py + from mindone.diffusers import StableDiffusionXLPipeline + from mindone.diffusers.utils import load_image + import mindspore as ms + import numpy as np + + pipeline = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16) + pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.safetensors") + pipeline.set_ip_adapter_scale(0.6) + ``` + + Create a text prompt and load an image prompt before passing them to the pipeline to generate an image. + + ```py + image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png") + generator = np.random.Generator(np.random.PCG64(0)) + images = pipeline( + prompt="a polar bear sitting in a chair drinking a milkshake", + ip_adapter_image=image, + negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality", + num_inference_steps=100, + generator=generator, + )[0] + images[0] + ``` + +
+
+    *(figure: IP-Adapter image | generated image)*
+
+ +=== "Image-to-image" + + IP-Adapter can also help with image-to-image by guiding the model to generate an image that resembles the original image and the image prompt. + + Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`load_ip_adapter`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/ip_adapter/#mindone.diffusers.loaders.ip_adapter.IPAdapterMixin.load_ip_adapter) method. Use the `subfolder` parameter to load the SDXL model weights. + + ```py + from mindone.diffusers import StableDiffusionXLImg2ImgPipeline + from mindone.diffusers.utils import load_image + import mindspore as ms + import numpy as np + + pipeline = StableDiffusionXLImg2ImgPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16) + pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.safetensors") + pipeline.set_ip_adapter_scale(0.6) + ``` + + Pass the original image and the IP-Adapter image prompt to the pipeline to generate an image. Providing a text prompt to the pipeline is optional, but in this example, a text prompt is used to increase image quality. + + ```py + image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_bear_1.png").resize((1470, 980)) + ip_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_bear_2.png").resize((1418, 1890)) + + generator = np.random.Generator(np.random.PCG64(4)) + images = pipeline( + prompt="best quality, high quality", + image=image, + ip_adapter_image=ip_image, + generator=generator, + strength=0.6, + )[0] + images[0] + ``` + +
+
+    *(figure: original image | IP-Adapter image | generated image)*
+
+ +=== "Inpainting" + + IP-Adapter is also useful for inpainting because the image prompt allows you to be much more specific about what you'd like to generate. + + Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`load_ip_adapter`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/ip_adapter/#mindone.diffusers.loaders.ip_adapter.IPAdapterMixin.load_ip_adapter) method. Use the `subfolder` parameter to load the SDXL model weights. + + ```py + from mindone.diffusers import StableDiffusionXLInpaintPipeline + from mindone.diffusers.utils import load_image + import mindspore as ms + import numpy as np + + pipeline = StableDiffusionXLInpaintPipeline.from_pretrained("diffusers/stable-diffusion-xl-1.0-inpainting-0.1", mindspore_dtype=ms.float16) + pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.safetensors") + pipeline.set_ip_adapter_scale(0.6) + ``` + + Pass a prompt, the original image, mask image, and the IP-Adapter image prompt to the pipeline to generate an image. + + ```py + mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_mask.png") + image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_bear_1.png") + ip_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_gummy.png") + + generator = np.random.Generator(np.random.PCG64(4)) + images = pipeline( + prompt="a cute gummy bear waving", + image=image, + mask_image=mask_image, + ip_adapter_image=ip_image, + generator=generator, + num_inference_steps=100, + )[0] + images[0] + ``` + +
+
+    *(figure: original image | IP-Adapter image | generated image)*
+
+ +=== "Video" + + IP-Adapter can also help you generate videos that are more aligned with your text prompt. For example, let's load [AnimateDiff](../api/pipelines/animatediff.md) with its motion adapter and insert an IP-Adapter into the model with the [`load_ip_adapter`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/ip_adapter/#mindone.diffusers.loaders.ip_adapter.IPAdapterMixin.load_ip_adapter) method. + + ```py + import mindspore as ms + from mindone.diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter + from mindone.diffusers.utils import export_to_gif + from mindone.diffusers.utils import load_image + import numpy as np + + adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", mindspore_dtype=ms.float16) + pipeline = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=adapter, mindspore_dtype=ms.float16) + scheduler = DDIMScheduler.from_pretrained( + "emilianJR/epiCRealism", + subfolder="scheduler", + clip_sample=False, + timestep_spacing="linspace", + beta_schedule="linear", + steps_offset=1, + ) + pipeline.scheduler = scheduler + + pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.safetensors") + ``` + + Pass a prompt and an image prompt to the pipeline to generate a short video. + + ```py + ip_adapter_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_inpaint.png") + + output = pipeline( + prompt="A cute gummy bear waving", + negative_prompt="bad quality, worse quality, low resolution", + ip_adapter_image=ip_adapter_image, + num_frames=16, + guidance_scale=7.5, + num_inference_steps=50, + generator=np.random.Generator(np.random.PCG64(0)), + ) + frames = output[0][0] + export_to_gif(frames, "gummy_bear.gif") + ``` + +
+
+    *(figure: IP-Adapter image | generated video)*
+
+ +## Configure parameters + +There are a couple of IP-Adapter parameters that are useful to know about and can help you with your image generation tasks. These parameters can make your workflow more efficient or give you more control over image generation. + +### Image embeddings + +IP-Adapter enabled pipelines provide the `ip_adapter_image_embeds` parameter to accept precomputed image embeddings. This is particularly useful in scenarios where you need to run the IP-Adapter pipeline multiple times because you have more than one image. For example, [multi IP-Adapter](#multi-ip-adapter) is a specific use case where you provide multiple styling images to generate a specific image in a specific style. Loading and encoding multiple images each time you use the pipeline would be inefficient. Instead, you can precompute and save the image embeddings to disk (which can save a lot of space if you're using high-quality images) and load them when you need them. + +Call the [`prepare_ip_adapter_image_embeds`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/text2img/#mindone.diffusers.StableDiffusionPipeline.get_guidance_scale_embedding) method to encode and generate the image embeddings. Then you can load the image embeddings by passing them to the `ip_adapter_image_embeds` parameter. + +!!! tip + + If you're using IP-Adapter with `ip_adapter_image_embedding` instead of `ip_adapter_image`', you can set `load_ip_adapter(image_encoder_folder=None,...)` because you don't need to load an encoder to generate the image embeddings. + +```py +image_embeds = pipeline.prepare_ip_adapter_image_embeds( + ip_adapter_image=image, + ip_adapter_image_embeds=None, + num_images_per_prompt=1, + do_classifier_free_guidance=True, +) + +images = pipeline( + prompt="a polar bear sitting in a chair drinking a milkshake", + ip_adapter_image_embeds=image_embeds, + negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality", + num_inference_steps=100, + generator=generator, +)[0] +``` + +## Specific use cases + +IP-Adapter's image prompting and compatibility with other adapters and models makes it a versatile tool for a variety of use cases. This section covers some of the more popular applications of IP-Adapter, and we can't wait to see what you come up with! + +### Face model + +Generating accurate faces is challenging because they are complex and nuanced. Diffusers supports two IP-Adapter checkpoints specifically trained to generate faces from the [h94/IP-Adapter](https://huggingface.co/h94/IP-Adapter) repository: + +* [ip-adapter-full-face_sd15.safetensors](https://huggingface.co/h94/IP-Adapter/blob/main/models/ip-adapter-full-face_sd15.safetensors) is conditioned with images of cropped faces and removed backgrounds +* [ip-adapter-plus-face_sd15.safetensors](https://huggingface.co/h94/IP-Adapter/blob/main/models/ip-adapter-plus-face_sd15.safetensors) uses patch embeddings and is conditioned with images of cropped faces + +Additionally, Diffusers supports all IP-Adapter checkpoints trained with face embeddings extracted by `insightface` face models. Supported models are from the [h94/IP-Adapter-FaceID](https://huggingface.co/h94/IP-Adapter-FaceID) repository. + +For face models, use the [h94/IP-Adapter](https://huggingface.co/h94/IP-Adapter) checkpoint. 
It is also recommended to use [`DDIMScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/ddim/#mindone.diffusers.DDIMScheduler) or [`EulerDiscreteScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/euler/#mindone.diffusers.EulerDiscreteScheduler) for face models.
+
+```py
+import mindspore as ms
+from mindone.diffusers import StableDiffusionPipeline, DDIMScheduler
+from mindone.diffusers.utils import load_image
+import numpy as np
+
+pipeline = StableDiffusionPipeline.from_pretrained(
+    "stable-diffusion-v1-5/stable-diffusion-v1-5",
+    mindspore_dtype=ms.float16,
+)
+pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
+pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter-full-face_sd15.safetensors")
+
+pipeline.set_ip_adapter_scale(0.5)
+
+image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_einstein_base.png")
+generator = np.random.Generator(np.random.PCG64(26))
+
+image = pipeline(
+    prompt="A photo of Einstein as a chef, wearing an apron, cooking in a French restaurant",
+    ip_adapter_image=image,
+    negative_prompt="lowres, bad anatomy, worst quality, low quality",
+    num_inference_steps=100,
+    generator=generator,
+)[0][0]
+image
+```
+
+
+*(figure: IP-Adapter image | generated image)*
+
+ +### Multi IP-Adapter + +More than one IP-Adapter can be used at the same time to generate specific images in more diverse styles. For example, you can use IP-Adapter-Face to generate consistent faces and characters, and IP-Adapter Plus to generate those faces in a specific style. + +!!! tip + + Read the [IP-Adapter Plus](../using-diffusers/loading_adapters.md#ip-adapter-plus) section to learn why you need to manually load the image encoder. + +Load the image encoder with [`~transformers.CLIPVisionModelWithProjection`]. + +```py +import mindspore as ms +from mindone.diffusers import StableDiffusionXLPipeline, DDIMScheduler +from mindone.transformers import CLIPVisionModelWithProjection +from mindone.diffusers.utils import load_image +import numpy as np + +image_encoder = CLIPVisionModelWithProjection.from_pretrained( + "h94/IP-Adapter", + subfolder="models/image_encoder", + mindspore_dtype=ms.float16, +) +``` + +Next, you'll load a base model, scheduler, and the IP-Adapters. The IP-Adapters to use are passed as a list to the `weight_name` parameter: + +* [ip-adapter-plus_sdxl_vit-h](https://huggingface.co/h94/IP-Adapter#ip-adapter-for-sdxl-10) uses patch embeddings and a ViT-H image encoder +* [ip-adapter-plus-face_sdxl_vit-h](https://huggingface.co/h94/IP-Adapter#ip-adapter-for-sdxl-10) has the same architecture but it is conditioned with images of cropped faces + +```py +pipeline = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + mindspore_dtype=ms.float16, + image_encoder=image_encoder, +) +pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config) +pipeline.load_ip_adapter( + "h94/IP-Adapter", + subfolder="sdxl_models", + weight_name=["ip-adapter-plus_sdxl_vit-h.safetensors", "ip-adapter-plus-face_sdxl_vit-h.safetensors"] +) +pipeline.set_ip_adapter_scale([0.7, 0.3]) +``` + +Load an image prompt and a folder containing images of a certain style you want to use. + +```py +face_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/women_input.png") +style_folder = "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/style_ziggy" +style_images = [load_image(f"{style_folder}/img{i}.png") for i in range(10)] +``` + +
+
+*(figure: IP-Adapter image of face | IP-Adapter style images)*
+
+ +Pass the image prompt and style images as a list to the `ip_adapter_image` parameter, and run the pipeline! + +```py +generator = np.random.Generator(np.random.PCG64(0)) + +image = pipeline( + prompt="wonderwoman", + ip_adapter_image=[style_images, face_image], + negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality", + num_inference_steps=50, num_images_per_prompt=1, + generator=generator, +)[0][0] +image +``` + +
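+
+If you plan to generate several variations with the same face and style images, encoding them on every call is wasteful. As described in the [Image embeddings](#image-embeddings) section, you can precompute the embeddings once and reuse them. The sketch below assumes the same `prepare_ip_adapter_image_embeds` call signature shown there, reuses the `pipeline`, `face_image`, and `style_images` objects created above, and the second prompt is only an illustrative example.
+
+```py
+# Encode the style images and the face image once...
+image_embeds = pipeline.prepare_ip_adapter_image_embeds(
+    ip_adapter_image=[style_images, face_image],
+    ip_adapter_image_embeds=None,
+    num_images_per_prompt=1,
+    do_classifier_free_guidance=True,
+)
+
+# ...then reuse the precomputed embeddings for as many prompts as you like.
+for prompt in ["wonderwoman", "a female superhero, cinematic lighting"]:
+    image = pipeline(
+        prompt=prompt,
+        ip_adapter_image_embeds=image_embeds,
+        negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
+        num_inference_steps=50,
+        generator=np.random.Generator(np.random.PCG64(0)),
+    )[0][0]
+```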
+ +### Instant generation + +[Latent Consistency Models (LCM)](../using-diffusers/inference_with_lcm_lora.md) are diffusion models that can generate images in as little as 4 steps compared to other diffusion models like SDXL that typically require way more steps. This is why image generation with an LCM feels "instantaneous". IP-Adapters can be plugged into an LCM-LoRA model to instantly generate images with an image prompt. + +The IP-Adapter weights need to be loaded first, then you can use [`load_lora_weights`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/text2img/#mindone.diffusers.StableDiffusionPipeline) to load the LoRA style and weight you want to apply to your image. + +```py +from mindone.diffusers import DiffusionPipeline, LCMScheduler +import mindspore as ms +from mindone.diffusers.utils import load_image +import numpy as np + +model_id = "sd-dreambooth-library/herge-style" +lcm_lora_id = "latent-consistency/lcm-lora-sdv1-5" + +pipeline = DiffusionPipeline.from_pretrained(model_id, mindspore_dtype=ms.float16) + +pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.safetensors") +pipeline.load_lora_weights(lcm_lora_id) +pipeline.scheduler = LCMScheduler.from_config(pipeline.scheduler.config) +``` + +Try using with a lower IP-Adapter scale to condition image generation more on the [herge_style](https://huggingface.co/sd-dreambooth-library/herge-style) checkpoint, and remember to use the special token `herge_style` in your prompt to trigger and apply the style. + +```py +pipeline.set_ip_adapter_scale(0.4) + +prompt = "herge_style woman in armor, best quality, high quality" +generator = np.random.Generator(np.random.PCG64(0)) + +ip_adapter_image = load_image("https://user-images.githubusercontent.com/24734142/266492875-2d50d223-8475-44f0-a7c6-08b51cb53572.png") +image = pipeline( + prompt=prompt, + ip_adapter_image=ip_adapter_image, + num_inference_steps=4, + guidance_scale=1, +)[0][0] +image +``` + +
+ +### Structural control + +To control image generation to an even greater degree, you can combine IP-Adapter with a model like [ControlNet](../using-diffusers/controlnet.md). A ControlNet is also an adapter that can be inserted into a diffusion model to allow for conditioning on an additional control image. The control image can be depth maps, edge maps, pose estimations, and more. + +Load a [`ControlNetModel`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/controlnet/#mindone.diffusers.ControlNetModel) checkpoint conditioned on depth maps, insert it into a diffusion model, and load the IP-Adapter. + +```py +from mindone.diffusers import StableDiffusionControlNetPipeline, ControlNetModel +import mindspore as ms +from mindone.diffusers.utils import load_image +import numpy as np + +controlnet_model_path = "lllyasviel/control_v11f1p_sd15_depth" +controlnet = ControlNetModel.from_pretrained(controlnet_model_path, mindspore_dtype=ms.float16) + +pipeline = StableDiffusionControlNetPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", controlnet=controlnet, mindspore_dtype=ms.float16) +pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.safetensors") +``` + +Now load the IP-Adapter image and depth map. + +```py +ip_adapter_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/statue.png") +depth_map = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/depth.png") +``` + +
+
+*(figure: IP-Adapter image | depth map)*
+
+ +Pass the depth map and IP-Adapter image to the pipeline to generate an image. + +```py +generator = np.random.Generator(np.random.PCG64(33)) +image = pipeline( + prompt="best quality, high quality", + image=depth_map, + ip_adapter_image=ip_adapter_image, + negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality", + num_inference_steps=50, + generator=generator, +)[0][0] +image +``` + +
+ +### Style & layout control + +[InstantStyle](https://arxiv.org/abs/2404.02733) is a plug-and-play method on top of IP-Adapter, which disentangles style and layout from image prompt to control image generation. This way, you can generate images following only the style or layout from image prompt, with significantly improved diversity. This is achieved by only activating IP-Adapters to specific parts of the model. + +By default IP-Adapters are inserted to all layers of the model. Use the [`set_ip_adapter_scale`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/ip_adapter/#mindone.diffusers.loaders.ip_adapter.IPAdapterMixin.set_ip_adapter_scale) method with a dictionary to assign scales to IP-Adapter at different layers. + +```py +from mindone.diffusers import StableDiffusionXLPipeline +from mindone.diffusers.utils import load_image +import mindspore as ms +import numpy as np + +pipeline = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16) +pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.safetensors") + +scale = { + "down": {"block_2": [0.0, 1.0]}, + "up": {"block_0": [0.0, 1.0, 0.0]}, +} +pipeline.set_ip_adapter_scale(scale) +``` + +This will activate IP-Adapter at the second layer in the model's down-part block 2 and up-part block 0. The former is the layer where IP-Adapter injects layout information and the latter injects style. Inserting IP-Adapter to these two layers you can generate images following both the style and layout from image prompt, but with contents more aligned to text prompt. + +```py +style_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg") + +generator = np.random.Generator(np.random.PCG64(26)) +image = pipeline( + prompt="a cat, masterpiece, best quality, high quality", + ip_adapter_image=style_image, + negative_prompt="text, watermark, lowres, low quality, worst quality, deformed, glitch, low contrast, noisy, saturation, blurry", + guidance_scale=5, + num_inference_steps=30, + generator=generator, +)[0][0] +image +``` + +
+
+*(figure: IP-Adapter image | generated image)*
+
+ +In contrast, inserting IP-Adapter to all layers will often generate images that overly focus on image prompt and diminish diversity. + +Activate IP-Adapter only in the style layer and then call the pipeline again. + +```py +scale = { + "up": {"block_0": [0.0, 1.0, 0.0]}, +} +pipeline.set_ip_adapter_scale(scale) + +generator = np.random.Generator(np.random.PCG64(26)) +image = pipeline( + prompt="a cat, masterpiece, best quality, high quality", + ip_adapter_image=style_image, + negative_prompt="text, watermark, lowres, low quality, worst quality, deformed, glitch, low contrast, noisy, saturation, blurry", + guidance_scale=5, + num_inference_steps=30, + generator=generator, +)[0][0] +image +``` + +
+
+*(figure: IP-Adapter only in style layer | IP-Adapter in all layers)*
+
+ +Note that you don't have to specify all layers in the dictionary. Those not included in the dictionary will be set to scale 0 which means disable IP-Adapter by default. diff --git a/docs/diffusers/using-diffusers/kandinsky.md b/docs/diffusers/using-diffusers/kandinsky.md new file mode 100644 index 0000000000..88618f5e5d --- /dev/null +++ b/docs/diffusers/using-diffusers/kandinsky.md @@ -0,0 +1,673 @@ + + +# Kandinsky + +The Kandinsky models are a series of multilingual text-to-image generation models. The Kandinsky 2.0 model uses two multilingual text encoders and concatenates those results for the UNet. + +[Kandinsky 2.1](../api/pipelines/kandinsky.md) changes the architecture to include an image prior model ([`CLIP`](https://huggingface.co/docs/transformers/model_doc/clip)) to generate a mapping between text and image embeddings. The mapping provides better text-image alignment and it is used with the text embeddings during training, leading to higher quality results. Finally, Kandinsky 2.1 uses a [Modulating Quantized Vectors (MoVQ)](https://huggingface.co/papers/2209.09002) decoder - which adds a spatial conditional normalization layer to increase photorealism - to decode the latents into images. + +[Kandinsky 2.2](../api/pipelines/kandinsky_v22.md) improves on the previous model by replacing the image encoder of the image prior model with a larger CLIP-ViT-G model to improve quality. The image prior model was also retrained on images with different resolutions and aspect ratios to generate higher-resolution images and different image sizes. + +[Kandinsky 3](../api/pipelines/kandinsky3.md) simplifies the architecture and shifts away from the two-stage generation process involving the prior model and diffusion model. Instead, Kandinsky 3 uses [Flan-UL2](https://huggingface.co/google/flan-ul2) to encode text, a UNet with [BigGan-deep](https://hf.co/papers/1809.11096) blocks, and [Sber-MoVQGAN](https://github.com/ai-forever/MoVQGAN) to decode the latents into images. Text understanding and generated image quality are primarily achieved by using a larger text encoder and UNet. + +This guide will show you how to use the Kandinsky models for text-to-image, image-to-image, inpainting, interpolation, and more. + +Before you begin, make sure you have the following libraries installed: + +```py +# uncomment to install the necessary libraries +#!pip install mindone transformers +``` + +!!! warning + + Kandinsky 2.1 and 2.2 usage is very similar! The only difference is Kandinsky 2.2 doesn't accept `prompt` as an input when decoding the latents. Instead, Kandinsky 2.2 only accepts `image_embeds` during decoding. + + Kandinsky 3 has a more concise architecture and it doesn't require a prior model. This means it's usage is identical to other diffusion models like [Stable Diffusion XL](sdxl.md). + + Additionally, Kandinsky 3 has precision issues now. Please refer to the [Limitation](../limitations.md) for further details. + +## Text-to-image + +To use the Kandinsky models for any task, you always start by setting up the prior pipeline to encode the prompt and generate the image embeddings. The prior pipeline also generates `negative_image_embeds` that correspond to the negative prompt `""`. For better results, you can pass an actual `negative_prompt` to the prior pipeline, but this'll increase the effective batch size of the prior pipeline by 2x. 
+ +=== "Kandinsky 2.1" + + ```py + from mindone.diffusers import KandinskyPriorPipeline, KandinskyPipeline + import mindspore as ms + + prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", mindspore_dtype=ms.float16) + pipeline = KandinskyPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", mindspore_dtype=ms.float16) + + prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting" + negative_prompt = "low quality, bad quality" # optional to include a negative prompt, but results are usually better + image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt, guidance_scale=1.0) + ``` + + Now pass all the prompts and embeddings to the [`KandinskyPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/kandinsky/#mindone.diffusers.KandinskyPipeline) to generate an image: + + ```py + image = pipeline(prompt, image_embeds=image_embeds, negative_prompt=negative_prompt, negative_image_embeds=negative_image_embeds, height=768, width=768)[0][0] + image + ``` + +
+ +
+
+=== "Kandinsky 2.2"
+
+    ```py
+    from mindone.diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline
+    import mindspore as ms
+
+    prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", mindspore_dtype=ms.float16)
+    pipeline = KandinskyV22Pipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", mindspore_dtype=ms.float16)
+
+    prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
+    negative_prompt = "low quality, bad quality" # optional to include a negative prompt, but results are usually better
+    image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt, guidance_scale=1.0)
+    ```
+
+    Pass the `image_embeds` and `negative_image_embeds` to the [`KandinskyV22Pipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/kandinsky_v22/#mindone.diffusers.KandinskyV22Pipeline) to generate an image:
+
+    ```py
+    image = pipeline(image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768)[0][0]
+    image
+    ```
+
+ +
+ +=== "Kandinsky 3" + + Kandinsky 3 doesn't require a prior model so you can directly load the [`Kandinsky3Pipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/kandinsky3/#mindone.diffusers.Kandinsky3Pipeline) and pass a prompt to generate an image: + + ```py + from mindone.diffusers import Kandinsky3Pipeline + import mindspore as ms + + pipeline = Kandinsky3Pipeline.from_pretrained("kandinsky-community/kandinsky-3", variant="fp16", mindspore_dtype=ms.float16) + + prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting" + image = pipeline(prompt)[0][0] + image + ``` + +🤗 Diffusers also provides an end-to-end API with the [`KandinskyCombinedPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/kandinsky/#mindone.diffusers.KandinskyCombinedPipeline) and [`KandinskyV22CombinedPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/kandinsky_v22/#mindone.diffusers.KandinskyV22CombinedPipeline), meaning you don't have to separately load the prior and text-to-image pipeline. The combined pipeline automatically loads both the prior model and the decoder. You can still set different values for the prior pipeline with the `prior_guidance_scale` and `prior_num_inference_steps` parameters if you want. + +=== "Kandinsky 2.1" + + ```py + from mindone.diffusers import KandinskyCombinedPipeline + import mindspore as ms + + pipeline = KandinskyCombinedPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", mindspore_dtype=ms.float16) + + prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting" + negative_prompt = "low quality, bad quality" + + image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale=4.0, height=768, width=768)[0][0] + image + ``` + +=== "Kandinsky 2.2" + + ```py + from mindone.diffusers import KandinskyV22CombinedPipeline + import mindspore as ms + + pipeline = KandinskyV22CombinedPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", mindspore_dtype=ms.float16) + + prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting" + negative_prompt = "low quality, bad quality" + + image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale=4.0, height=768, width=768)[0][0] + image + ``` + +## Image-to-image + +For image-to-image, pass the initial image and text prompt to condition the image to the pipeline. 
Start by loading the prior pipeline: + +=== "Kandinsky 2.1" + + ```py + import mindspore as ms + from mindone.diffusers import KandinskyImg2ImgPipeline, KandinskyPriorPipeline + + prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", mindspore_dtype=ms.float16, use_safetensors=True) + pipeline = KandinskyImg2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", mindspore_dtype=ms.float16, use_safetensors=True) + ``` + +=== "Kandinsky 2.2" + + ```py + import mindspore as ms + from mindone.diffusers import KandinskyV22Img2ImgPipeline, KandinskyPriorPipeline + + prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", mindspore_dtype=ms.float16, use_safetensors=True) + pipeline = KandinskyV22Img2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", mindspore_dtype=ms.float16, use_safetensors=True) + ``` + +=== "Kandinsky 3" + + Kandinsky 3 doesn't require a prior model so you can directly load the image-to-image pipeline: + + ```py + from mindone.diffusers import Kandinsky3Img2ImgPipeline + from mindone.diffusers.utils import load_image + import mindspore as ms + + pipeline = Kandinsky3Img2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-3", variant="fp16", mindspore_dtype=ms.float16) + ``` + +Download an image to condition on: + +```py +from mindone.diffusers.utils import load_image + +# download image +url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" +original_image = load_image(url) +original_image = original_image.resize((768, 512)) +``` + +
+ +
+ +Generate the `image_embeds` and `negative_image_embeds` with the prior pipeline: + +```py +prompt = "A fantasy landscape, Cinematic lighting" +negative_prompt = "low quality, bad quality" + +image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt) +``` + +Now pass the original image, and all the prompts and embeddings to the pipeline to generate an image: + +=== "Kandinsky 2.1" + + ```py + from mindone.diffusers.utils import make_image_grid + + image = pipeline(prompt, negative_prompt=negative_prompt, image=original_image, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768, strength=0.3)[0][0] + make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2) + ``` + +
+ +
+ +=== "Kandinsky 2.2" + + ```py + from mindone.diffusers.utils import make_image_grid + + image = pipeline(image=original_image, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768, strength=0.3)[0][0] + make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2) + ``` + +
+ +
+ +=== "Kandinsky 3" + + ```py + image = pipeline(prompt, negative_prompt=negative_prompt, image=original_image, strength=0.75, num_inference_steps=25)[0][0] + image + ``` + +🤗 Diffusers also provides an end-to-end API with the [`KandinskyImg2ImgCombinedPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/kandinsky/#mindone.diffusers.KandinskyImg2ImgCombinedPipeline) and [`KandinskyV22Img2ImgCombinedPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/kandinsky_v22/#mindone.diffusers.KandinskyV22Img2ImgCombinedPipeline), meaning you don't have to separately load the prior and image-to-image pipeline. The combined pipeline automatically loads both the prior model and the decoder. You can still set different values for the prior pipeline with the `prior_guidance_scale` and `prior_num_inference_steps` parameters if you want. + +=== "Kandinsky 2.1" + + ```py + from mindone.diffusers import KandinskyImg2ImgCombinedPipeline + from mindone.diffusers.utils import make_image_grid, load_image + import mindspore as ms + + pipeline = KandinskyImg2ImgCombinedPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", mindspore_dtype=ms.float16, use_safetensors=True) + + prompt = "A fantasy landscape, Cinematic lighting" + negative_prompt = "low quality, bad quality" + + url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" + original_image = load_image(url) + + original_image.thumbnail((768, 768)) + + image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=original_image, strength=0.3)[0][0] + make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2) + ``` + +=== "Kandinsky 2.2" + + ```py + from mindone.diffusers import KandinskyV22Img2ImgCombinedPipeline + from mindone.diffusers.utils import make_image_grid, load_image + import mindspore as ms + + pipeline = KandinskyV22Img2ImgCombinedPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", mindspore_dtype=ms.float16) + + prompt = "A fantasy landscape, Cinematic lighting" + negative_prompt = "low quality, bad quality" + + url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" + original_image = load_image(url) + + original_image.thumbnail((768, 768)) + + image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=original_image, strength=0.3)[0][0] + make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2) + ``` + +## Inpainting + +!!! warning + + ⚠️ The Kandinsky models use ⬜️ **white pixels** to represent the masked area now instead of black pixels. If you are using [`KandinskyInpaintPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/kandinsky/#mindone.diffusers.KandinskyInpaintPipeline) in production, you need to change the mask to use white pixels: + + ```py + # For PIL input + import PIL.ImageOps + mask = PIL.ImageOps.invert(mask) + + # For MindSpore and NumPy input + mask = 1 - mask + ``` + +For inpainting, you'll need the original image, a mask of the area to replace in the original image, and a text prompt of what to inpaint. 
Load the prior pipeline: + +=== "Kandinsky 2.1" + + ```py + from mindone.diffusers import KandinskyInpaintPipeline, KandinskyPriorPipeline + from mindone.diffusers.utils import load_image, make_image_grid + import mindspore as ms + import numpy as np + from PIL import Image + + prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", mindspore_dtype=ms.float16, use_safetensors=True) + pipeline = KandinskyInpaintPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-inpaint", mindspore_dtype=ms.float16, use_safetensors=True) + ``` + +=== "Kandinsky 2.2" + + ```py + from mindone.diffusers import KandinskyV22InpaintPipeline, KandinskyV22PriorPipeline + from mindone.diffusers.utils import load_image, make_image_grid + import mindspore as ms + import numpy as np + from PIL import Image + + prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", mindspore_dtype=ms.float16, use_safetensors=True) + pipeline = KandinskyV22InpaintPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder-inpaint", mindspore_dtype=ms.float16, use_safetensors=True) + ``` + +Load an initial image and create a mask: + +```py +init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png") +mask = np.zeros((768, 768), dtype=np.float32) +# mask area above cat's head +mask[:250, 250:-250] = 1 +``` + +Generate the embeddings with the prior pipeline: + +```py +prompt = "a hat" +image_emb, zero_image_emb = prior_pipeline(prompt) +``` + +Now pass the initial image, mask, and prompt and embeddings to the pipeline to generate an image: + +=== "Kandinsky 2.1" + + ```py + output_image = pipeline( + prompt, + image=init_image, + mask_image=mask, + image_embeds=image_emb, + negative_image_embeds=zero_image_emb, + height=768, + width=768, + num_inference_steps=150 + )[0][0] + mask = Image.fromarray((mask*255).astype('uint8'), 'L') + make_image_grid([init_image, mask, output_image], rows=1, cols=3) + ``` + +
+ +
+ +=== "Kandinsky 2.2" + + ```py + output_image = pipeline( + image=init_image, + mask_image=mask, + image_embeds=image_emb, + negative_image_embeds=zero_image_emb, + height=768, + width=768, + num_inference_steps=150 + )[0][0] + mask = Image.fromarray((mask*255).astype('uint8'), 'L') + make_image_grid([init_image, mask, output_image], rows=1, cols=3) + ``` + +
+ +
+ +You can also use the end-to-end [`KandinskyInpaintCombinedPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/kandinsky/#mindone.diffusers.KandinskyInpaintCombinedPipeline) and [`KandinskyV22InpaintCombinedPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/kandinsky_v22/#mindone.diffusers.KandinskyV22InpaintPipeline) to call the prior and decoder pipelines together under the hood. + +=== "Kandinsky 2.1" + + ```py + import mindspore as ms + import numpy as np + from PIL import Image + from mindone.diffusers import KandinskyInpaintCombinedPipeline + from mindone.diffusers.utils import load_image, make_image_grid + + pipe = KandinskyInpaintCombinedPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-inpaint", mindspore_dtype=ms.float16) + + init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png") + mask = np.zeros((768, 768), dtype=np.float32) + # mask area above cat's head + mask[:250, 250:-250] = 1 + prompt = "a hat" + + output_image = pipe(prompt=prompt, image=init_image, mask_image=mask)[0][0] + mask = Image.fromarray((mask*255).astype('uint8'), 'L') + make_image_grid([init_image, mask, output_image], rows=1, cols=3) + ``` + +=== "Kandinsky 2.2" + + ```py + import mindspore as ms + import numpy as np + from PIL import Image + from mindone.diffusers import KandinskyV22InpaintCombinedPipeline + from mindone.diffusers.utils import load_image, make_image_grid + + pipe = KandinskyV22InpaintCombinedPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder-inpaint", mindspore_dtype=ms.float16) + + init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png") + mask = np.zeros((768, 768), dtype=np.float32) + # mask area above cat's head + mask[:250, 250:-250] = 1 + prompt = "a hat" + + output_image = pipe(prompt=prompt, image=init_image, mask_image=mask)[0][0] + mask = Image.fromarray((mask*255).astype('uint8'), 'L') + make_image_grid([init_image, mask, output_image], rows=1, cols=3) + ``` + +## Interpolation + +Interpolation allows you to explore the latent space between the image and text embeddings which is a cool way to see some of the prior model's intermediate outputs. 
Load the prior pipeline and two images you'd like to interpolate: + +=== "Kandinsky 2.1" + + ```py + from mindone.diffusers import KandinskyPriorPipeline, KandinskyPipeline + from mindone.diffusers.utils import load_image, make_image_grid + import mindspore as ms + + prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", mindspore_dtype=ms.float16, use_safetensors=True) + img_1 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png") + img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/starry_night.jpeg") + make_image_grid([img_1.resize((512,512)), img_2.resize((512,512))], rows=1, cols=2) + ``` + +=== "Kandinsky 2.2" + + ```py + from mindone.diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline + from mindone.diffusers.utils import load_image, make_image_grid + import mindspore as ms + + prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", mindspore_dtype=ms.float16, use_safetensors=True) + img_1 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png") + img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/starry_night.jpeg") + make_image_grid([img_1.resize((512,512)), img_2.resize((512,512))], rows=1, cols=2) + ``` + +
+
+*a cat*
+
+*Van Gogh's Starry Night painting*
+
+ +Specify the text or images to interpolate, and set the weights for each text or image. Experiment with the weights to see how they affect the interpolation! + +```py +images_texts = ["a cat", img_1, img_2] +weights = [0.3, 0.3, 0.4] +``` + +Call the `interpolate` function to generate the embeddings, and then pass them to the pipeline to generate the image: + +=== "Kandinsky 2.1" + + ```py + # prompt can be left empty + prompt = "" + image_embeds, negative_image_embeds = prior_pipeline.interpolate(images_texts, weights) + + pipeline = KandinskyPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", mindspore_dtype=ms.float16, use_safetensors=True) + + image = pipeline(prompt, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768)[0][0] + image + ``` + +
+ +
+ +=== "Kandinsky 2.2" + + ```py + # prompt can be left empty + prompt = "" + image_embeds, negative_image_embeds = prior_pipeline.interpolate(images_texts, weights) + + pipeline = KandinskyV22Pipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", mindspore_dtype=ms.float16, use_safetensors=True) + + image = pipeline(image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768)[0][0] + image + ``` + +
+ +
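+To build an intuition for how the weights shift the result, you can also sweep them in a loop and compare the outputs. The snippet below is only a rough sketch (it is not part of the original example and the weight values are arbitrary); it reuses `prior_pipeline`, `pipeline`, `img_1`, and `img_2` from the Kandinsky 2.2 tab above and keeps the text weight fixed while shifting the balance between the two images:
+
+```py
+# Sweep the balance between img_1 and img_2 while keeping the "a cat" text weight at 0.3.
+blended_images = []
+for img_2_weight in [0.2, 0.4, 0.6]:
+    weights = [0.3, 0.7 - img_2_weight, img_2_weight]  # weights for ["a cat", img_1, img_2]
+    image_embeds, negative_image_embeds = prior_pipeline.interpolate(["a cat", img_1, img_2], weights)
+    blended_images.append(
+        pipeline(image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768)[0][0]
+    )
+```
+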
+ +## ControlNet + +!!! warning + + ⚠️ ControlNet is only supported for Kandinsky 2.2! + + ⚠️ MindONE currently does not support the full process for extracting the depth map, as MindONE does not yet support depth-estimation [~transformers.Pipeline] from mindone.transformers. Therefore, you need to prepare the depth map in advance to continue the process. + +ControlNet enables conditioning large pretrained diffusion models with additional inputs such as a depth map or edge detection. For example, you can condition Kandinsky 2.2 with a depth map so the model understands and preserves the structure of the depth image. + +Let's load an image and extract it's depth map: + +```py +from mindone.diffusers.utils import load_image + +img = load_image( + "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png" +).resize((768, 768)) +img +``` + +
+ +
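+As noted in the warning above, MindONE does not run the depth-estimation step for you, so the depth map has to be produced ahead of time (for example with a depth-estimation model in another framework) and loaded manually. A minimal sketch of loading such a precomputed map into the `depth_image` array used below might look like this (the file name is hypothetical):
+
+```py
+import numpy as np
+from PIL import Image
+
+# Hypothetical file: a depth map computed offline and saved as an 8-bit grayscale image.
+depth_image = np.array(Image.open("cat_depth_map.png").convert("L").resize((768, 768)))
+```
+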
+ +Then you can process and retrieve the depth map you prepared in advance: + +```py +import mindspore as ms +import numpy as np + +def make_hint(depth_image): + image = depth_image[:, :, None] + image = np.concatenate([image, image, image], axis=2) + detected_map = ms.Tensor.from_numpy(image).float() / 255.0 + hint = detected_map.permute(2, 0, 1) + return hint + +hint = make_hint(depth_image).unsqueeze(0).half() +``` + +### Text-to-image + +Load the prior pipeline and the [`KandinskyV22ControlnetPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/kandinsky_v22/#mindone.diffusers.KandinskyV22ControlnetPipeline): + +```py +from mindone.diffusers import KandinskyV22PriorPipeline, KandinskyV22ControlnetPipeline +import mindspore as ms +import numpy as np + +prior_pipeline = KandinskyV22PriorPipeline.from_pretrained( + "kandinsky-community/kandinsky-2-2-prior", mindspore_dtype=ms.float16, use_safetensors=True +) + +pipeline = KandinskyV22ControlnetPipeline.from_pretrained( + "kandinsky-community/kandinsky-2-2-controlnet-depth", revision="refs/pr/7", mindspore_dtype=ms.float16 +) +``` + +Generate the image embeddings from a prompt and negative prompt: + +```py +prompt = "A robot, 4k photo" +negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature" + +generator = np.random.Generator(np.random.PCG64(43)) + +image_emb, zero_image_emb = prior_pipeline( + prompt=prompt, negative_prompt=negative_prior_prompt, generator=generator +) +``` + +Finally, pass the image embeddings and the depth image to the [`KandinskyV22ControlnetPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/kandinsky_v22/#mindone.diffusers.KandinskyV22ControlnetPipeline) to generate an image: + +```py +image = pipeline(image_embeds=image_emb, negative_image_embeds=zero_image_emb, hint=hint, num_inference_steps=50, generator=generator, height=768, width=768)[0][0] +image +``` + +
+ +
+ +### Image-to-image + +For image-to-image with ControlNet, you'll need to use the: + +- [`KandinskyV22PriorEmb2EmbPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/kandinsky_v22/#mindone.diffusers.KandinskyV22PriorEmb2EmbPipeline) to generate the image embeddings from a text prompt and an image +- [`KandinskyV22ControlnetImg2ImgPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/kandinsky_v22/#mindone.diffusers.KandinskyV22ControlnetImg2ImgPipeline) to generate an image from the initial image and the image embeddings + +Process the depth map extracted from the initial image of a cat, which you prepared in advance. + +```py +import mindspore as ms +import numpy as np + +from mindone.diffusers import KandinskyV22PriorEmb2EmbPipeline, KandinskyV22ControlnetImg2ImgPipeline +from mindone.diffusers.utils import load_image + +img = load_image( + "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png" +).resize((768, 768)) + +def make_hint(depth_image): + image = depth_image[:, :, None] + image = np.concatenate([image, image, image], axis=2) + detected_map = ms.Tensor.from_numpy(image).float() / 255.0 + hint = detected_map.permute(2, 0, 1) + return hint + +hint = make_hint(depth_image).unsqueeze(0).half() +``` + +Load the prior pipeline and the [`KandinskyV22ControlnetImg2ImgPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/kandinsky_v22/#mindone.diffusers.KandinskyV22ControlnetImg2ImgPipeline): + +```py +prior_pipeline = KandinskyV22PriorEmb2EmbPipeline.from_pretrained( + "kandinsky-community/kandinsky-2-2-prior", mindspore_dtype=ms.float16, use_safetensors=True +) + +pipeline = KandinskyV22ControlnetImg2ImgPipeline.from_pretrained( + "kandinsky-community/kandinsky-2-2-controlnet-depth", revision="refs/pr/7", mindspore_dtype=ms.float16 +) +``` + +Pass a text prompt and the initial image to the prior pipeline to generate the image embeddings: + +```py +prompt = "A robot, 4k photo" +negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature" + +generator = np.random.Generator(np.random.PCG64(43)) + +img_emb = prior_pipeline(prompt=prompt, image=img, strength=0.85, generator=generator) +negative_emb = prior_pipeline(prompt=negative_prior_prompt, image=img, strength=1, generator=generator) +``` + +Now you can run the [`KandinskyV22ControlnetImg2ImgPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/kandinsky_v22/#mindone.diffusers.KandinskyV22ControlnetImg2ImgPipeline) to generate an image from the initial image and the image embeddings: + +```py +image = pipeline(image=img, strength=0.5, image_embeds=img_emb[0], negative_image_embeds=negative_emb[0], hint=hint, num_inference_steps=50, generator=generator, height=768, width=768)[0][0] +make_image_grid([img.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2) +``` + +
+ +
+ +## Optimizations + +Kandinsky is unique because it requires a prior pipeline to generate the mappings, and a second pipeline to decode the latents into an image. Optimization efforts should be focused on the second pipeline because that is where the bulk of the computation is done. Here are some tip to improve Kandinsky during inference. + +1. By default, the text-to-image pipeline uses the [`DDIMScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/ddim/#mindone.diffusers.DDIMScheduler) but you can replace it with another scheduler like [`DDPMScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/ddpm/#mindone.diffusers.DDPMScheduler) to see how that affects the tradeoff between inference speed and image quality: + +```py +from mindone.diffusers import DDPMScheduler +from mindone.diffusers import KandinskyCombinedPipeline +import mindspore as ms + +scheduler = DDPMScheduler.from_pretrained("kandinsky-community/kandinsky-2-1", subfolder="ddpm_scheduler") +pipe = KandinskyCombinedPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", scheduler=scheduler, mindspore_dtype=ms.float16, use_safetensors=True) +``` diff --git a/docs/diffusers/using-diffusers/loading.md b/docs/diffusers/using-diffusers/loading.md new file mode 100644 index 0000000000..39e57ca453 --- /dev/null +++ b/docs/diffusers/using-diffusers/loading.md @@ -0,0 +1,388 @@ + + +# Load pipelines + +Diffusion systems consist of multiple components like parameterized models and schedulers that interact in complex ways. That is why we designed the [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline) to wrap the complexity of the entire diffusion system into an easy-to-use API. At the same time, the [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline) is entirely customizable so you can modify each component to build a diffusion system for your use case. + +This guide will show you how to load: + +- pipelines from the Hub and locally +- different components into a pipeline +- multiple pipelines without increasing memory usage +- checkpoint variants such as different floating point types or non-exponential mean averaged (EMA) weights + +## Load a pipeline + +!!! tip + + Skip to the [DiffusionPipeline explained](#diffusionpipeline-explained) section if you're interested in an explanation about how the [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline) class works. + +There are two ways to load a pipeline for a task: + +1. Load the generic [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline) class and allow it to automatically detect the correct pipeline class from the checkpoint. +2. Load a specific pipeline class for a specific task. + +=== "Generic pipeline" + + The [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline) class is a simple and generic way to load the latest trending diffusion model from the [Hub](https://huggingface.co/models?library=diffusers&sort=trending). 
It uses the [`from_pretrained`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline.from_pretrained) method to automatically detect the correct pipeline class for a task from the checkpoint, downloads and caches all the required configuration and weight files, and returns a pipeline ready for inference. + + ```python + from mindone.diffusers import DiffusionPipeline + + pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True) + ``` + + This same checkpoint can also be used for an image-to-image task. The [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline) class can handle any task as long as you provide the appropriate inputs. For example, for an image-to-image task, you need to pass an initial image to the pipeline. + + ```py + from mindone.diffusers import DiffusionPipeline + from mindone.diffusers.utils import load_image + + pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True) + + init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png") + prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" + image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", image=init_image)[0][0] + ``` + +=== "Specific pipeline" + + Checkpoints can be loaded by their specific pipeline class if you already know it. For example, to load a Stable Diffusion model, use the [`StableDiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/text2img/#mindone.diffusers.StableDiffusionPipeline) class. + + ```python + from mindone.diffusers import StableDiffusionPipeline + + pipeline = StableDiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True) + ``` + + This same checkpoint may also be used for another task like image-to-image. To differentiate what task you want to use the checkpoint for, you have to use the corresponding task-specific pipeline class. For example, to use the same checkpoint for image-to-image, use the [`StableDiffusionImg2ImgPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/img2img/#mindone.diffusers.StableDiffusionImg2ImgPipeline) class. + + ```py + from mindone.diffusers import StableDiffusionImg2ImgPipeline + + pipeline = StableDiffusionImg2ImgPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True) + ``` + +### Local pipeline + +To load a pipeline locally, use [git-lfs](https://git-lfs.github.com/) to manually download a checkpoint to your local disk. + +```bash +git-lfs install +git clone https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5 +``` + +This creates a local folder, ./stable-diffusion-v1-5, on your disk and you should pass its path to [`from_pretrained`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline.from_pretrained). 
+ +```python +from mindone.diffusers import DiffusionPipeline + +stable_diffusion = DiffusionPipeline.from_pretrained("./stable-diffusion-v1-5", use_safetensors=True) +``` + +The [`from_pretrained`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline.from_pretrained) method won't download files from the Hub when it detects a local path, but this also means it won't download and cache the latest changes to a checkpoint. + +## Customize a pipeline + +You can customize a pipeline by loading different components into it. This is important because you can: + +- change to a scheduler with faster generation speed or higher generation quality depending on your needs (call the `scheduler.compatibles` method on your pipeline to see compatible schedulers) +- change a default pipeline component to a newer and better performing one + +For example, let's customize the default [stabilityai/stable-diffusion-xl-base-1.0](https://hf.co/stabilityai/stable-diffusion-xl-base-1.0) checkpoint with: + +- The [`HeunDiscreteScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/heun/#mindone.diffusers.HeunDiscreteScheduler) to generate higher quality images at the expense of slower generation speed. You must pass the `subfolder="scheduler"` parameter in [`from_pretrained`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/heun/#mindone.diffusers.HeunDiscreteScheduler.from_pretrained) to load the scheduler configuration into the correct [subfolder](https://hf.co/stabilityai/stable-diffusion-xl-base-1.0/tree/main/scheduler) of the pipeline repository. +- A more stable VAE that runs in fp16. + +```py +from mindone.diffusers import StableDiffusionXLPipeline, HeunDiscreteScheduler, AutoencoderKL +import mindspore as ms + +scheduler = HeunDiscreteScheduler.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", subfolder="scheduler") +vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", mindspore_dtype=ms.float16, use_safetensors=True) +``` + +Now pass the new scheduler and VAE to the [`StableDiffusionXLPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl/#mindone.diffusers.StableDiffusionXLPipeline). + +```py +pipeline = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + scheduler=scheduler, + vae=vae, + mindspore_dtype=ms.float16, + use_safetensors=True +) +``` + +## Safety checker + +Diffusers implements a [safety checker](https://github.com/The-truthh/mindone/blob/docs/mindone/diffusers/pipelines/stable_diffusion/safety_checker.py) for Stable Diffusion models which can generate harmful content. The safety checker screens the generated output against known hardcoded not-safe-for-work (NSFW) content. If for whatever reason you'd like to disable the safety checker, pass `safety_checker=None` to the [`from_pretrained`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline.from_pretrained) method. + +```python +from mindone.diffusers import DiffusionPipeline + +pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", safety_checker=None, use_safetensors=True) +""" +You have disabled the safety checker for by passing `safety_checker=None`. 
Ensure that you abide by the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend keeping the safety filter enabled in all public-facing circumstances, disabling it only for use cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 . +""" +``` + +## Checkpoint variants + +A checkpoint variant is usually a checkpoint whose weights are: + +- Stored in a different floating point type, such as [mindspore.float16](https://www.mindspore.cn/docs/zh-CN/r2.3.1/api_python/mindspore/mindspore.dtype.html#mindspore.dtype), because it only requires half the bandwidth and storage to download. You can't use this variant if you're continuing training or using a CPU. +- Non-exponential mean averaged (EMA) weights which shouldn't be used for inference. You should use this variant to continue finetuning a model. + +!!! tip + + When the checkpoints have identical model structures, but they were trained on different datasets and with a different training setup, they should be stored in separate repositories. For example, [stabilityai/stable-diffusion-2](https://hf.co/stabilityai/stable-diffusion-2) and [stabilityai/stable-diffusion-2-1](https://hf.co/stabilityai/stable-diffusion-2-1) are stored in separate repositories. + +Otherwise, a variant is **identical** to the original checkpoint. They have exactly the same serialization format (like [safetensors](https://huggingface.co/docs/diffusers/main/en/using-diffusers/using_safetensors)), model structure, and their weights have identical tensor shapes. + +| **checkpoint type** | **weight name** | **argument for loading weights** | +|---------------------|--------------------------------------|----------------------------------| +| original | diffusion_model.safetensors | | +| floating point | diffusion_model.fp16.safetensors | `variant`, `mindspore_dtype` | +| non-EMA | diffusion_model.non_ema.safetensors | `variant` | + +There are two important arguments for loading variants: + +- `mindspore_dtype` specifies the floating point precision of the loaded checkpoint. For example, if you want to save bandwidth by loading a fp16 variant, you should set `variant="fp16"` and `mindspore_dtype=mindspore.float16` to *convert the weights* to fp16. Otherwise, the fp16 weights are converted to the default fp32 precision. + + If you only set `mindspore_dtype=mindspore.float16`, the default fp32 weights are downloaded first and then converted to fp16. + +- `variant` specifies which files should be loaded from the repository. For example, if you want to load a non-EMA variant of a UNet from [stable-diffusion-v1-5/stable-diffusion-v1-5](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/tree/main/unet), set `variant="non_ema"` to download the `non_ema` file. 
+ +=== "fp16" + + ```py + from mindone.diffusers import DiffusionPipeline + import mindspore as ms + + pipeline = DiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", variant="fp16", mindspore_dtype=ms.float16, use_safetensors=True + ) + ``` + +=== "non-EMA" + + ```py + pipeline = DiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", variant="non_ema", use_safetensors=True + ) + ``` + +Use the `variant` parameter in the [`save_pretrained`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline.save_pretrained) method to save a checkpoint as a different floating point type or as a non-EMA variant. You should try save a variant to the same folder as the original checkpoint, so you have the option of loading both from the same folder. + +=== "fp16" + + ```python + from mindone.diffusers import DiffusionPipeline + + pipeline.save_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", variant="fp16") + ``` + +=== "non_ema" + + ```py + pipeline.save_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", variant="non_ema") + ``` + +If you don't save the variant to an existing folder, you must specify the `variant` argument otherwise it'll throw an `Exception` because it can't find the original checkpoint. + +```python +# 👎 this won't work +pipeline = DiffusionPipeline.from_pretrained( + "./stable-diffusion-v1-5", mindspore_dtype=mindspore.float16, use_safetensors=True +) +# 👍 this works +pipeline = DiffusionPipeline.from_pretrained( + "./stable-diffusion-v1-5", variant="fp16", mindspore_dtype=mindspore.float16, use_safetensors=True +) +``` + +## DiffusionPipeline explained + +As a class method, [`from_pretrained`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline.from_pretrained) is responsible for two things: + +- Download the latest version of the folder structure required for inference and cache it. If the latest folder structure is available in the local cache, [`from_pretrained`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline.from_pretrained) reuses the cache and won't redownload the files. +- Load the cached weights into the correct pipeline [class](../api/pipelines/overview.md#diffusers-summary) - retrieved from the `model_index.json` file - and return an instance of it. + +The pipelines' underlying folder structure corresponds directly with their class instances. For example, the [`StableDiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/text2img/#mindone.diffusers.StableDiffusionPipeline) corresponds to the folder structure in [`stable-diffusion-v1-5/stable-diffusion-v1-5`](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5). + +```python +from mindone.diffusers import DiffusionPipeline + +repo_id = "stable-diffusion-v1-5/stable-diffusion-v1-5" +pipeline = DiffusionPipeline.from_pretrained(repo_id, use_safetensors=True) +print(pipeline) +``` + +You'll see pipeline is an instance of [`StableDiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/text2img/#mindone.diffusers.StableDiffusionPipeline), which consists of seven components: + +- `"feature_extractor"`: a [`~transformers.CLIPImageProcessor`] from 🤗 Transformers. 
+- `"safety_checker"`: a [component](https://github.com/mindspore-lab/mindone/blob/master/mindone/diffusers/pipelines/stable_diffusion/safety_checker.py) for screening against harmful content. +- `"scheduler"`: an instance of [`PNDMScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/pndm/#mindone.diffusers.PNDMScheduler). +- `"text_encoder"`: a [`~transformers.CLIPTextModel`] from 🤗 Transformers. +- `"tokenizer"`: a [`~transformers.CLIPTokenizer`] from 🤗 Transformers. +- `"unet"`: an instance of [`UNet2DConditionModel`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/unet2d-cond/#mindone.diffusers.UNet2DConditionModel). +- `"vae"`: an instance of [`AutoencoderKL`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/autoencoderkl/#mindone.diffusers.AutoencoderKL). + +```json +StableDiffusionPipeline { + "_class_name": "StableDiffusionPipeline", + "_diffusers_version": "0.29.2", + "_name_or_path": "stable-diffusion-v1-5/stable-diffusion-v1-5", + "feature_extractor": [ + "transformers", + "CLIPImageProcessor" + ], + "image_encoder": [ + null, + null + ], + "requires_safety_checker": true, + "safety_checker": [ + "stable_diffusion", + "StableDiffusionSafetyChecker" + ], + "scheduler": [ + "diffusers", + "PNDMScheduler" + ], + "text_encoder": [ + "transformers", + "CLIPTextModel" + ], + "tokenizer": [ + "transformers", + "CLIPTokenizer" + ], + "unet": [ + "diffusers", + "UNet2DConditionModel" + ], + "vae": [ + "diffusers", + "AutoencoderKL" + ] +} +``` + +Compare the components of the pipeline instance to the [`stable-diffusion-v1-5/stable-diffusion-v1-5`](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/tree/main) folder structure, and you'll see there is a separate folder for each of the components in the repository: + +``` +. +├── feature_extractor +│   └── preprocessor_config.json +├── model_index.json +├── safety_checker +│   ├── config.json +| ├── model.fp16.safetensors +│ ├── model.safetensors +│ ├── pytorch_model.bin +| └── pytorch_model.fp16.bin +├── scheduler +│   └── scheduler_config.json +├── text_encoder +│   ├── config.json +| ├── model.fp16.safetensors +│ ├── model.safetensors +│ |── pytorch_model.bin +| └── pytorch_model.fp16.bin +├── tokenizer +│   ├── merges.txt +│   ├── special_tokens_map.json +│   ├── tokenizer_config.json +│   └── vocab.json +├── unet +│   ├── config.json +│   ├── diffusion_pytorch_model.bin +| |── diffusion_pytorch_model.fp16.bin +│ |── diffusion_pytorch_model.f16.safetensors +│ |── diffusion_pytorch_model.non_ema.bin +│ |── diffusion_pytorch_model.non_ema.safetensors +│ └── diffusion_pytorch_model.safetensors +|── vae +. ├── config.json +. 
├── diffusion_pytorch_model.bin + ├── diffusion_pytorch_model.fp16.bin + ├── diffusion_pytorch_model.fp16.safetensors + └── diffusion_pytorch_model.safetensors +``` + +You can access each of the components of the pipeline as an attribute to view its configuration: + +```py +pipeline.tokenizer +CLIPTokenizer( + name_or_path='/root/.cache/huggingface/hub/models--stable-diffusion-v1-5--stable-diffusion-v1-5/snapshots/f03de327dd89b501a01da37fc5240cf4fdba85a1/tokenizer', + vocab_size=49408, + model_max_length=77, + is_fast=False, + padding_side='right', + truncation_side='right', + special_tokens={ + 'bos_token': '<|startoftext|>', + 'eos_token': '<|endoftext|>', + 'unk_token': '<|endoftext|>', + 'pad_token': '<|endoftext|>'}, + clean_up_tokenization_spaces=True +), +added_tokens_decoder={ + 49406: AddedToken("<|startoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True), + 49407: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True), +} +``` + +Every pipeline expects a [`model_index.json`](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/blob/main/model_index.json) file that tells the [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline): + +- which pipeline class to load from `_class_name` +- which version of 🧨 Diffusers was used to create the model in `_diffusers_version` +- what components from which library are stored in the subfolders (`name` corresponds to the component and subfolder name, `library` corresponds to the name of the library to load the class from, and `class` corresponds to the class name) + +```json +{ + "_class_name": "StableDiffusionPipeline", + "_diffusers_version": "0.6.0", + "feature_extractor": [ + "transformers", + "CLIPImageProcessor" + ], + "safety_checker": [ + "stable_diffusion", + "StableDiffusionSafetyChecker" + ], + "scheduler": [ + "diffusers", + "PNDMScheduler" + ], + "text_encoder": [ + "transformers", + "CLIPTextModel" + ], + "tokenizer": [ + "transformers", + "CLIPTokenizer" + ], + "unet": [ + "diffusers", + "UNet2DConditionModel" + ], + "vae": [ + "diffusers", + "AutoencoderKL" + ] +} +``` diff --git a/docs/diffusers/using-diffusers/loading_adapters.md b/docs/diffusers/using-diffusers/loading_adapters.md new file mode 100644 index 0000000000..362a821e71 --- /dev/null +++ b/docs/diffusers/using-diffusers/loading_adapters.md @@ -0,0 +1,309 @@ + + +# Load adapters + +There are several [training](../training/overview.md) techniques for personalizing diffusion models to generate images of a specific subject or images in certain styles. Each of these training methods produces a different type of adapter. Some of the adapters generate an entirely new model, while other adapters only modify a smaller set of embeddings or weights. This means the loading process for each adapter is also different. + +This guide will show you how to load DreamBooth, textual inversion, and LoRA weights. + +!!! tip + + Feel free to browse the [Stable Diffusion Conceptualizer](https://huggingface.co/spaces/sd-concepts-library/stable-diffusion-conceptualizer), [LoRA the Explorer](https://huggingface.co/spaces/multimodalart/LoraTheExplorer), and the [Diffusers Models Gallery](https://huggingface.co/spaces/huggingface-projects/diffusers-gallery) for checkpoints and embeddings to use. 
+ +## DreamBooth + +[DreamBooth](https://dreambooth.github.io/) finetunes an *entire diffusion model* on just several images of a subject to generate images of that subject in new styles and settings. This method works by using a special word in the prompt that the model learns to associate with the subject image. Of all the training methods, DreamBooth produces the largest file size (usually a few GBs) because it is a full checkpoint model. + +Let's load the [herge_style](https://huggingface.co/sd-dreambooth-library/herge-style) checkpoint, which is trained on just 10 images drawn by Hergé, to generate images in that style. For it to work, you need to include the special word `herge_style` in your prompt to trigger the checkpoint: + +```py +from mindone.diffusers import StableDiffusionPipeline +import mindspore as ms + +pipeline = StableDiffusionPipeline.from_pretrained("sd-dreambooth-library/herge-style", revision="refs/pr/9", mindspore_dtype=ms.float16) +prompt = "A cute herge_style brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration" +image = pipeline(prompt)[0][0] +image +``` + +
+ +
+ +## Textual inversion + +[Textual inversion](https://textual-inversion.github.io/) is very similar to DreamBooth and it can also personalize a diffusion model to generate certain concepts (styles, objects) from just a few images. This method works by training and finding new embeddings that represent the images you provide with a special word in the prompt. As a result, the diffusion model weights stay the same and the training process produces a relatively tiny (a few KBs) file. + +Because textual inversion creates embeddings, it cannot be used on its own like DreamBooth and requires another model. + +```py +from mindone.diffusers import StableDiffusionPipeline +import mindspore as ms + +pipeline = StableDiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", revision="refs/pr/1", mindspore_dtype=ms.float16) +``` + +Now you can load the textual inversion embeddings with the [`load_textual_inversion`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/textual_inversion/#mindone.diffusers.loaders.textual_inversion.TextualInversionLoaderMixin.load_textual_inversion) method and generate some images. Let's load the [sd-concepts-library/gta5-artwork](https://huggingface.co/sd-concepts-library/gta5-artwork) embeddings and you'll need to include the special word `` in your prompt to trigger it: + +```py +pipeline.load_textual_inversion("sd-concepts-library/gta5-artwork") +prompt = "A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration, style" +image = pipeline(prompt)[0][0] +image +``` + +
+ +
+ +Textual inversion can also be trained on undesirable things to create *negative embeddings* to discourage a model from generating images with those undesirable things like blurry images or extra fingers on a hand. This can be an easy way to quickly improve your prompt. You'll also load the embeddings with [`load_textual_inversion`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/textual_inversion/#mindone.diffusers.loaders.textual_inversion.TextualInversionLoaderMixin.load_textual_inversion), but this time, you'll need two more parameters: + +- `weight_name`: specifies the weight file to load if the file was saved in the 🤗 Diffusers format with a specific name or if the file is stored in the A1111 format +- `token`: specifies the special word to use in the prompt to trigger the embeddings + +Let's load the [gsdf/EasyNegative](https://huggingface.co/datasets/gsdf/EasyNegativet) embeddings: + +```py +pipeline.load_textual_inversion( + "gsdf/EasyNegative", weight_name="EasyNegative.safetensors", token="EasyNegative" +) +``` + +Now you can use the `token` to generate an image with the negative embeddings: + +```py +prompt = "A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration, EasyNegative" +negative_prompt = "EasyNegative" + +image = pipeline(prompt, negative_prompt=negative_prompt, num_inference_steps=50)[0][0] +image +``` + +
+ +
+ +## LoRA + +[Low-Rank Adaptation (LoRA)](https://huggingface.co/papers/2106.09685) is a popular training technique because it is fast and generates smaller file sizes (a couple hundred MBs). Like the other methods in this guide, LoRA can train a model to learn new styles from just a few images. It works by inserting new weights into the diffusion model and then only the new weights are trained instead of the entire model. This makes LoRAs faster to train and easier to store. + +!!! tip + + LoRA is a very general training technique that can be used with other training methods. For example, it is common to train a model with DreamBooth and LoRA. It is also increasingly common to load and merge multiple LoRAs to create new and unique images. You can learn more about it in the in-depth [Merge LoRAs](merge_loras.md) guide since merging is outside the scope of this loading guide. + +LoRAs also need to be used with another model: + +```py +from mindone.diffusers import StableDiffusionXLPipeline +import mindspore as ms + +pipeline = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16) +``` + +Then use the [`load_lora_weights`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/lora/#mindone.diffusers.loaders.lora.LoraLoaderMixin.load_lora_weights) method to load the [ostris/super-cereal-sdxl-lora](https://huggingface.co/ostris/super-cereal-sdxl-lora) weights and specify the weights filename from the repository: + +```py +pipeline.load_lora_weights("ostris/super-cereal-sdxl-lora", weight_name="cereal_box_sdxl_v1.safetensors") +prompt = "bears, pizza bites" +image = pipeline(prompt)[0][0] +image +``` + +
+ +
+ +The [`load_lora_weights`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/lora/#mindone.diffusers.loaders.lora.LoraLoaderMixin.load_lora_weights) method loads LoRA weights into both the UNet and text encoder. It is the preferred way for loading LoRAs because it can handle cases where: + +- the LoRA weights don't have separate identifiers for the UNet and text encoder +- the LoRA weights have separate identifiers for the UNet and text encoder + +But if you only need to load LoRA weights into the UNet, then you can use the [`load_attn_procs`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/unet/#mindone.diffusers.loaders.unet.UNet2DConditionLoadersMixin.load_attn_procs) method. Let's load the [jbilcke-hf/sdxl-cinematic-1](https://huggingface.co/jbilcke-hf/sdxl-cinematic-1) LoRA: + +```py +from mindone.diffusers import StableDiffusionXLPipeline +import mindspore as ms + +pipeline = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16) +pipeline.unet.load_attn_procs("jbilcke-hf/sdxl-cinematic-1", weight_name="pytorch_lora_weights.safetensors") + +# use cnmt in the prompt to trigger the LoRA +prompt = "A cute cnmt eating a slice of pizza, stunning color scheme, masterpiece, illustration" +image = pipeline(prompt)[0][0] +image +``` + +
+ +
+ +To unload the LoRA weights, use the [`unload_lora_weights`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/lora/#mindone.diffusers.loaders.lora.LoraLoaderMixin.unload_lora_weights) method to discard the LoRA weights and restore the model to its original weights: + +```py +pipeline.unload_lora_weights() +``` + +### Adjust LoRA weight scale + +For both [`load_lora_weights`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/lora/#mindone.diffusers.loaders.lora.LoraLoaderMixin.load_lora_weights) and [`load_attn_procs`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/unet/#mindone.diffusers.loaders.unet.UNet2DConditionLoadersMixin.load_attn_procs), you can pass the `cross_attention_kwargs={"scale": 0.5}` parameter to adjust how much of the LoRA weights to use. A value of `0` is the same as only using the base model weights, and a value of `1` is equivalent to using the fully finetuned LoRA. + +For more granular control on the amount of LoRA weights used per layer, you can use [`set_adapters`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/lora/#mindone.diffusers.loaders.lora.LoraLoaderMixin.set_adapters) and pass a dictionary specifying by how much to scale the weights in each layer by. +```python +pipe = ... # create pipeline +pipe.load_lora_weights(..., adapter_name="my_adapter") +scales = { + "text_encoder": 0.5, + "text_encoder_2": 0.5, # only usable if pipe has a 2nd text encoder + "unet": { + "down": 0.9, # all transformers in the down-part will use scale 0.9 + # "mid" # in this example "mid" is not given, therefore all transformers in the mid part will use the default scale 1.0 + "up": { + "block_0": 0.6, # all 3 transformers in the 0th block in the up-part will use scale 0.6 + "block_1": [0.4, 0.8, 1.0], # the 3 transformers in the 1st block in the up-part will use scales 0.4, 0.8 and 1.0 respectively + } + } +} +pipe.set_adapters("my_adapter", scales) +``` + +This also works with multiple adapters - see [this guide](https://mindspore-lab.github.io/mindone/latest/diffusers/tutorials/using_peft_for_inference/#customize-adapters-strength) for how to do it. + +!!! warning + + Currently, [`set_adapters`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/lora/#mindone.diffusers.loaders.lora.LoraLoaderMixin.set_adapters) only supports scaling attention weights. If a LoRA has other parts (e.g., resnets or down-/upsamplers), they will keep a scale of 1.0. + +### Kohya and TheLastBen + +Other popular LoRA trainers from the community include those by [Kohya](https://github.com/kohya-ss/sd-scripts/) and [TheLastBen](https://github.com/TheLastBen/fast-stable-diffusion). These trainers create different LoRA checkpoints than those trained by 🤗 Diffusers, but they can still be loaded in the same way. 
+ +=== "Kohya" + + To load a Kohya LoRA, let's download the [Blueprintify SD XL 1.0](https://civitai.com/models/150986/blueprintify-sd-xl-10) checkpoint from [Civitai](https://civitai.com/) as an example: + + ```sh + !wget https://civitai.com/api/download/models/168776 -O blueprintify-sd-xl-10.safetensors + ``` + + Load the LoRA checkpoint with the [`load_lora_weights`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/lora/#mindone.diffusers.loaders.lora.LoraLoaderMixin.load_lora_weights) method, and specify the filename in the `weight_name` parameter: + + ```py + from mindone.diffusers import StableDiffusionXLPipeline + import mindspore as ms + + pipeline = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16) + pipeline.load_lora_weights("path/to/weights", weight_name="blueprintify-sd-xl-10.safetensors") + ``` + + Generate an image: + + ```py + # use bl3uprint in the prompt to trigger the LoRA + prompt = "bl3uprint, a highly detailed blueprint of the eiffel tower, explaining how to build all parts, many txt, blueprint grid backdrop" + image = pipeline(prompt)[0][0] + image + ``` + + !!! warning + + Some limitations of using Kohya LoRAs with 🤗 Diffusers include: + + - Images may not look like those generated by UIs - like ComfyUI - for multiple reasons, which are explained [here](https://github.com/huggingface/diffusers/pull/4287/#issuecomment-1655110736). + - [LyCORIS checkpoints](https://github.com/KohakuBlueleaf/LyCORIS) aren't fully supported. The [`load_lora_weights`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/lora/#mindone.diffusers.loaders.lora.LoraLoaderMixin.load_lora_weights) method loads LyCORIS checkpoints with LoRA and LoCon modules, but Hada and LoKR are not supported. + +=== "TheLastBen" + + Loading a checkpoint from TheLastBen is very similar. For example, to load the [TheLastBen/William_Eggleston_Style_SDXL](https://huggingface.co/TheLastBen/William_Eggleston_Style_SDXL) checkpoint: + + ```py + from mindone.diffusers import StableDiffusionXLPipeline + import mindspore as ms + + pipeline = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16) + pipeline.load_lora_weights("TheLastBen/William_Eggleston_Style_SDXL", weight_name="wegg.safetensors") + + # use by william eggleston in the prompt to trigger the LoRA + prompt = "a house by william eggleston, sunrays, beautiful, sunlight, sunrays, beautiful" + image = pipeline(prompt=prompt)[0][0] + image + ``` + +## IP-Adapter + +[IP-Adapter](https://ip-adapter.github.io/) is a lightweight adapter that enables image prompting for any diffusion model. This adapter works by decoupling the cross-attention layers of the image and text features. All the other model components are frozen and only the embedded image features in the UNet are trained. As a result, IP-Adapter files are typically only ~100MBs. + +You can learn more about how to use IP-Adapter for different tasks and specific use cases in the [IP-Adapter](../using-diffusers/ip_adapter.md) guide. + +!!! tip + + Diffusers currently only supports IP-Adapter for some of the most popular pipelines. Feel free to open a feature request if you have a cool use case and want to integrate IP-Adapter with an unsupported pipeline! + Official IP-Adapter checkpoints are available from [h94/IP-Adapter](https://huggingface.co/h94/IP-Adapter). + +To start, load a Stable Diffusion checkpoint. 
+ +```py +from mindone.diffusers import StableDiffusionPipeline +import mindspore as ms +from mindone.diffusers.utils import load_image +import numpy as np + +pipeline = StableDiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", mindspore_dtype=ms.float16) +``` + +Then load the IP-Adapter weights and add it to the pipeline with the [`load_ip_adapter`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/ip_adapter/#mindone.diffusers.loaders.ip_adapter.IPAdapterMixin.load_ip_adapter) method. + +```py +pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.safetensors") +``` + +Once loaded, you can use the pipeline with an image and text prompt to guide the image generation process. + +```py +image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/load_neg_embed.png") +generator = np.random.Generator(np.random.PCG64(33)) +images = pipeline( + prompt='best quality, high quality, wearing sunglasses', + ip_adapter_image=image, + negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality", + num_inference_steps=50, + generator=generator, +)[0][0] +images +``` + +
+ +
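+The influence of the image prompt can usually be tuned as well. As a hedged sketch, assuming `mindone.diffusers` mirrors the upstream `IPAdapterMixin.set_ip_adapter_scale` helper, a lower scale gives more weight to the text prompt and a higher scale to the image prompt:
+
+```py
+# scale=1.0 applies the full IP-Adapter effect; lower values reduce the image prompt's influence
+pipeline.set_ip_adapter_scale(0.6)
+
+images = pipeline(
+    prompt="best quality, high quality, wearing sunglasses",
+    ip_adapter_image=image,
+    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
+    num_inference_steps=50,
+    generator=generator,
+)[0][0]
+```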
+
+### IP-Adapter Plus
+
+IP-Adapter relies on an image encoder to generate image features. If the IP-Adapter repository contains an `image_encoder` subfolder, the image encoder is automatically loaded and registered to the pipeline. Otherwise, you'll need to explicitly load the image encoder with a `CLIPVisionModelWithProjection` model from `mindone.transformers` and pass it to the pipeline.
+
+This is the case for *IP-Adapter Plus* checkpoints, which use the ViT-H image encoder.
+
+```py
+from mindone.diffusers import StableDiffusionXLPipeline
+from mindone.transformers import CLIPVisionModelWithProjection
+import mindspore as ms
+
+image_encoder = CLIPVisionModelWithProjection.from_pretrained(
+    "h94/IP-Adapter",
+    subfolder="models/image_encoder",
+    mindspore_dtype=ms.float16
+)
+
+pipeline = StableDiffusionXLPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0",
+    image_encoder=image_encoder,
+    mindspore_dtype=ms.float16
+)
+
+pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter-plus_sdxl_vit-h.safetensors")
+```
diff --git a/docs/diffusers/using-diffusers/loading_overview.md b/docs/diffusers/using-diffusers/loading_overview.md
new file mode 100644
index 0000000000..67b0bd62de
--- /dev/null
+++ b/docs/diffusers/using-diffusers/loading_overview.md
@@ -0,0 +1,17 @@
+
+
+# Overview
+
+🧨 Diffusers offers many pipelines, models, and schedulers for generative tasks. To make loading these components as simple as possible, we provide a single and unified method - `from_pretrained()` - that loads any of these components from either the Hugging Face [Hub](https://huggingface.co/models?library=diffusers&sort=downloads) or your local machine. Whenever you load a pipeline or model, the latest files are automatically downloaded and cached so you can quickly reuse them next time without redownloading the files.
+
+This section will show you everything you need to know about loading pipelines, how to load different components in a pipeline, how to load checkpoint variants, and how to load community pipelines. You'll also learn how to load schedulers and compare the speed and quality trade-offs of using different schedulers.
diff --git a/docs/diffusers/using-diffusers/marigold_usage.md b/docs/diffusers/using-diffusers/marigold_usage.md
new file mode 100644
index 0000000000..f01184808e
--- /dev/null
+++ b/docs/diffusers/using-diffusers/marigold_usage.md
@@ -0,0 +1,406 @@
+
+
+# Marigold Pipelines for Computer Vision Tasks
+
+[Marigold](../api/pipelines/marigold.md) is a novel diffusion-based dense prediction approach, and a set of pipelines for various computer vision tasks, such as monocular depth estimation.
+
+This guide will show you how to use Marigold to obtain fast and high-quality predictions for images and videos.
+
+Each pipeline supports one computer vision task, which takes an RGB image as input and produces a *prediction* of the modality of interest, such as a depth map of the input image.
+Currently, the following tasks are implemented: + +| Pipeline | Predicted Modalities | Demos | +|---------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------:| +| [MarigoldDepthPipeline](https://github.com/mindspore-lab/mindone/tree/master/mindone/diffusers/pipelines/marigold/pipeline_marigold_depth.py) | [Depth](https://en.wikipedia.org/wiki/Depth_map), [Disparity](https://en.wikipedia.org/wiki/Binocular_disparity) | [Fast Demo (LCM)](https://huggingface.co/spaces/prs-eth/marigold-lcm), [Slow Original Demo (DDIM)](https://huggingface.co/spaces/prs-eth/marigold) | +| [MarigoldNormalsPipeline](https://github.com/mindspore-lab/mindone/tree/master/mindone/diffusers/pipelines/marigold/pipeline_marigold_normals.py) | [Surface normals](https://en.wikipedia.org/wiki/Normal_mapping) | [Fast Demo (LCM)](https://huggingface.co/spaces/prs-eth/marigold-normals-lcm) | + +The original checkpoints can be found under the [PRS-ETH](https://huggingface.co/prs-eth/) Hugging Face organization. +These checkpoints are meant to work with diffusers pipelines and the [original codebase](https://github.com/prs-eth/marigold). +The original code can also be used to train new checkpoints. + +| Checkpoint | Modality | Comment | +|-----------------------------------------------------------------------------------------------|----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| [prs-eth/marigold-v1-0](https://huggingface.co/prs-eth/marigold-v1-0) | Depth | The first Marigold Depth checkpoint, which predicts *affine-invariant depth* maps. The performance of this checkpoint in benchmarks was studied in the original [paper](https://huggingface.co/papers/2312.02145). Designed to be used with the `DDIMScheduler` at inference, it requires at least 10 steps to get reliable predictions. Affine-invariant depth prediction has a range of values in each pixel between 0 (near plane) and 1 (far plane); both planes are chosen by the model as part of the inference process. See the `MarigoldImageProcessor` reference for visualization utilities. | +| [prs-eth/marigold-depth-lcm-v1-0](https://huggingface.co/prs-eth/marigold-depth-lcm-v1-0) | Depth | The fast Marigold Depth checkpoint, fine-tuned from `prs-eth/marigold-v1-0`. Designed to be used with the `LCMScheduler` at inference, it requires as little as 1 step to get reliable predictions. The prediction reliability saturates at 4 steps and declines after that. | +| [prs-eth/marigold-normals-v0-1](https://huggingface.co/prs-eth/marigold-normals-v0-1) | Normals | A preview checkpoint for the Marigold Normals pipeline. 
Designed to be used with the `DDIMScheduler` at inference, it requires at least 10 steps to get reliable predictions. The surface normals predictions are unit-length 3D vectors with values in the range from -1 to 1. *This checkpoint will be phased out after the release of `v1-0` version.* | +| [prs-eth/marigold-normals-lcm-v0-1](https://huggingface.co/prs-eth/marigold-normals-lcm-v0-1) | Normals | The fast Marigold Normals checkpoint, fine-tuned from `prs-eth/marigold-normals-v0-1`. Designed to be used with the `LCMScheduler` at inference, it requires as little as 1 step to get reliable predictions. The prediction reliability saturates at 4 steps and declines after that. *This checkpoint will be phased out after the release of `v1-0` version.* | + +The examples below are mostly given for depth prediction, but they can be universally applied with other supported modalities. +We showcase the predictions using the same input image of Albert Einstein generated by Midjourney. +This makes it easier to compare visualizations of the predictions across various modalities and checkpoints. + +
+*Example input image for all Marigold pipelines*
+ +### Depth Prediction Quick Start + +To get the first depth prediction, load `prs-eth/marigold-depth-lcm-v1-0` checkpoint into `MarigoldDepthPipeline` pipeline, put the image through the pipeline, and save the predictions: + +```python +import mindone.diffusers +import mindspore as ms + +pipe = mindone.diffusers.MarigoldDepthPipeline.from_pretrained( + "prs-eth/marigold-depth-lcm-v1-0", variant="fp16", mindspore_dtype=ms.float16 +) + +image = mindone.diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") +depth = pipe(image) + +vis = pipe.image_processor.visualize_depth(depth[0]) +vis[0].save("einstein_depth.png") + +depth_16bit = pipe.image_processor.export_depth_to_16bit_png(depth[0]) +depth_16bit[0].save("einstein_depth_16bit.png") +``` + +The visualization function for depth [`~pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_depth`] applies one of [matplotlib's colormaps](https://matplotlib.org/stable/users/explain/colors/colormaps.html) (`Spectral` by default) to map the predicted pixel values from a single-channel `[0, 1]` depth range into an RGB image. +With the `Spectral` colormap, pixels with near depth are painted red, and far pixels are assigned blue color. +The 16-bit PNG file stores the single channel values mapped linearly from the `[0, 1]` range into `[0, 65535]`. +Below are the raw and the visualized predictions; as can be seen, dark areas (mustache) are easier to distinguish in the visualization: + +
+*Predicted depth (16-bit PNG)*
+
+*Predicted depth visualization (Spectral)*
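+
+Since the 16-bit PNG is just the `[0, 1]` prediction scaled linearly to `[0, 65535]`, it can be read back into a float array for downstream processing. A minimal sketch, assuming the `einstein_depth_16bit.png` file saved above:
+
+```python
+import numpy as np
+from PIL import Image
+
+# undo the linear [0, 1] -> [0, 65535] mapping used for the 16-bit PNG export
+depth_16bit = np.asarray(Image.open("einstein_depth_16bit.png"), dtype=np.float32)
+depth = depth_16bit / 65535.0
+print(depth.shape, depth.min(), depth.max())  # values are back in the [0, 1] affine-invariant range
+```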
+ +### Surface Normals Prediction Quick Start + +Load `prs-eth/marigold-normals-lcm-v0-1` checkpoint into `MarigoldNormalsPipeline` pipeline, put the image through the pipeline, and save the predictions: + +```python +import mindone.diffusers +import mindspore as ms + +pipe = mindone.diffusers.MarigoldNormalsPipeline.from_pretrained( + "prs-eth/marigold-normals-lcm-v0-1", variant="fp16", mindspore_dtype=ms.float16 +) + +image = mindone.diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") +normals = pipe(image) + +vis = pipe.image_processor.visualize_normals(normals[0]) +vis[0].save("einstein_normals.png") +``` + +The visualization function for normals [`~pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_normals`] maps the three-dimensional prediction with pixel values in the range `[-1, 1]` into an RGB image. +The visualization function supports flipping surface normals axes to make the visualization compatible with other choices of the frame of reference. +Conceptually, each pixel is painted according to the surface normal vector in the frame of reference, where `X` axis points right, `Y` axis points up, and `Z` axis points at the viewer. +Below is the visualized prediction: + +
+*Predicted surface normals visualization*
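+
+To make the color mapping concrete before walking through specific points in the image, here is a minimal sketch of how a unit normal vector is turned into an RGB color, assuming the conventional `(n + 1) / 2` convention described below:
+
+```python
+import numpy as np
+
+def normal_to_rgb(n):
+    """Map a unit surface normal with components in [-1, 1] to an 8-bit RGB color."""
+    n = np.asarray(n, dtype=np.float32)
+    return np.round((n + 1.0) / 2.0 * 255.0).astype(np.uint8)
+
+print(normal_to_rgb([0.0, 0.0, 1.0]))  # -> [128 128 255], a normal pointing straight at the viewer
+```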
+ +In this example, the nose tip almost certainly has a point on the surface, in which the surface normal vector points straight at the viewer, meaning that its coordinates are `[0, 0, 1]`. +This vector maps to the RGB `[128, 128, 255]`, which corresponds to the violet-blue color. +Similarly, a surface normal on the cheek in the right part of the image has a large `X` component, which increases the red hue. +Points on the shoulders pointing up with a large `Y` promote green color. + +### Speeding up inference + +The above quick start snippets are already optimized for speed: they load the LCM checkpoint, use the `fp16` variant of weights and computation, and perform just one denoising diffusion step. +The `pipe(image)` call completes in 180ms on Ascend 910B in Graph mode. +Internally, the input image is encoded with the Stable Diffusion VAE encoder, then the U-Net performs one denoising step, and finally, the prediction latent is decoded with the VAE decoder into pixel space. +In this case, two out of three module calls are dedicated to converting between pixel and latent space of LDM. +Because Marigold's latent space is compatible with the base Stable Diffusion, it is possible to speed up the pipeline call by more than 3x (85ms on RTX 3090) by using a [lightweight replacement of the SD VAE](../api/models/autoencoder_tiny.md): + +```diff + import mindone.diffusers + import mindspore as ms + + pipe = mindone.diffusers.MarigoldDepthPipeline.from_pretrained( + "prs-eth/marigold-depth-lcm-v1-0", variant="fp16", mindspore_dtype=ms.float16 + ) + ++ pipe.vae = mindone.diffusers.AutoencoderTiny.from_pretrained( ++ "madebyollin/taesd", mindspore_dtype=ms.float16 ++ ) + + image = mindone.diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") + depth = pipe(image) +``` + +## Qualitative Comparison with Depth Anything + +With the above speed optimizations, Marigold delivers predictions with more details and faster than [Depth Anything](https://huggingface.co/docs/transformers/main/en/model_doc/depth_anything) with the largest checkpoint [LiheYoung/depth-anything-large-hf](https://huggingface.co/LiheYoung/depth-anything-large-hf): + +
+*Marigold LCM fp16 with Tiny AutoEncoder*
+
+*Depth Anything Large*
+ +## Maximizing Precision and Ensembling + +Marigold pipelines have a built-in ensembling mechanism combining multiple predictions from different random latents. +This is a brute-force way of improving the precision of predictions, capitalizing on the generative nature of diffusion. +The ensembling path is activated automatically when the `ensemble_size` argument is set greater than `1`. +When aiming for maximum precision, it makes sense to adjust `num_inference_steps` simultaneously with `ensemble_size`. +The recommended values vary across checkpoints but primarily depend on the scheduler type. +The effect of ensembling is particularly well-seen with surface normals: + +```python +import mindone.diffusers + +model_path = "prs-eth/marigold-normals-v0-1" + +model_paper_kwargs = { + mindone.diffusers.schedulers.DDIMScheduler: { + "num_inference_steps": 10, + "ensemble_size": 10, + }, + mindone.diffusers.schedulers.LCMScheduler: { + "num_inference_steps": 4, + "ensemble_size": 5, + }, +} + +image = mindone.diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") + +pipe = mindone.diffusers.MarigoldNormalsPipeline.from_pretrained(model_path) +pipe_kwargs = model_paper_kwargs[type(pipe.scheduler)] + +depth = pipe(image, **pipe_kwargs) + +vis = pipe.image_processor.visualize_normals(depth[0]) +vis[0].save("einstein_normals.png") +``` + +
+*Surface normals, no ensembling*
+
+*Surface normals, with ensembling*
+ +As can be seen, all areas with fine-grained structurers, such as hair, got more conservative and on average more correct predictions. +Such a result is more suitable for precision-sensitive downstream tasks, such as 3D reconstruction. + +## Quantitative Evaluation + +To evaluate Marigold quantitatively in standard leaderboards and benchmarks (such as NYU, KITTI, and other datasets), follow the evaluation protocol outlined in the paper: load the full precision fp32 model and use appropriate values for `num_inference_steps` and `ensemble_size`. +Optionally seed randomness to ensure reproducibility. Maximizing `batch_size` will deliver maximum device utilization. + +```python +import mindone.diffusers +import mindspore as ms +import numpy as np + +seed = 2024 +model_path = "prs-eth/marigold-v1-0" + +model_paper_kwargs = { + mindone.diffusers.schedulers.DDIMScheduler: { + "num_inference_steps": 50, + "ensemble_size": 10, + }, + mindone.diffusers.schedulers.LCMScheduler: { + "num_inference_steps": 4, + "ensemble_size": 10, + }, +} + +image = mindone.diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg") + +generator = np.random.Generator(np.random.PCG64(seed)) +pipe = mindone.diffusers.MarigoldDepthPipeline.from_pretrained(model_path) +pipe_kwargs = model_paper_kwargs[type(pipe.scheduler)] + +depth = pipe(image, generator=generator, **pipe_kwargs) + +# evaluate metrics +``` + +## Frame-by-frame Video Processing with Temporal Consistency + +Due to Marigold's generative nature, each prediction is unique and defined by the random noise sampled for the latent initialization. +This becomes an obvious drawback compared to traditional end-to-end dense regression networks, as exemplified in the following videos: + +
+*Input video*
+
+*Marigold Depth applied to input video frames independently*
+ +To address this issue, it is possible to pass `latents` argument to the pipelines, which defines the starting point of diffusion. +Empirically, we found that a convex combination of the very same starting point noise latent and the latent corresponding to the previous frame prediction give sufficiently smooth results, as implemented in the snippet below: + +```python +import imageio +from PIL import Image +from tqdm import tqdm +import mindone.diffusers +import mindspore as ms +import numpy as np + +path_in = "obama.mp4" +path_out = "obama_depth.gif" + +pipe = mindone.diffusers.MarigoldDepthPipeline.from_pretrained( + "prs-eth/marigold-depth-lcm-v1-0", variant="fp16", mindspore_dtype=ms.float16 +) +pipe.vae = mindone.diffusers.AutoencoderTiny.from_pretrained( + "madebyollin/taesd", mindspore_dtype=ms.float16 +) +pipe.set_progress_bar_config(disable=True) + +with imageio.get_reader(path_in) as reader: + size = reader.get_meta_data()['size'] + last_frame_latent = None + latent_common = ms.Tensor(np.random.default_rng().standard_normal( + (1, 4, 768 * size[1] // (8 * max(size)), 768 * size[0] // (8 * max(size))) + ), dtype=ms.float16) + + out = [] + for frame_id, frame in tqdm(enumerate(reader), desc="Processing Video"): + frame = Image.fromarray(frame) + latents = latent_common + if last_frame_latent is not None: + latents = 0.9 * latents + 0.1 * last_frame_latent + + depth = pipe( + frame, match_input_resolution=False, latents=latents, output_latent=True + ) + last_frame_latent = depth[2] + out.append(pipe.image_processor.visualize_depth(depth[0])[0]) + + mindone.diffusers.utils.export_to_gif(out, path_out, fps=reader.get_meta_data()['fps']) +``` + +Here, the diffusion process starts from the given computed latent. +The pipeline sets `output_latent=True` to access `out.latent` and computes its contribution to the next frame's latent initialization. +The result is much more stable now: + +
+*Marigold Depth applied to input video frames independently*
+
+*Marigold Depth with forced latents initialization*
+ +## Marigold for ControlNet + +A very common application for depth prediction with diffusion models comes in conjunction with ControlNet. +Depth crispness plays a crucial role in obtaining high-quality results from ControlNet. +As seen in comparisons with other methods above, Marigold excels at that task. +The snippet below demonstrates how to load an image, compute depth, and pass it into ControlNet in a compatible format: + +```python +import mindspore as ms +import mindone.diffusers +import numpy as np + +generator = np.random.Generator(np.random.PCG64(2024)) +image = mindone.diffusers.utils.load_image( + "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_depth_source.png" +) + +pipe = mindone.diffusers.MarigoldDepthPipeline.from_pretrained( + "prs-eth/marigold-depth-lcm-v1-0", mindspore_dtype=ms.float16, variant="fp16" +) + +depth_image = pipe(image, generator=generator)[0] +depth_image = pipe.image_processor.visualize_depth(depth_image, color_map="binary") +depth_image[0].save("motorcycle_controlnet_depth.png") + +controlnet = mindone.diffusers.ControlNetModel.from_pretrained( + "diffusers/controlnet-depth-sdxl-1.0", mindspore_dtype=ms.float16, variant="fp16" +) +pipe = mindone.diffusers.StableDiffusionXLControlNetPipeline.from_pretrained( + "SG161222/RealVisXL_V4.0", mindspore_dtype=ms.float16, variant="fp16", controlnet=controlnet +) +pipe.scheduler = mindone.diffusers.DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True) + +controlnet_out = pipe( + prompt="high quality photo of a sports bike, city", + negative_prompt="", + guidance_scale=6.5, + num_inference_steps=25, + image=depth_image, + controlnet_conditioning_scale=0.7, + control_guidance_end=0.7, + generator=generator, +)[0] +controlnet_out[0].save("motorcycle_controlnet_out.png") +``` + +
+*Input image*
+
+*Depth in the format compatible with ControlNet*
+
+*ControlNet generation, conditioned on depth and prompt: "high quality photo of a sports bike, city"*
+
+Hopefully, you will find Marigold useful for solving your downstream tasks, whether as part of a broader generative workflow or for a perception task, such as 3D reconstruction.
diff --git a/docs/diffusers/using-diffusers/merge_loras.md b/docs/diffusers/using-diffusers/merge_loras.md
new file mode 100644
index 0000000000..954a39dbdb
--- /dev/null
+++ b/docs/diffusers/using-diffusers/merge_loras.md
@@ -0,0 +1,98 @@
+
+
+# Merge LoRAs
+
+It can be fun and creative to use multiple [LoRAs](https://huggingface.co/docs/peft/conceptual_guides/adapter#low-rank-adaptation-lora) together to generate something entirely new and unique. This works by merging multiple LoRA weights together to produce images that are a blend of different styles. Diffusers provides a few methods to merge LoRAs depending on *how* you want to merge their weights, which can affect image quality.
+
+This guide will show you how to merge LoRAs with the [`set_adapters`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/unet/#mindone.diffusers.loaders.unet.UNet2DConditionLoadersMixin.set_adapters) method. To improve inference speed and reduce the memory usage of merged LoRAs, you'll also see how to use the [`fuse_lora`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/lora/#mindone.diffusers.loaders.lora.LoraLoaderMixin.fuse_lora) method to fuse the LoRA weights with the original weights of the underlying model.
+
+For this guide, load a Stable Diffusion XL (SDXL) checkpoint and the [ostris/ikea-instructions-lora-sdxl](https://huggingface.co/ostris/ikea-instructions-lora-sdxl) and [lordjia/by-feng-zikai](https://huggingface.co/lordjia/by-feng-zikai) LoRAs with the [`load_lora_weights`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/lora/#mindone.diffusers.loaders.lora.LoraLoaderMixin.load_lora_weights) method. You'll need to assign each LoRA an `adapter_name` to combine them later.
+
+```py
+from mindone.diffusers import DiffusionPipeline
+import mindspore as ms
+import numpy as np
+
+pipeline = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16)
+pipeline.load_lora_weights("ostris/ikea-instructions-lora-sdxl", weight_name="ikea_instructions_xl_v1_5.safetensors", adapter_name="ikea")
+pipeline.load_lora_weights("lordjia/by-feng-zikai", weight_name="fengzikai_v1.0_XL.safetensors", adapter_name="feng")
+```
+
+## set_adapters
+
+The [`set_adapters`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/unet/#mindone.diffusers.loaders.unet.UNet2DConditionLoadersMixin.set_adapters) method merges LoRA adapters by concatenating their weighted matrices. Use the adapter name to specify which LoRAs to merge, and the `adapter_weights` parameter to control the scaling for each LoRA. For example, if `adapter_weights=[0.5, 0.5]`, then the merged LoRA output is an average of both LoRAs. Try adjusting the adapter weights to see how it affects the generated image!
+
+```py
+pipeline.set_adapters(["ikea", "feng"], adapter_weights=[0.7, 0.8])
+
+generator = np.random.Generator(np.random.PCG64(0))
+prompt = "A bowl of ramen shaped like a cute kawaii bear, by Feng Zikai"
+image = pipeline(prompt, generator=generator, cross_attention_kwargs={"scale": 1.0})[0][0]
+image
+```
+
+ +
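+You can also switch back to a single adapter, or temporarily turn the LoRAs off, without reloading anything. A small sketch, assuming the pipeline from above and that `mindone.diffusers` mirrors the upstream `disable_lora` helper:
+
+```py
+# use only the "ikea" adapter for the next generation
+pipeline.set_adapters("ikea")
+image = pipeline(prompt, generator=np.random.Generator(np.random.PCG64(0)))[0][0]
+
+# turn off all LoRAs and fall back to the base model weights
+pipeline.disable_lora()
+```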
+
+## fuse_lora
+
+The [`set_adapters`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/unet/#mindone.diffusers.loaders.unet.UNet2DConditionLoadersMixin.set_adapters) method requires loading the base model and the LoRA adapters separately, which incurs some overhead. The [`fuse_lora`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/lora/#mindone.diffusers.loaders.lora.LoraLoaderMixin.fuse_lora) method allows you to fuse the LoRA weights directly with the original weights of the underlying model. This way, you only load the model once, which can increase inference speed and lower memory usage.
+
+You can use PEFT to easily fuse/unfuse multiple adapters directly into the model weights (both UNet and text encoder) using the [`fuse_lora`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/lora/#mindone.diffusers.loaders.lora.LoraLoaderMixin.fuse_lora) method, which can lead to a speed-up in inference and lower VRAM usage.
+
+For example, if you have a base model and adapters loaded and set as active with the following adapter weights:
+
+```py
+from mindone.diffusers import DiffusionPipeline
+import mindspore as ms
+
+pipeline = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16)
+pipeline.load_lora_weights("ostris/ikea-instructions-lora-sdxl", weight_name="ikea_instructions_xl_v1_5.safetensors", adapter_name="ikea")
+pipeline.load_lora_weights("lordjia/by-feng-zikai", weight_name="fengzikai_v1.0_XL.safetensors", adapter_name="feng")
+
+pipeline.set_adapters(["ikea", "feng"], adapter_weights=[0.7, 0.8])
+```
+
+Fuse these LoRAs into the UNet with the [`fuse_lora`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/lora/#mindone.diffusers.loaders.lora.LoraLoaderMixin.fuse_lora) method. The `lora_scale` parameter controls how much the output is scaled by the LoRA weights. It is important to make the `lora_scale` adjustments in the [`fuse_lora`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/lora/#mindone.diffusers.loaders.lora.LoraLoaderMixin.fuse_lora) method because it won't work if you try to pass `scale` to the `cross_attention_kwargs` in the pipeline.
+
+```py
+pipeline.fuse_lora(adapter_names=["ikea", "feng"], lora_scale=1.0)
+```
+
+Then you should use [`unload_lora_weights`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/lora/#mindone.diffusers.loaders.lora.LoraLoaderMixin.unload_lora_weights) to unload the LoRA weights since they've already been fused with the underlying base model. Finally, call [`save_pretrained`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline.save_pretrained) to save the fused pipeline locally or you could call [`push_to_hub`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline.push_to_hub) to push the fused pipeline to the Hub.
+
+```py
+pipeline.unload_lora_weights()
+# save locally
+pipeline.save_pretrained("path/to/fused-pipeline")
+# save to the Hub
+pipeline.push_to_hub("fused-ikea-feng")
+```
+
+Now you can quickly load the fused pipeline and use it for inference without needing to separately load the LoRA adapters.
+ +```py +pipeline = DiffusionPipeline.from_pretrained( + "username/fused-ikea-feng", mindspore_dtype=ms.float16, +) + +image = pipeline("A bowl of ramen shaped like a cute kawaii bear, by Feng Zikai", generator=np.random.Generator(np.random.PCG64(0)))[0][0] +image +``` + +You can call [`unfuse_lora`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/lora/#mindone.diffusers.loaders.lora.LoraLoaderMixin.unfuse_lora) to restore the original model's weights (for example, if you want to use a different `lora_scale` value). However, this only works if you've only fused one LoRA adapter to the original model. If you've fused multiple LoRAs, you'll need to reload the model. + +```py +pipeline.unfuse_lora() +``` diff --git a/docs/diffusers/using-diffusers/other-formats.md b/docs/diffusers/using-diffusers/other-formats.md new file mode 100644 index 0000000000..873922cd0e --- /dev/null +++ b/docs/diffusers/using-diffusers/other-formats.md @@ -0,0 +1,435 @@ + + +# Model files and layouts + +Diffusion models are saved in various file types and organized in different layouts. Diffusers stores model weights as safetensors files in *Diffusers-multifolder* layout and it also supports loading files (like safetensors and ckpt files) from a *single-file* layout which is commonly used in the diffusion ecosystem. + +Each layout has its own benefits and use cases, and this guide will show you how to load the different files and layouts, and how to convert them. + +## Files + +PyTorch model weights are typically saved with Python's [pickle](https://docs.python.org/3/library/pickle.html) utility as ckpt or bin files. However, pickle is not secure and pickled files may contain malicious code that can be executed. This vulnerability is a serious concern given the popularity of model sharing. To address this security issue, the [Safetensors](https://hf.co/docs/safetensors) library was developed as a secure alternative to pickle, which saves models as safetensors files. + +### safetensors + +!!! tip + + Learn more about the design decisions and why safetensor files are preferred for saving and loading model weights in the [Safetensors audited as really safe and becoming the default](https://blog.eleuther.ai/safetensors-security-audit/) blog post. + +[Safetensors](https://hf.co/docs/safetensors) is a safe and fast file format for securely storing and loading tensors. Safetensors restricts the header size to limit certain types of attacks, supports lazy loading (useful for distributed setups), and has generally faster loading speeds. + +Make sure you have the [Safetensors](https://hf.co/docs/safetensors) library installed. + +```py +!pip install safetensors +``` + +Safetensors stores weights in a safetensors file. Diffusers loads safetensors files by default if they're available and the Safetensors library is installed. There are two ways safetensors files can be organized: + +1. Diffusers-multifolder layout: there may be several separate safetensors files, one for each pipeline component (text encoder, UNet, VAE), organized in subfolders (check out the [stable-diffusion-v1-5/stable-diffusion-v1-5](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/tree/main) repository as an example) +2. 
single-file layout: all the model weights may be saved in a single file (check out the [WarriorMama777/OrangeMixs](https://hf.co/WarriorMama777/OrangeMixs/tree/main/Models/AbyssOrangeMix) repository as an example) + +=== "multifolder" + + Use the [`from_pretrained`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline.from_pretrained) method to load a model with safetensors files stored in multiple folders. + + ```py + from mindone.diffusers import DiffusionPipeline + + pipeline = DiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", + use_safetensors=True + ) + ``` + +=== "single file" + + Use the [`from_single_file`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/single_file/#mindone.diffusers.loaders.single_file.FromSingleFileMixin.from_single_file) method to load a model with all the weights stored in a single safetensors file. + + ```py + from mindone.diffusers import StableDiffusionPipeline + + pipeline = StableDiffusionPipeline.from_single_file( + "https://huggingface.co/WarriorMama777/OrangeMixs/blob/main/Models/AbyssOrangeMix/AbyssOrangeMix.safetensors" + ) + ``` + +#### LoRA files + +[LoRA](https://hf.co/docs/peft/conceptual_guides/adapter#low-rank-adaptation-lora) is a lightweight adapter that is fast and easy to train, making them especially popular for generating images in a certain way or style. These adapters are commonly stored in a safetensors file, and are widely popular on model sharing platforms like [civitai](https://civitai.com/). + +LoRAs are loaded into a base model with the [`load_lora_weights`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/lora/#mindone.diffusers.loaders.lora.LoraLoaderMixin.load_lora_weights) method. + +```py +from mindone.diffusers import StableDiffusionXLPipeline +import mindspore as ms +import numpy as np + +# base model +pipeline = StableDiffusionXLPipeline.from_pretrained( + "Lykon/dreamshaper-xl-1-0", mindspore_dtype=ms.float16, variant="fp16" +) + +# download LoRA weights +!wget https://civitai.com/api/download/models/168776 -O blueprintify.safetensors + +# load LoRA weights +pipeline.load_lora_weights(".", weight_name="blueprintify.safetensors") +prompt = "bl3uprint, a highly detailed blueprint of the empire state building, explaining how to build all parts, many txt, blueprint grid backdrop" +negative_prompt = "lowres, cropped, worst quality, low quality, normal quality, artifacts, signature, watermark, username, blurry, more than one bridge, bad architecture" + +image = pipeline( + prompt=prompt, + negative_prompt=negative_prompt, + generator=np.random.Generator(np.random.PCG64(0)), +)[0][0] +image +``` + +
+ +
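+Regardless of which layout a checkpoint uses, you can inspect a safetensors file without instantiating a model. A small sketch using the Safetensors library installed earlier, applied to the LoRA file downloaded above (the exact tensor names depend on the trainer that produced the file):
+
+```py
+from safetensors.numpy import load_file
+
+# returns a plain dict mapping tensor names to NumPy arrays
+state_dict = load_file("blueprintify.safetensors")
+print(len(state_dict), "tensors")
+print(sorted(state_dict)[0])  # e.g. one of the LoRA down/up projection keys
+```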
+ +### Bin files + +MindONE.diffusers currently does not support loading `.bin` files using the [`from_pretrained`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline.from_pretrained) method. If the models in the [Hub](https://huggingface.co/models) consist solely of `.bin` files, we recommend utilizing the following method to convert them into `safetensors` format. + +!!! tip + + For any issues with the Space, please refer to the [tutorial](https://huggingface.co/docs/safetensors/main/en/convert-weights#convert-weights-to-safetensors). + +Use the [Convert Space](https://huggingface.co/spaces/safetensors/convert). The Convert Space downloads the pickled weights, converts them, and opens a Pull Request to upload the newly converted `.safetensors` file to the repository. Then, you can set `revision` to load the model. For example, load [facebook/DiT-XL-2-256](https://huggingface.co/facebook/DiT-XL-2-256) checkpoint: + +```diff + from mindone.diffusers import DiTPipeline + import mindspore as ms + + pipe = DiTPipeline.from_pretrained( + "facebook/DiT-XL-2-256", + mindspore_dtype=ms.float16, ++ revision="refs/pr/1" + ) +``` + +## Storage layout + +There are two ways model files are organized, either in a Diffusers-multifolder layout or in a single-file layout. The Diffusers-multifolder layout is the default, and each component file (text encoder, UNet, VAE) is stored in a separate subfolder. Diffusers also supports loading models from a single-file layout where all the components are bundled together. + +### Diffusers-multifolder + +The Diffusers-multifolder layout is the default storage layout for Diffusers. Each component's (text encoder, UNet, VAE) weights are stored in a separate subfolder. The weights can be stored as safetensors or ckpt files. + +
+*multifolder layout*
+
+*UNet subfolder*
+ +To load from Diffusers-multifolder layout, use the [`from_pretrained`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline.from_pretrained) method. + +```py +from mindone.diffusers import DiffusionPipeline + +pipeline = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + mindspore_dtype=ms.float16, + use_safetensors=True, +) +``` + +Benefits of using the Diffusers-multifolder layout include: + +1. Faster to load each component file individually or in parallel. + +```py +import mindspore as ms +from mindone.diffusers import StableDiffusionXLPipeline, UNet2DConditionModel, EulerDiscreteScheduler + +# download one model +sdxl_pipeline = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + mindspore_dtype=ms.float16, + use_safetensors=True, +) + +# switch UNet for another model +unet = UNet2DConditionModel.from_pretrained( + "stabilityai/sdxl-turbo", + subfolder="unet", + mindspore_dtype=ms.float16, + variant="fp16", + use_safetensors=True +) +``` + +2. Reduced storage requirements because if a component, such as the SDXL [VAE](https://hf.co/madebyollin/sdxl-vae-fp16-fix), is shared across multiple models, you only need to download and store a single copy of it instead of downloading and storing it multiple times. For 10 SDXL models, this can save ~3.5GB of storage. The storage savings is even greater for newer models like PixArt Sigma, where the [text encoder](https://hf.co/PixArt-alpha/PixArt-Sigma-XL-2-1024-MS/tree/main/text_encoder) alone is ~19GB! +3. Flexibility to replace a component in the model with a newer or better version. + +```py +from mindone.diffusers import DiffusionPipeline, AutoencoderKL + +vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", mindspore_dtype=ms.float16, use_safetensors=True) +pipeline = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + vae=vae, + mindspore_dtype=ms.float16, + use_safetensors=True, +) +``` + +4. More visibility and information about a model's components, which are stored in a [config.json](https://hf.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/unet/config.json) file in each component subfolder. + +### Single-file + +The single-file layout stores all the model weights in a single file. All the model components (text encoder, UNet, VAE) weights are kept together instead of separately in subfolders. This can be a safetensors or ckpt file. + +
+ +
+
+To load from a single-file layout, use the [`from_single_file`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/single_file/#mindone.diffusers.loaders.single_file.FromSingleFileMixin.from_single_file) method.
+
+```py
+import mindspore as ms
+from mindone.diffusers import StableDiffusionXLPipeline
+
+pipeline = StableDiffusionXLPipeline.from_single_file(
+    "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_base_1.0.safetensors",
+    mindspore_dtype=ms.float16,
+    variant="fp16",
+    use_safetensors=True,
+)
+```
+
+Benefits of using a single-file layout include:
+
+1. Easy compatibility with diffusion interfaces such as [ComfyUI](https://github.com/comfyanonymous/ComfyUI) or [Automatic1111](https://github.com/AUTOMATIC1111/stable-diffusion-webui) which commonly use a single-file layout.
+2. Easier to manage (download and share) a single file.
+
+## Convert layout and files
+
+Diffusers provides many scripts and methods to convert storage layouts and file formats to enable broader support across the diffusion ecosystem.
+
+You can save a model to Diffusers-multifolder layout with the [`save_pretrained`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline.save_pretrained) method. This creates a directory for you if it doesn't already exist, and it also saves the files as a safetensors file by default.
+
+```py
+from mindone.diffusers import StableDiffusionXLPipeline
+
+pipeline = StableDiffusionXLPipeline.from_single_file(
+    "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_base_1.0.safetensors",
+)
+# pass the directory to save the converted Diffusers-multifolder layout to
+pipeline.save_pretrained("path/to/multifolder-layout")
+```
+
+Lastly, there are also Spaces, such as [SD To Diffusers](https://hf.co/spaces/diffusers/sd-to-diffusers) and [SD-XL To Diffusers](https://hf.co/spaces/diffusers/sdxl-to-diffusers), that provide a more user-friendly interface for converting models to Diffusers-multifolder layout. This is the easiest and most convenient option for converting layouts, and it'll open a PR on your model repository with the converted files. However, this option is not as reliable as running a script, and the Space may fail for more complicated models.
+
+## Single-file layout usage
+
+Now that you're familiar with the differences between the Diffusers-multifolder and single-file layout, this section shows you how to load models and pipeline components, customize configuration options for loading, and load local files with the [`from_single_file`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/single_file/#mindone.diffusers.loaders.single_file.FromSingleFileMixin.from_single_file) method.
+
+### Load a pipeline or model
+
+Pass the file path of the pipeline or model to the [`from_single_file`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/single_file/#mindone.diffusers.loaders.single_file.FromSingleFileMixin.from_single_file) method to load it.
+ +=== "pipeline" + + ```py + from mindone.diffusers import StableDiffusionXLPipeline + + ckpt_path = "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_base_1.0_0.9vae.safetensors" + pipeline = StableDiffusionXLPipeline.from_single_file(ckpt_path) + ``` + +=== "model" + + ```py + from mindone.diffusers import StableCascadeUNet + + ckpt_path = "https://huggingface.co/stabilityai/stable-cascade/blob/main/stage_b_lite.safetensors" + model = StableCascadeUNet.from_single_file(ckpt_path) + ``` + +Customize components in the pipeline by passing them directly to the [`from_single_file`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/single_file/#mindone.diffusers.loaders.single_file.FromSingleFileMixin.from_single_file) method. For example, you can use a different scheduler in a pipeline. + +```py +from mindone.diffusers import StableDiffusionXLPipeline, DDIMScheduler + +ckpt_path = "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_base_1.0_0.9vae.safetensors" +scheduler = DDIMScheduler() +pipeline = StableDiffusionXLPipeline.from_single_file(ckpt_path, scheduler=scheduler) +``` + +Or you could use a ControlNet model in the pipeline. + +```py +from mindone.diffusers import StableDiffusionControlNetPipeline, ControlNetModel + +ckpt_path = "https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/blob/main/v1-5-pruned-emaonly.safetensors" +controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_canny") +pipeline = StableDiffusionControlNetPipeline.from_single_file(ckpt_path, controlnet=controlnet) +``` + +### Customize configuration options + +Models have a configuration file that define their attributes like the number of inputs in a UNet. Pipelines configuration options are available in the pipeline's class. For example, if you look at the [`StableDiffusionXLInstructPix2PixPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/pix2pix/#mindone.diffusers.StableDiffusionXLInstructPix2PixPipeline) class, there is an option to scale the image latents with the `is_cosxl_edit` parameter. + +These configuration files can be found in the models Hub repository or another location from which the configuration file originated (for example, a GitHub repository or locally on your device). + +=== "Hub configuration file" + + !!! tip + + The [`from_single_file`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/single_file/#mindone.diffusers.loaders.single_file.FromSingleFileMixin.from_single_file) method automatically maps the checkpoint to the appropriate model repository, but there are cases where it is useful to use the `config` parameter. For example, if the model components in the checkpoint are different from the original checkpoint or if a checkpoint doesn't have the necessary metadata to correctly determine the configuration to use for the pipeline. + + The [`from_single_file`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/single_file/#mindone.diffusers.loaders.single_file.FromSingleFileMixin.from_single_file) method automatically determines the configuration to use from the configuration file in the model repository. You could also explicitly specify the configuration to use by providing the repository id to the `config` parameter. 
+ + ```py + from mindone.diffusers import StableDiffusionXLPipeline + + ckpt_path = "https://huggingface.co/segmind/SSD-1B/blob/main/SSD-1B.safetensors" + repo_id = "segmind/SSD-1B" + + pipeline = StableDiffusionXLPipeline.from_single_file(ckpt_path, config=repo_id) + ``` + + The model loads the configuration file for the [UNet](https://huggingface.co/segmind/SSD-1B/blob/main/unet/config.json), [VAE](https://huggingface.co/segmind/SSD-1B/blob/main/vae/config.json), and [text encoder](https://huggingface.co/segmind/SSD-1B/blob/main/text_encoder/config.json) from their respective subfolders in the repository. + +=== "original configuration file" + + The [`from_single_file`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/single_file/#mindone.diffusers.loaders.single_file.FromSingleFileMixin.from_single_file) method can also load the original configuration file of a pipeline that is stored elsewhere. Pass a local path or URL of the original configuration file to the `original_config` parameter. + + ```py + from mindone.diffusers import StableDiffusionXLPipeline + + ckpt_path = "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_base_1.0_0.9vae.safetensors" + original_config = "https://raw.githubusercontent.com/Stability-AI/generative-models/main/configs/inference/sd_xl_base.yaml" + + pipeline = StableDiffusionXLPipeline.from_single_file(ckpt_path, original_config=original_config) + ``` + + !!! tip + + Diffusers attempts to infer the pipeline components based on the type signatures of the pipeline class when you use `original_config` with `local_files_only=True`, instead of fetching the configuration files from the model repository on the Hub. This prevents backward breaking changes in code that can't connect to the internet to fetch the necessary configuration files. + + This is not as reliable as providing a path to a local model repository with the `config` parameter, and might lead to errors during pipeline configuration. To avoid errors, run the pipeline with `local_files_only=False` once to download the appropriate pipeline configuration files to the local cache. + +While the configuration files specify the pipeline or models default parameters, you can override them by providing the parameters directly to the [`from_single_file`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/single_file/#mindone.diffusers.loaders.single_file.FromSingleFileMixin.from_single_file) method. Any parameter supported by the model or pipeline class can be configured in this way. + +=== "pipeline" + + For example, to scale the image latents in [`StableDiffusionXLInstructPix2PixPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/pix2pix/#mindone.diffusers.StableDiffusionXLInstructPix2PixPipeline) pass the `force_zeros_for_empty_prompt` parameter. + + ```python + from mindone.diffusers import StableDiffusionXLInstructPix2PixPipeline + + ckpt_path = "https://huggingface.co/stabilityai/cosxl/blob/main/cosxl_edit.safetensors" + pipeline = StableDiffusionXLInstructPix2PixPipeline.from_single_file(ckpt_path, config="diffusers/sdxl-instructpix2pix-768", force_zeros_for_empty_prompt=True) + ``` + +=== "model" + + For example, to upcast the attention dimensions in a [`UNet2DConditionModel`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/unet2d-cond/#mindone.diffusers.UNet2DConditionModel) pass the `upcast_attention` parameter. 
+ + ```python + from mindone.diffusers import UNet2DConditionModel + + ckpt_path = "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_base_1.0_0.9vae.safetensors" + model = UNet2DConditionModel.from_single_file(ckpt_path, upcast_attention=True) + ``` + +### Local files + +In Diffusers>=v0.28.0, the [`from_single_file`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/single_file/#mindone.diffusers.loaders.single_file.FromSingleFileMixin.from_single_file) method attempts to configure a pipeline or model by inferring the model type from the keys in the checkpoint file. The inferred model type is used to determine the appropriate model repository on the Hugging Face Hub to configure the model or pipeline. + +For example, any single file checkpoint based on the Stable Diffusion XL base model will use the [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) model repository to configure the pipeline. + +But if you're working in an environment with restricted internet access, you should download the configuration files with the [`snapshot_download`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/file_download#huggingface_hub.snapshot_download) function, and the model checkpoint with the [`hf_hub_download`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/file_download#huggingface_hub.hf_hub_download) function. By default, these files are downloaded to the Hugging Face Hub [cache directory](https://huggingface.co/docs/huggingface_hub/en/guides/manage-cache), but you can specify a preferred directory to download the files to with the `local_dir` parameter. + +Pass the configuration and checkpoint paths to the [`from_single_file`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/single_file/#mindone.diffusers.loaders.single_file.FromSingleFileMixin.from_single_file) method to load locally. + +=== "Hub cache directory" + + ```python + from huggingface_hub import hf_hub_download, snapshot_download + + my_local_checkpoint_path = hf_hub_download( + repo_id="segmind/SSD-1B", + filename="SSD-1B.safetensors" + ) + + my_local_config_path = snapshot_download( + repo_id="segmind/SSD-1B", + allowed_patterns=["*.json", "**/*.json", "*.txt", "**/*.txt"] + ) + + pipeline = StableDiffusionXLPipeline.from_single_file(my_local_checkpoint_path, config=my_local_config_path, local_files_only=True) + ``` + +=== "specific local directory" + + ```python + from huggingface_hub import hf_hub_download, snapshot_download + + my_local_checkpoint_path = hf_hub_download( + repo_id="segmind/SSD-1B", + filename="SSD-1B.safetensors", + local_dir="my_local_checkpoints", + ) + + my_local_config_path = snapshot_download( + repo_id="segmind/SSD-1B", + allowed_patterns=["*.json", "**/*.json", "*.txt", "**/*.txt"], + local_dir="my_local_config", + ) + + pipeline = StableDiffusionXLPipeline.from_single_file(my_local_checkpoint_path, config=my_local_config_path, local_files_only=True) + ``` + +#### Local files without symlink + +!!! tip + + In huggingface_hub>=v0.23.0, the `local_dir_use_symlinks` argument isn't necessary for the [`hf_hub_download`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/file_download#huggingface_hub.hf_hub_download) and [`snapshot_download`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/file_download#huggingface_hub.snapshot_download) functions. 
+ +The [`from_single_file`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/single_file/#mindone.diffusers.loaders.single_file.FromSingleFileMixin.from_single_file) method relies on the [huggingface_hub](https://hf.co/docs/huggingface_hub/index) caching mechanism to fetch and store checkpoints and configuration files for models and pipelines. If you're working with a file system that does not support symlinking, you should download the checkpoint file to a local directory first, and disable symlinking with the `local_dir_use_symlink=False` parameter in the [`hf_hub_download`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/file_download#huggingface_hub.hf_hub_download) function and [`snapshot_download`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/file_download#huggingface_hub.snapshot_download) functions. + +```python +from huggingface_hub import hf_hub_download, snapshot_download + +my_local_checkpoint_path = hf_hub_download( + repo_id="segmind/SSD-1B", + filename="SSD-1B.safetensors", + local_dir="my_local_checkpoints", + local_dir_use_symlinks=False, +) +print("My local checkpoint: ", my_local_checkpoint_path) + +my_local_config_path = snapshot_download( + repo_id="segmind/SSD-1B", + allowed_patterns=["*.json", "**/*.json", "*.txt", "**/*.txt"], + local_dir_use_symlinks=False, +) +print("My local config: ", my_local_config_path) + +``` + +Then you can pass the local paths to the `pretrained_model_link_or_path` and `config` parameters. + +```python +pipeline = StableDiffusionXLPipeline.from_single_file(my_local_checkpoint_path, config=my_local_config_path, local_files_only=True) +``` diff --git a/docs/diffusers/using-diffusers/overview_techniques.md b/docs/diffusers/using-diffusers/overview_techniques.md new file mode 100644 index 0000000000..625b2c4eda --- /dev/null +++ b/docs/diffusers/using-diffusers/overview_techniques.md @@ -0,0 +1,18 @@ + + +# Overview + +The inference pipeline supports and enables a wide range of techniques that are divided into two categories: + +* Pipeline functionality: these techniques modify the pipeline or extend it for other applications. For example, pipeline callbacks add new features to a pipeline. +* Improve inference quality: these techniques increase the visual quality of the generated images. For example, you can use scheduler features to improve inference quality to create better images with lower effort. diff --git a/docs/diffusers/using-diffusers/push_to_hub.md b/docs/diffusers/using-diffusers/push_to_hub.md new file mode 100644 index 0000000000..1688e451e3 --- /dev/null +++ b/docs/diffusers/using-diffusers/push_to_hub.md @@ -0,0 +1,174 @@ + + +# Push files to the Hub + +🤗 Diffusers provides a [`PushToHubMixin`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/overview/#mindone.diffusers.utils.PushToHubMixin) for uploading your model, scheduler, or pipeline to the Hub. It is an easy way to store your files on the Hub, and also allows you to share your work with others. Under the hood, the [`PushToHubMixin`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/overview/#mindone.diffusers.utils.PushToHubMixin): + +1. creates a repository on the Hub +2. saves your model, scheduler, or pipeline files so they can be reloaded later +3. 
uploads folder containing these files to the Hub + +This guide will show you how to use the [`PushToHubMixin`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/overview/#mindone.diffusers.utils.PushToHubMixin) to upload your files to the Hub. + +You'll need to log in to your Hub account with your access [token](https://huggingface.co/settings/tokens) first: + +```py +from huggingface_hub import notebook_login + +notebook_login() +``` + +## Models + +To push a model to the Hub, call [`push_to_hub`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/overview/#mindone.diffusers.utils.PushToHubMixin.push_to_hub) and specify the repository id of the model to be stored on the Hub: + +```py +from mindone.diffusers import ControlNetModel + +controlnet = ControlNetModel( + block_out_channels=(32, 64), + layers_per_block=2, + in_channels=4, + down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"), + cross_attention_dim=32, + conditioning_embedding_out_channels=(16, 32), +) +controlnet.push_to_hub("my-controlnet-model") +``` + +For models, you can also specify the [*variant*](loading.md#checkpoint-variants) of the weights to push to the Hub. For example, to push `fp16` weights: + +```py +controlnet.push_to_hub("my-controlnet-model", variant="fp16") +``` + +The [`push_to_hub`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/overview/#mindone.diffusers.utils.PushToHubMixin.push_to_hub) function saves the model's `config.json` file and the weights are automatically saved in the `safetensors` format. + +Now you can reload the model from your repository on the Hub: + +```py +model = ControlNetModel.from_pretrained("your-namespace/my-controlnet-model") +``` + +## Scheduler + +To push a scheduler to the Hub, call [`push_to_hub`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/overview/#mindone.diffusers.utils.PushToHubMixin.push_to_hub) and specify the repository id of the scheduler to be stored on the Hub: + +```py +from mindone.diffusers import DDIMScheduler + +scheduler = DDIMScheduler( + beta_start=0.00085, + beta_end=0.012, + beta_schedule="scaled_linear", + clip_sample=False, + set_alpha_to_one=False, +) +scheduler.push_to_hub("my-controlnet-scheduler") +``` + +The [`push_to_hub`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/overview/#mindone.diffusers.utils.PushToHubMixin.push_to_hub) function saves the scheduler's `scheduler_config.json` file to the specified repository. + +Now you can reload the scheduler from your repository on the Hub: + +```py +scheduler = DDIMScheduler.from_pretrained("your-namepsace/my-controlnet-scheduler") +``` + +You can also push an entire pipeline with all it's components to the Hub. 
For example, initialize the components of a [`StableDiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/text2img/#mindone.diffusers.StableDiffusionPipeline) with the parameters you want: + +```py +from mindone.diffusers import ( + UNet2DConditionModel, + AutoencoderKL, + DDIMScheduler, + StableDiffusionPipeline, +) +from mindone.transformers import CLIPTextModel +from transformers import CLIPTextConfig, CLIPTokenizer + +unet = UNet2DConditionModel( + block_out_channels=(32, 64), + layers_per_block=2, + sample_size=32, + in_channels=4, + out_channels=4, + down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"), + up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"), + cross_attention_dim=32, +) + +scheduler = DDIMScheduler( + beta_start=0.00085, + beta_end=0.012, + beta_schedule="scaled_linear", + clip_sample=False, + set_alpha_to_one=False, +) + +vae = AutoencoderKL( + block_out_channels=[32, 64], + in_channels=3, + out_channels=3, + down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"], + up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"], + latent_channels=4, +) + +text_encoder_config = CLIPTextConfig( + bos_token_id=0, + eos_token_id=2, + hidden_size=32, + intermediate_size=37, + layer_norm_eps=1e-05, + num_attention_heads=4, + num_hidden_layers=5, + pad_token_id=1, + vocab_size=1000, +) +text_encoder = CLIPTextModel(text_encoder_config) +tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip") +``` + +Pass all of the components to the [`StableDiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/text2img/#mindone.diffusers.StableDiffusionPipeline) and call [`push_to_hub`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/overview/#mindone.diffusers.utils.PushToHubMixin.push_to_hub) to push the pipeline to the Hub: + +```py +components = { + "unet": unet, + "scheduler": scheduler, + "vae": vae, + "text_encoder": text_encoder, + "tokenizer": tokenizer, + "safety_checker": None, + "feature_extractor": None, +} + +pipeline = StableDiffusionPipeline(**components) +pipeline.push_to_hub("my-pipeline") +``` + +The [`push_to_hub`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/overview/#mindone.diffusers.utils.PushToHubMixin.push_to_hub) function saves each component to a subfolder in the repository. Now you can reload the pipeline from your repository on the Hub: + +```py +pipeline = StableDiffusionPipeline.from_pretrained("your-namespace/my-pipeline") +``` + +## Privacy + +Set `private=True` in the [`push_to_hub`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/overview/#mindone.diffusers.utils.PushToHubMixin.push_to_hub) function to keep your model, scheduler, or pipeline files private: + +```py +controlnet.push_to_hub("my-controlnet-model-private", private=True) +``` + +Private repositories are only visible to you, and other users won't be able to clone the repository and your repository won't appear in search results. Even if a user has the URL to your private repository, they'll receive a `404 - Sorry, we can't find the page you are looking for`. You must be [logged in](https://huggingface.co/docs/huggingface_hub/quick-start#login) to load a model from a private repository. 
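+For example, here is a minimal sketch of reloading the private model pushed above after logging in; the repository id is the same placeholder used earlier in this guide:
+
+```py
+from huggingface_hub import login
+from mindone.diffusers import ControlNetModel
+
+# authenticate with your Hub access token so private repositories become visible
+login()
+
+# only accounts with access to the private repository can load it
+controlnet = ControlNetModel.from_pretrained("your-namespace/my-controlnet-model-private")
+```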
diff --git a/docs/diffusers/using-diffusers/reusing_seeds.md b/docs/diffusers/using-diffusers/reusing_seeds.md
new file mode 100644
index 0000000000..b25f97bdbf
--- /dev/null
+++ b/docs/diffusers/using-diffusers/reusing_seeds.md
@@ -0,0 +1,104 @@
+
+
+# Reproducible pipelines
+
+Diffusion models are inherently random, which is what allows them to generate different outputs every time they are run. But there are certain times when you want to generate the same output every time, like when you're testing, replicating results, and even [improving image quality](#deterministic-batch-generation). While you can't expect to get identical results across platforms, you can expect reproducible results across releases and platforms within a certain tolerance range (though even this may vary).
+
+This guide will show you how to control randomness for deterministic generation on Ascend hardware.
+
+## Control randomness
+
+During inference, pipelines rely heavily on random sampling operations which include creating the Gaussian noise tensors to denoise and adding noise to the scheduling step.
+
+Take a look at the tensor values in the [`DDIMPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/ddim/#mindone.diffusers.DDIMPipeline) after two inference steps.
+
+```python
+from mindone.diffusers import DDIMPipeline
+import numpy as np
+
+ddim = DDIMPipeline.from_pretrained("google/ddpm-cifar10-32", use_safetensors=True)
+image = ddim(num_inference_steps=2, output_type="np")[0]
+print(np.abs(image).sum())
+```
+
+Running the code above prints one value, but if you run it again you get a different value.
+
+Each time the pipeline is run, [numpy.random.Generator.standard_normal](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.standard_normal.html) uses a different random seed to create the Gaussian noise tensors. This leads to a different result each time it is run and enables the diffusion pipeline to generate a different random image each time.
+
+But if you need to reliably generate the same image, Diffusers has a [`randn_tensor`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/utilities/#mindone.diffusers.utils.mindspore_utils.randn_tensor) function that creates the random noise with NumPy and then converts the array to a tensor. The [`randn_tensor`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/utilities/#mindone.diffusers.utils.mindspore_utils.randn_tensor) function is used everywhere inside the pipeline. Now you can create a [numpy.random.Generator](https://numpy.org/doc/stable/reference/random/generator.html) with a fixed seed and pass it to the pipeline.
+
+```python
+import numpy as np
+from mindone.diffusers import DDIMPipeline
+
+ddim = DDIMPipeline.from_pretrained("google/ddpm-cifar10-32", use_safetensors=True)
+generator = np.random.Generator(np.random.PCG64(0))
+image = ddim(num_inference_steps=2, output_type="np", generator=generator)[0]
+print(np.abs(image).sum())
+```
+
+Finally, more complex pipelines such as the [`UnCLIPPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/unclip/#mindone.diffusers.UnCLIPPipeline) are often extremely susceptible to precision error propagation. You'll need to use exactly the same hardware and MindSpore version for full reproducibility.
+
+## Deterministic batch generation
+
+A practical application of creating reproducible pipelines is *deterministic batch generation*.
You generate a batch of images and select one image to improve with a more detailed prompt. The main idea is to pass a list of [Generator's](https://numpy.org/doc/stable/reference/random/generator.html) to the pipeline and tie each `Generator` to a seed so you can reuse it. + +Let's use the [stable-diffusion-v1-5/stable-diffusion-v1-5](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) checkpoint and generate a batch of images. + +```py +import mindspore as ms +from mindone.diffusers import DiffusionPipeline +from mindone.diffusers.utils import make_image_grid +import numpy as np + +pipeline = DiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", mindspore_dtype=ms.float16, use_safetensors=True +) +``` + +Define four different `Generator`s and assign each `Generator` a seed (`0` to `3`). Then generate a batch of images and pick one to iterate on. + +!!! warning + + Use a list comprehension that iterates over the batch size specified in `range()` to create a unique `Generator` object for each image in the batch. If you multiply the `Generator` by the batch size integer, it only creates *one* `Generator` object that is used sequentially for each image in the batch. + + ```py + [np.random.Generator(np.random.PCG64(seed))] * 4 + ``` + +```python +generator = [np.random.Generator(np.random.PCG64(i)) for i in range(4)] +prompt = "Labrador in the style of Vermeer" +images = pipeline(prompt, generator=generator, num_images_per_prompt=4)[0] +make_image_grid(images, rows=2, cols=2) +``` + +
+ +
+ +Let's improve the first image (you can choose any image you want) which corresponds to the `Generator` with seed `0`. Add some additional text to your prompt and then make sure you reuse the same `Generator` with seed `0`. All the generated images should resemble the first image. + +```python +prompt = [prompt + t for t in [", highly realistic", ", artsy", ", trending", ", colorful"]] +generator = [np.random.Generator(np.random.PCG64(0)) for i in range(4)] +images = pipeline(prompt, generator=generator)[0] +make_image_grid(images, rows=2, cols=2) +``` + +
+ +
diff --git a/docs/diffusers/using-diffusers/scheduler_features.md b/docs/diffusers/using-diffusers/scheduler_features.md new file mode 100644 index 0000000000..284123ad9b --- /dev/null +++ b/docs/diffusers/using-diffusers/scheduler_features.md @@ -0,0 +1,236 @@ + + +# Scheduler features + +The scheduler is an important component of any diffusion model because it controls the entire denoising (or sampling) process. There are many types of schedulers, some are optimized for speed and some for quality. With Diffusers, you can modify the scheduler configuration to use custom noise schedules, sigmas, and rescale the noise schedule. Changing these parameters can have profound effects on inference quality and speed. + +This guide will demonstrate how to use these features to improve inference quality. + +!!! tip + + Diffusers currently only supports the `timesteps` and `sigmas` parameters for a select list of schedulers and pipelines. + +## Timestep schedules + +The timestep or noise schedule determines the amount of noise at each sampling step. The scheduler uses this to generate an image with the corresponding amount of noise at each step. The timestep schedule is generated from the scheduler's default configuration, but you can customize the scheduler to use new and optimized sampling schedules that aren't in Diffusers yet. + +For example, [Align Your Steps (AYS)](https://research.nvidia.com/labs/toronto-ai/AlignYourSteps/) is a method for optimizing a sampling schedule to generate a high-quality image in as little as 10 steps. The optimal `10-step schedule` for Stable Diffusion XL is: + +```py +sampling_schedule = [999, 845, 730, 587, 443, 310, 193, 116, 53, 13] +``` + +You can use the AYS sampling schedule in a pipeline by passing it to the `timesteps` parameter. + +```py +pipeline = StableDiffusionXLPipeline.from_pretrained( + "SG161222/RealVisXL_V4.0", + mindspore_dtype=ms.float16, + variant="fp16", +) +pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config, algorithm_type="sde-dpmsolver++") + +prompt = "A cinematic shot of a cute little rabbit wearing a jacket and doing a thumbs up" +generator = np.random.Generator(np.random.PCG64(2487854446)) +image = pipeline( + prompt=prompt, + negative_prompt="", + generator=generator, + timesteps=sampling_schedule, +)[0][0] +``` + +
+(Figure: generated images comparing the AYS timestep schedule (10 steps), a linearly-spaced timestep schedule (10 steps), and a linearly-spaced timestep schedule (25 steps).)
+ +## Timestep spacing + +The way sample steps are selected in the schedule can affect the quality of the generated image, especially with respect to [rescaling the noise schedule](#rescale-noise-schedule), which can enable a model to generate much brighter or darker images. Diffusers provides three timestep spacing methods: + +- `leading` creates evenly spaced steps +- `linspace` includes the first and last steps and evenly selects the remaining intermediate steps +- `trailing` only includes the last step and evenly selects the remaining intermediate steps starting from the end + +It is recommended to use the `trailing` spacing method because it generates higher quality images with more details when there are fewer sample steps. But the difference in quality is not as obvious for more standard sample step values. + +```py +import mindspore as ms +from mindone.diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler +import numpy as np + +pipeline = StableDiffusionXLPipeline.from_pretrained( + "SG161222/RealVisXL_V4.0", + mindspore_dtype=ms.float16, + variant="fp16", +) +pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config, timestep_spacing="trailing") + +prompt = "A cinematic shot of a cute little black cat sitting on a pumpkin at night" +generator = np.random.Generator(np.random.PCG64(2487854446)) +image = pipeline( + prompt=prompt, + negative_prompt="", + generator=generator, + num_inference_steps=5, +)[0][0] +image +``` + +
+(Figure: generated images comparing trailing spacing after 5 steps and leading spacing after 5 steps.)
+ +## Sigmas + +The `sigmas` parameter is the amount of noise added at each timestep according to the timestep schedule. Like the `timesteps` parameter, you can customize the `sigmas` parameter to control how much noise is added at each step. When you use a custom `sigmas` value, the `timesteps` are calculated from the custom `sigmas` value and the default scheduler configuration is ignored. + +For example, you can manually pass the `sigmas` for something like the 10-step AYS schedule from before to the pipeline. + +```py +import mindspore as ms +import numpy as np + +from mindone.diffusers import DiffusionPipeline, EulerDiscreteScheduler + +model_id = "stabilityai/stable-diffusion-xl-base-1.0" +pipeline = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + mindspore_dtype=ms.float16, +) +pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config) + +sigmas = [14.615, 6.315, 3.771, 2.181, 1.342, 0.862, 0.555, 0.380, 0.234, 0.113, 0.0] +prompt = "anthropomorphic capybara wearing a suit and working with a computer" +generator = np.random.Generator(np.random.PCG64(123)) +image = pipeline( + prompt=prompt, + num_inference_steps=10, + sigmas=sigmas, + generator=generator +)[0][0] +``` + +When you take a look at the scheduler's `timesteps` parameter, you'll see that it is the same as the AYS timestep schedule because the `timestep` schedule is calculated from the `sigmas`. + +```py +print(f" timesteps: {pipeline.scheduler.timesteps}") +"timesteps: [999., 845., 730., 587., 443., 310., 193., 116., 53., 13.]" +``` + +### Karras sigmas + +!!! tip + + Refer to the scheduler API [overview](../api/schedulers/overview.md) for a list of schedulers that support Karras sigmas. + + Karras sigmas should not be used for models that weren't trained with them. For example, the base Stable Diffusion XL model shouldn't use Karras sigmas but the [DreamShaperXL](https://hf.co/Lykon/dreamshaper-xl-1-0) model can since they are trained with Karras sigmas. + +Karras scheduler's use the timestep schedule and sigmas from the [Elucidating the Design Space of Diffusion-Based Generative Models](https://hf.co/papers/2206.00364) paper. This scheduler variant applies a smaller amount of noise per step as it approaches the end of the sampling process compared to other schedulers, and can increase the level of details in the generated image. + +Enable Karras sigmas by setting `use_karras_sigmas=True` in the scheduler. + +```py +import mindspore as ms +from mindone.diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler +import numpy as np + +pipeline = StableDiffusionXLPipeline.from_pretrained( + "SG161222/RealVisXL_V4.0", + mindspore_dtype=ms.float16, + variant="fp16", +) +pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config, algorithm_type="sde-dpmsolver++", use_karras_sigmas=True) + +prompt = "A cinematic shot of a cute little rabbit wearing a jacket and doing a thumbs up" +generator = np.random.Generator(np.random.PCG64(2487854446)) +image = pipeline( + prompt=prompt, + negative_prompt="", + generator=generator, +)[0][0] +``` + +
+(Figure: generated images with Karras sigmas enabled versus Karras sigmas disabled.)
+ +## Rescale noise schedule + +In the [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://hf.co/papers/2305.08891) paper, the authors discovered that common noise schedules allowed some signal to leak into the last timestep. This signal leakage at inference can cause models to only generate images with medium brightness. By enforcing a zero signal-to-noise ratio (SNR) for the timstep schedule and sampling from the last timestep, the model can be improved to generate very bright or dark images. + +!!! tip + + For inference, you need a model that has been trained with *v_prediction*. To train your own model with *v_prediction*, add the following flag to the [train_text_to_image.py](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image.py) or [train_text_to_image_lora.py](https://github.com/mindspore-lab/mindone/blob/master/examples/diffusers/text_to_image/train_text_to_image_lora.py) scripts. + + ```bash + --prediction_type="v_prediction" + ``` + +For example, load the [ptx0/pseudo-journey-v2](https://hf.co/ptx0/pseudo-journey-v2) checkpoint which was trained with `v_prediction` and the [`DDIMScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/ddim/#mindone.diffusers.DDIMScheduler). Configure the following parameters in the [`DDIMScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/ddim/#mindone.diffusers.DDIMScheduler): + +* `rescale_betas_zero_snr=True` to rescale the noise schedule to zero SNR +* `timestep_spacing="trailing"` to start sampling from the last timestep + +Set `guidance_rescale` in the pipeline to prevent over-exposure. A lower value increases brightness but some of the details may appear washed out. + +```py +from mindone.diffusers import DiffusionPipeline, DDIMScheduler +import numpy as np + +pipeline = DiffusionPipeline.from_pretrained("ptx0/pseudo-journey-v2", use_safetensors=True) + +pipeline.scheduler = DDIMScheduler.from_config( + pipeline.scheduler.config, rescale_betas_zero_snr=True, timestep_spacing="trailing" +) +prompt = "cinematic photo of a snowy mountain at night with the northern lights aurora borealis overhead, 35mm photograph, film, professional, 4k, highly detailed" +generator = np.random.Generator(np.random.PCG64(23)) +image = pipeline(prompt, guidance_rescale=0.7, generator=generator)[0][0] +image +``` + +
+(Figure: default Stable Diffusion v2-1 image versus image with zero SNR and trailing timestep spacing enabled.)
diff --git a/docs/diffusers/using-diffusers/schedulers.md b/docs/diffusers/using-diffusers/schedulers.md new file mode 100644 index 0000000000..e5b61b4a62 --- /dev/null +++ b/docs/diffusers/using-diffusers/schedulers.md @@ -0,0 +1,190 @@ + + +# Load schedulers and models + +Diffusion pipelines are a collection of interchangeable schedulers and models that can be mixed and matched to tailor a pipeline to a specific use case. The scheduler encapsulates the entire denoising process such as the number of denoising steps and the algorithm for finding the denoised sample. A scheduler is not parameterized or trained so they don't take very much memory. The model is usually only concerned with the forward pass of going from a noisy input to a less noisy sample. + +This guide will show you how to load schedulers and models to customize a pipeline. You'll use the [stable-diffusion-v1-5/stable-diffusion-v1-5](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) checkpoint throughout this guide, so let's load it first. + +```py +import mindspore as ms +from mindone.diffusers import DiffusionPipeline + +pipeline = DiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", mindspore_dtype=ms.float16, use_safetensors=True +) +``` + +You can see what scheduler this pipeline uses with the `pipeline.scheduler` attribute. + +```py +pipeline.scheduler +PNDMScheduler { + "_class_name": "PNDMScheduler", + "_diffusers_version": "0.29.2", + "beta_end": 0.012, + "beta_schedule": "scaled_linear", + "beta_start": 0.00085, + "clip_sample": false, + "num_train_timesteps": 1000, + "prediction_type": "epsilon", + "set_alpha_to_one": false, + "skip_prk_steps": true, + "steps_offset": 1, + "timestep_spacing": "leading", + "trained_betas": null +} +``` + +## Load a scheduler + +Schedulers are defined by a configuration file that can be used by a variety of schedulers. Load a scheduler with the [`from_pretrained`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/overview/#mindone.diffusers.SchedulerMixin.from_pretrained) method, and specify the `subfolder` parameter to load the configuration file into the correct subfolder of the pipeline repository. + +For example, to load the [`DDIMScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/ddim/#mindone.diffusers.DDIMScheduler): + +```py +from mindone.diffusers import DDIMScheduler, DiffusionPipeline + +ddim = DDIMScheduler.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", subfolder="scheduler") +``` + +Then you can pass the newly loaded scheduler to the pipeline. + +```python +pipeline = DiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", scheduler=ddim, mindspore_dtype=ms.float16, use_safetensors=True +) +``` + +## Compare schedulers + +Schedulers have their own unique strengths and weaknesses, making it difficult to quantitatively compare which scheduler works best for a pipeline. You typically have to make a trade-off between denoising speed and denoising quality. We recommend trying out different schedulers to find one that works best for your use case. Call the `pipeline.scheduler.compatibles` attribute to see what schedulers are compatible with a pipeline. 
+ +Let's compare the [`LMSDiscreteScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/lms_discrete/#mindone.diffusers.LMSDiscreteScheduler), [`EulerDiscreteScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/euler/#mindone.diffusers.EulerDiscreteScheduler), [`EulerAncestralDiscreteScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/euler_ancestral/#mindone.diffusers.EulerAncestralDiscreteScheduler), and the [`DPMSolverMultistepScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/multistep_dpm_solver/#mindone.diffusers.DPMSolverMultistepScheduler) on the following prompt and seed. + +```py +import mindspore as ms +from mindone.diffusers import DiffusionPipeline +import numpy as np + +pipeline = DiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", mindspore_dtype=ms.float16, use_safetensors=True +) + +prompt = "A photograph of an astronaut riding a horse on Mars, high resolution, high definition." +generator = np.random.Generator(np.random.PCG64(8)) +``` + +To change the pipelines scheduler, use the [`from_config`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/configuration/#mindone.diffusers.configuration_utils.ConfigMixin.from_config) method to load a different scheduler's `pipeline.scheduler.config` into the pipeline. + +=== "LMSDiscreteScheduler" + + [`LMSDiscreteScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/lms_discrete/#mindone.diffusers.LMSDiscreteScheduler) typically generates higher quality images than the default scheduler. + + ```py + from mindone.diffusers import LMSDiscreteScheduler + + pipeline.scheduler = LMSDiscreteScheduler.from_config(pipeline.scheduler.config) + image = pipeline(prompt, generator=generator)[0][0] + image + ``` + +=== "EulerDiscreteScheduler" + + [`EulerDiscreteScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/euler/#mindone.diffusers.EulerDiscreteScheduler) can generate higher quality images in just 30 steps. + + ```py + from mindone.diffusers import EulerDiscreteScheduler + + pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config) + image = pipeline(prompt, generator=generator)[0][0] + image + ``` + +=== "EulerAncestralDiscreteScheduler" + + [`EulerAncestralDiscreteScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/euler_ancestral/#mindone.diffusers.EulerAncestralDiscreteScheduler) can generate higher quality images in just 30 steps. + + ```py + from mindone.diffusers import EulerAncestralDiscreteScheduler + + pipeline.scheduler = EulerAncestralDiscreteScheduler.from_config(pipeline.scheduler.config) + image = pipeline(prompt, generator=generator)[0][0] + image + ``` + +=== "DPMSolverMultistepScheduler" + + [`DPMSolverMultistepScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/multistep_dpm_solver/#mindone.diffusers.DPMSolverMultistepScheduler) provides a balance between speed and quality and can generate higher quality images in just 20 steps. + + ```py + from mindone.diffusers import DPMSolverMultistepScheduler + + pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config) + image = pipeline(prompt, generator=generator)[0][0] + image + ``` + +
+(Figure: generated images from LMSDiscreteScheduler, EulerDiscreteScheduler, EulerAncestralDiscreteScheduler, and DPMSolverMultistepScheduler.)
+ +Most images look very similar and are comparable in quality. Again, it often comes down to your specific use case so a good approach is to run multiple different schedulers and compare the results. + +## Models + +Models are loaded from the [`from_pretrained`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/overview/#mindone.diffusers.ModelMixin.from_pretrained) method, which downloads and caches the latest version of the model weights and configurations. If the latest files are available in the local cache, [`from_pretrained`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/overview/#mindone.diffusers.ModelMixin.from_pretrained) reuses files in the cache instead of re-downloading them. + +Models can be loaded from a subfolder with the `subfolder` argument. For example, the model weights for [stable-diffusion-v1-5/stable-diffusion-v1-5](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) are stored in the [unet](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/tree/main/unet) subfolder. + +```python +from mindone.diffusers import UNet2DConditionModel + +unet = UNet2DConditionModel.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", subfolder="unet", use_safetensors=True) +``` + +They can also be directly loaded from a [repository](https://huggingface.co/google/ddpm-cifar10-32/tree/main). + +```python +from mindone.diffusers import UNet2DModel + +unet = UNet2DModel.from_pretrained("google/ddpm-cifar10-32", use_safetensors=True) +``` + +To load and save model variants, specify the `variant` argument in [`ModelMixin.from_pretrained`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/overview/#mindone.diffusers.ModelMixin.from_pretrained) and [`ModelMixin.save_pretrained`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/overview/#mindone.diffusers.ModelMixin.save_pretrained). + +```python +from mindone.diffusers import UNet2DConditionModel + +unet = UNet2DConditionModel.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", subfolder="unet", variant="non_ema", use_safetensors=True +) +unet.save_pretrained("./local-unet", variant="non_ema") +``` diff --git a/docs/diffusers/using-diffusers/sdxl.md b/docs/diffusers/using-diffusers/sdxl.md new file mode 100644 index 0000000000..81e103e0f2 --- /dev/null +++ b/docs/diffusers/using-diffusers/sdxl.md @@ -0,0 +1,423 @@ + + +# Stable Diffusion XL + +[Stable Diffusion XL](https://huggingface.co/papers/2307.01952) (SDXL) is a powerful text-to-image generation model that iterates on the previous Stable Diffusion models in three key ways: + +1. the UNet is 3x larger and SDXL combines a second text encoder (OpenCLIP ViT-bigG/14) with the original text encoder to significantly increase the number of parameters +2. introduces size and crop-conditioning to preserve training data from being discarded and gain more control over how a generated image should be cropped +3. introduces a two-stage model process; the *base* model (can also be run as a standalone model) generates an image as an input to the *refiner* model which adds additional high-quality details + +This guide will show you how to use SDXL for text-to-image, image-to-image, and inpainting. + +Before you begin, make sure you have the following libraries installed: + +```py +# uncomment to install the necessary libraries +#!pip install mindone transformers +``` + +!!! warning + + mindone.diffusers does not support watermarker to help identify generated images. 
+ +## Load model checkpoints + +Model weights may be stored in separate subfolders on the Hub or locally, in which case, you should use the [`from_pretrained`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline.from_pretrained) method: + +```py +from mindone.diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline +import mindspore as ms + +pipeline = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16, use_safetensors=True +) + +refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-refiner-1.0", mindspore_dtype=ms.float16, use_safetensors=True +) +``` + +You can also use the [`from_single_file`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/single_file/#mindone.diffusers.loaders.single_file.FromSingleFileMixin.from_single_file) method to load a model checkpoint stored in a single file format (`.ckpt` or `.safetensors`) from the Hub or locally: + +```py +from mindone.diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline +import mindspore as ms + +pipeline = StableDiffusionXLPipeline.from_single_file( + "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_base_1.0.safetensors", + mindspore_dtype=ms.float16 +) + +refiner = StableDiffusionXLImg2ImgPipeline.from_single_file( + "https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/blob/main/sd_xl_refiner_1.0.safetensors", mindspore_dtype=ms.float16 +) +``` + +## Text-to-image + +For text-to-image, pass a text prompt. By default, SDXL generates a 1024x1024 image for the best results. You can try setting the `height` and `width` parameters to 768x768 or 512x512, but anything below 512x512 is not likely to work. + +```py +from mindone.diffusers import StableDiffusionXLPipeline +import mindspore as ms + +pipeline_text2image = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16, use_safetensors=True +) + +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" +image = pipeline_text2image(prompt=prompt)[0][0] +image +``` + +
+ +
+ +## Image-to-image + +For image-to-image, SDXL works especially well with image sizes between 768x768 and 1024x1024. Pass an initial image, and a text prompt to condition the image with: + +```py +from mindone.diffusers import StableDiffusionXLImg2ImgPipeline +from mindone.diffusers.utils import load_image, make_image_grid +import mindspore as ms + +# use from_pipe to avoid consuming additional memory when loading a checkpoint +pipeline = StableDiffusionXLImg2ImgPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16, use_safetensors=True +) + +url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png" +init_image = load_image(url) +prompt = "a dog catching a frisbee in the jungle" +image = pipeline(prompt, image=init_image, strength=0.8, guidance_scale=10.5)[0][0] +make_image_grid([init_image, image], rows=1, cols=2) +``` + +
+ +
+ +## Inpainting + +For inpainting, you'll need the original image and a mask of what you want to replace in the original image. Create a prompt to describe what you want to replace the masked area with. + +```py +from mindone.diffusers import StableDiffusionXLInpaintPipeline +from mindone.diffusers.utils import load_image, make_image_grid +import mindspore as ms + +# use from_pipe to avoid consuming additional memory when loading a checkpoint +pipeline = StableDiffusionXLInpaintPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16, use_safetensors=True +) + +img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png" +mask_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-inpaint-mask.png" + +init_image = load_image(img_url) +mask_image = load_image(mask_url) + +prompt = "A deep sea diver floating" +image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.85, guidance_scale=12.5)[0][0] +make_image_grid([init_image, mask_image, image], rows=1, cols=3) +``` + +
+ +
+ +## Refine image quality + +SDXL includes a [refiner model](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0) specialized in denoising low-noise stage images to generate higher-quality images from the base model. There are two ways to use the refiner: + +1. use the base and refiner models together to produce a refined image +2. use the base model to produce an image, and subsequently use the refiner model to add more details to the image (this is how SDXL was originally trained) + +### Base + refiner model + +When you use the base and refiner model together to generate an image, this is known as an [*ensemble of expert denoisers*](https://research.nvidia.com/labs/dir/eDiff-I/). The ensemble of expert denoisers approach requires fewer overall denoising steps versus passing the base model's output to the refiner model, so it should be significantly faster to run. However, you won't be able to inspect the base model's output because it still contains a large amount of noise. + +As an ensemble of expert denoisers, the base model serves as the expert during the high-noise diffusion stage and the refiner model serves as the expert during the low-noise diffusion stage. Load the base and refiner model: + +```py +from mindone.diffusers import DiffusionPipeline +import mindspore as ms + +base = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16, use_safetensors=True +) + +refiner = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-refiner-1.0", + text_encoder_2=base.text_encoder_2, + vae=base.vae, + mindspore_dtype=ms.float16, + use_safetensors=True, +) +``` + +To use this approach, you need to define the number of timesteps for each model to run through their respective stages. For the base model, this is controlled by the [`denoising_end`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl/#mindone.diffusers.StableDiffusionXLPipeline) parameter and for the refiner model, it is controlled by the [`denoising_start`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl/#mindone.diffusers.StableDiffusionXLImg2ImgPipeline) parameter. + +!!! tip + + The `denoising_end` and `denoising_start` parameters should be a float between 0 and 1. These parameters are represented as a proportion of discrete timesteps as defined by the scheduler. If you're also using the `strength` parameter, it'll be ignored because the number of denoising steps is determined by the discrete timesteps the model is trained on and the declared fractional cutoff. + +Let's set `denoising_end=0.8` so the base model performs the first 80% of denoising the **high-noise** timesteps and set `denoising_start=0.8` so the refiner model performs the last 20% of denoising the **low-noise** timesteps. The base model output should be in **latent** space instead of a PIL image. + +```py +prompt = "A majestic lion jumping from a big stone at night" + +image = base( + prompt=prompt, + num_inference_steps=40, + denoising_end=0.8, + output_type="latent", +)[0] +image = refiner( + prompt=prompt, + num_inference_steps=40, + denoising_start=0.8, + image=image, +)[0][0] +image +``` + +
+(Figure: generated image of a lion on a rock at night from the default base model versus a higher-quality result from the ensemble of expert denoisers.)
+ +The refiner model can also be used for inpainting in the [`StableDiffusionXLInpaintPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl/#mindone.diffusers.StableDiffusionXLInpaintPipeline): + +```py +from mindone.diffusers import StableDiffusionXLInpaintPipeline +from mindone.diffusers.utils import load_image, make_image_grid +import mindspore as ms + +base = StableDiffusionXLInpaintPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16, use_safetensors=True +) + +refiner = StableDiffusionXLInpaintPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-refiner-1.0", + text_encoder_2=base.text_encoder_2, + vae=base.vae, + mindspore_dtype=ms.float16, + use_safetensors=True, +) + +img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" +mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" + +init_image = load_image(img_url) +mask_image = load_image(mask_url) + +prompt = "A majestic tiger sitting on a bench" +num_inference_steps = 75 +high_noise_frac = 0.7 + +image = base( + prompt=prompt, + image=init_image, + mask_image=mask_image, + num_inference_steps=num_inference_steps, + denoising_end=high_noise_frac, + output_type="latent", +)[0] +image = refiner( + prompt=prompt, + image=image, + mask_image=mask_image, + num_inference_steps=num_inference_steps, + denoising_start=high_noise_frac, +)[0][0] +make_image_grid([init_image, mask_image, image.resize((512, 512))], rows=1, cols=3) +``` + +This ensemble of expert denoisers method works well for all available schedulers! + +### Base to refiner model + +SDXL gets a boost in image quality by using the refiner model to add additional high-quality details to the fully-denoised image from the base model, in an image-to-image setting. + +Load the base and refiner models: + +```py +from mindone.diffusers import DiffusionPipeline +import mindspore as ms + +base = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16, use_safetensors=True +) + +refiner = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-refiner-1.0", + text_encoder_2=base.text_encoder_2, + vae=base.vae, + mindspore_dtype=ms.float16, + use_safetensors=True, +) +``` + +Generate an image from the base model, and set the model output to **latent** space: + +```py +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" + +image = base(prompt=prompt, output_type="latent")[0][0] +``` + +Pass the generated image to the refiner model: + +```py +image = refiner(prompt=prompt, image=image[None, :])[0][0] +``` + +
+(Figure: generated image of an astronaut riding a green horse on Mars from the base model versus a higher-quality result from the base model + refiner model.)
+ +For inpainting, load the base and the refiner model in the [`StableDiffusionXLInpaintPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl/#mindone.diffusers.StableDiffusionXLInpaintPipeline), remove the `denoising_end` and `denoising_start` parameters, and choose a smaller number of inference steps for the refiner. + +## Micro-conditioning + +SDXL training involves several additional conditioning techniques, which are referred to as *micro-conditioning*. These include original image size, target image size, and cropping parameters. The micro-conditionings can be used at inference time to create high-quality, centered images. + +!!! tip + + You can use both micro-conditioning and negative micro-conditioning parameters thanks to classifier-free guidance. They are available in the [`StableDiffusionXLPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl/#mindone.diffusers.StableDiffusionXLPipeline), [`StableDiffusionXLImg2ImgPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl/#mindone.diffusers.StableDiffusionXLImg2ImgPipeline), [`StableDiffusionXLInpaintPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl/#mindone.diffusers.StableDiffusionXLInpaintPipeline), and [`StableDiffusionXLControlNetPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/controlnet_sdxl/#mindone.diffusers.StableDiffusionXLControlNetPipeline). + +### Size conditioning + +There are two types of size conditioning: + +- [`original_size`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl/#mindone.diffusers.StableDiffusionXLPipeline) conditioning comes from upscaled images in the training batch (because it would be wasteful to discard the smaller images which make up almost 40% of the total training data). This way, SDXL learns that upscaling artifacts are not supposed to be present in high-resolution images. During inference, you can use `original_size` to indicate the original image resolution. Using the default value of `(1024, 1024)` produces higher-quality images that resemble the 1024x1024 images in the dataset. If you choose to use a lower resolution, such as `(256, 256)`, the model still generates 1024x1024 images, but they'll look like the low resolution images (simpler patterns, blurring) in the dataset. + +- [`target_size`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl/#mindone.diffusers.StableDiffusionXLPipeline) conditioning comes from finetuning SDXL to support different image aspect ratios. During inference, if you use the default value of `(1024, 1024)`, you'll get an image that resembles the composition of square images in the dataset. We recommend using the same value for `target_size` and `original_size`, but feel free to experiment with other options! 
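+For example, the sketch below conditions generation on a larger `original_size`; treat the exact values as an illustration rather than a recommendation:
+
+```py
+from mindone.diffusers import StableDiffusionXLPipeline
+import mindspore as ms
+
+pipe = StableDiffusionXLPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16, use_safetensors=True
+)
+
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+# original_size and target_size are the size micro-conditioning inputs described above
+image = pipe(prompt=prompt, original_size=(4096, 4096), target_size=(1024, 1024))[0][0]
+image
+```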
+ +🤗 Diffusers also lets you specify negative conditions about an image's size to steer generation away from certain image resolutions: + +```py +from mindone.diffusers import StableDiffusionXLPipeline +import mindspore as ms + +pipe = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16, use_safetensors=True +) + +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" +image = pipe( + prompt=prompt, + negative_original_size=(512, 512), + negative_target_size=(1024, 1024), +)[0][0] +``` + +
+(Figure: images negatively conditioned on image resolutions of (128, 128), (256, 256), and (512, 512).)
+ +### Crop conditioning + +Images generated by previous Stable Diffusion models may sometimes appear to be cropped. This is because images are actually cropped during training so that all the images in a batch have the same size. By conditioning on crop coordinates, SDXL *learns* that no cropping - coordinates `(0, 0)` - usually correlates with centered subjects and complete faces (this is the default value in 🤗 Diffusers). You can experiment with different coordinates if you want to generate off-centered compositions! + +```py +from mindone.diffusers import StableDiffusionXLPipeline +import mindspore as ms + +pipeline = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16, use_safetensors=True +) + +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" +image = pipeline(prompt=prompt, crops_coords_top_left=(256, 0))[0][0] +image +``` + +
+ +
+ +You can also specify negative cropping coordinates to steer generation away from certain cropping parameters: + +```py +from mindone.diffusers import StableDiffusionXLPipeline +import mindspore as ms + +pipe = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16, use_safetensors=True +) + +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" +image = pipe( + prompt=prompt, + negative_original_size=(512, 512), + negative_crops_coords_top_left=(0, 0), + negative_target_size=(1024, 1024), +)[0][0] +image +``` + +## Use a different prompt for each text-encoder + +SDXL uses two text-encoders, so it is possible to pass a different prompt to each text-encoder, which can [improve quality](https://github.com/huggingface/diffusers/issues/4004#issuecomment-1627764201). Pass your original prompt to `prompt` and the second prompt to `prompt_2` (use `negative_prompt` and `negative_prompt_2` if you're using negative prompts): + +```py +from mindone.diffusers import StableDiffusionXLPipeline +import mindspore as ms + +pipeline = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16, use_safetensors=True +) + +# prompt is passed to OAI CLIP-ViT/L-14 +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" +# prompt_2 is passed to OpenCLIP-ViT/bigG-14 +prompt_2 = "Van Gogh painting" +image = pipeline(prompt=prompt, prompt_2=prompt_2)[0][0] +image +``` + +
+(Figure: generated image of an astronaut in a jungle in the style of a Van Gogh painting.)
+ +The dual text-encoders also support textual inversion embeddings that need to be loaded separately as explained in the [SDXL textual inversion](textual_inversion_inference.md#stable-diffusion-xl) section. + +## Optimizations + +SDXL is a large model, and you may need to optimize memory to get it to run on your hardware. Here is a tip to save memory and speed up inference. + +Enable [xFormers](../optimization/xformers.md) to run SDXL: + +```diff ++ base.enable_xformers_memory_efficient_attention() ++ refiner.enable_xformers_memory_efficient_attention() +``` diff --git a/docs/diffusers/using-diffusers/sdxl_turbo.md b/docs/diffusers/using-diffusers/sdxl_turbo.md new file mode 100644 index 0000000000..393747df9c --- /dev/null +++ b/docs/diffusers/using-diffusers/sdxl_turbo.md @@ -0,0 +1,108 @@ + + +# Stable Diffusion XL Turbo + +SDXL Turbo is an adversarial time-distilled [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) (SDXL) model capable +of running inference in as little as 1 step. + +This guide will show you how to use SDXL-Turbo for text-to-image and image-to-image. + +Before you begin, make sure you have the following libraries installed: + +```py +# uncomment to install the necessary libraries +#!pip install mindone transformers +``` + +## Load model checkpoints + +Model weights may be stored in separate subfolders on the Hub or locally, in which case, you should use the [`from_pretrained`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline.from_pretrained) method: + +```py +from mindone.diffusers import StableDiffusionXLPipeline +import mindspore as ms + +pipeline = StableDiffusionXLPipeline.from_pretrained("stabilityai/sdxl-turbo", mindspore_dtype=ms.float16, variant="fp16") +``` + +You can also use the [`from_single_file`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/single_file/#mindone.diffusers.loaders.single_file.FromSingleFileMixin.from_single_file) method to load a model checkpoint stored in a single file format (`.ckpt` or `.safetensors`) from the Hub or locally. For this loading method, you need to set `timestep_spacing="trailing"` (feel free to experiment with the other scheduler config values to get better results): + +```py +from mindone.diffusers import StableDiffusionXLPipeline, EulerAncestralDiscreteScheduler +import mindspore as ms + +pipeline = StableDiffusionXLPipeline.from_single_file( + "https://huggingface.co/stabilityai/sdxl-turbo/blob/main/sd_xl_turbo_1.0_fp16.safetensors", + mindspore_dtype=ms.float16, variant="fp16") +pipeline.scheduler = EulerAncestralDiscreteScheduler.from_config(pipeline.scheduler.config, timestep_spacing="trailing") +``` + +## Text-to-image + +For text-to-image, pass a text prompt. By default, SDXL Turbo generates a 512x512 image, and that resolution gives the best results. You can try setting the `height` and `width` parameters to 768x768 or 1024x1024, but you should expect quality degradations when doing so. + +Make sure to set `guidance_scale` to 0.0 to disable, as the model was trained without it. A single inference step is enough to generate high quality images. +Increasing the number of steps to 2, 3 or 4 should improve image quality. 
+ +```py +from mindone.diffusers import StableDiffusionXLPipeline +import mindspore as ms + +pipeline_text2image = StableDiffusionXLPipeline.from_pretrained("stabilityai/sdxl-turbo", mindspore_dtype=ms.float16, variant="fp16") + +prompt = "A cinematic shot of a baby racoon wearing an intricate italian priest robe." + +image = pipeline_text2image(prompt=prompt, guidance_scale=0.0, num_inference_steps=1)[0][0] +image +``` + +
+(Figure: generated image of a racoon in a robe.)
+ +## Image-to-image + +For image-to-image generation, make sure that `num_inference_steps * strength` is larger or equal to 1. +The image-to-image pipeline will run for `int(num_inference_steps * strength)` steps, e.g. `0.5 * 2.0 = 1` step in +our example below. + +```py +from mindone.diffusers import StableDiffusionXLImg2ImgPipeline +from mindone.diffusers.utils import load_image, make_image_grid +import mindspore as ms + +# use from_pipe to avoid consuming additional memory when loading a checkpoint +pipeline_image2image = StableDiffusionXLImg2ImgPipeline.from_pretrained("stabilityai/sdxl-turbo", mindspore_dtype=ms.float16, variant="fp16") + +init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png") +init_image = init_image.resize((512, 512)) + +prompt = "cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k" + +image = pipeline_image2image(prompt, image=init_image, strength=0.5, guidance_scale=0.0, num_inference_steps=2)[0][0] +make_image_grid([init_image, image], rows=1, cols=2) +``` + +
+(Figure: image-to-image generation sample using SDXL Turbo.)
+ +## Speed-up SDXL Turbo even more + +- When using the default VAE, keep it in `float32` to avoid costly `dtype` conversions before and after each generation. You only need to do this one before your first generation: + +```py +pipe.upcast_vae() +``` + +As an alternative, you can also use a [16-bit VAE](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix) created by community member [`@madebyollin`](https://huggingface.co/madebyollin) that does not need to be upcasted to `float32`. diff --git a/docs/diffusers/using-diffusers/shap-e.md b/docs/diffusers/using-diffusers/shap-e.md new file mode 100644 index 0000000000..34150120f8 --- /dev/null +++ b/docs/diffusers/using-diffusers/shap-e.md @@ -0,0 +1,182 @@ + + +# Shap-E + +Shap-E is a conditional model for generating 3D assets which could be used for video game development, interior design, and architecture. It is trained on a large dataset of 3D assets, and post-processed to render more views of each object and produce 16K instead of 4K point clouds. The Shap-E model is trained in two steps: + +1. an encoder accepts the point clouds and rendered views of a 3D asset and outputs the parameters of implicit functions that represent the asset +2. a diffusion model is trained on the latents produced by the encoder to generate either neural radiance fields (NeRFs) or a textured 3D mesh, making it easier to render and use the 3D asset in downstream applications + +This guide will show you how to use Shap-E to start generating your own 3D assets! + +Before you begin, make sure you have the following libraries installed: + +```py +# uncomment to install the necessary libraries +#!pip install mindone transformers +``` + +## Text-to-3D + +To generate a gif of a 3D object, pass a text prompt to the [`ShapEPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/shap_e/#mindone.diffusers.ShapEPipeline). The pipeline generates a list of image frames which are used to create the 3D object. + +```py +import mindspore as ms +from mindone.diffusers import ShapEPipeline + +pipe = ShapEPipeline.from_pretrained("openai/shap-e", mindspore_dtype=ms.float16, variant="fp16") + +guidance_scale = 15.0 +prompt = ["A firecracker", "A birthday cupcake"] + +images = pipe( + prompt, + guidance_scale=guidance_scale, + num_inference_steps=64, + frame_size=256, +) +``` + +Now use the [`export_to_gif`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/utilities/#mindone.diffusers.utils.export_to_gif) function to turn the list of image frames into a gif of the 3D object. + +```py +from mindone.diffusers.utils import export_to_gif + +export_to_gif(images[0][0], "firecracker_3d.gif") +export_to_gif(images[0][1], "cake_3d.gif") +``` + +
+(Figure: generated 3D gifs for prompt = "A firecracker" and prompt = "A birthday cupcake".)
+ +## Image-to-3D + +To generate a 3D object from another image, use the [`ShapEImg2ImgPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/shap_e/#mindone.diffusers.ShapEImg2ImgPipeline). You can use an existing image or generate an entirely new one. Let's use the [Kandinsky 2.1](../api/pipelines/kandinsky.md) model to generate a new image. + +```py +from mindone.diffusers import DiffusionPipeline +import mindspore as ms + +prior_pipeline = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", mindspore_dtype=ms.float16, use_safetensors=True) +pipeline = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", mindspore_dtype=ms.float16, use_safetensors=True) + +prompt = "A cheeseburger, white background" + +image_embeds, negative_image_embeds = prior_pipeline(prompt, guidance_scale=1.0) +image = pipeline( + prompt, + image_embeds=image_embeds, + negative_image_embeds=negative_image_embeds, +)[0][0] + +image.save("burger.png") +``` + +Pass the cheeseburger to the [`ShapEImg2ImgPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/shap_e/#mindone.diffusers.ShapEImg2ImgPipeline) to generate a 3D representation of it. + +```py +from PIL import Image +from mindone.diffusers import ShapEImg2ImgPipeline +from mindone.diffusers.utils import export_to_gif + +pipe = ShapEImg2ImgPipeline.from_pretrained("openai/shap-e-img2img", mindspore_dtype=ms.float16, variant="fp16") + +guidance_scale = 3.0 +image = Image.open("burger.png").resize((256, 256)) + +images = pipe( + image, + guidance_scale=guidance_scale, + num_inference_steps=64, + frame_size=256, +)[0] + +gif_path = export_to_gif(images[0], "burger_3d.gif") +``` + +
+
+<!-- image grid: cheeseburger | 3D cheeseburger -->
+
+ +## Generate mesh + +Shap-E is a flexible model that can also generate textured mesh outputs to be rendered for downstream applications. In this example, you'll convert the output into a `glb` file because the 🤗 Datasets library supports mesh visualization of `glb` files which can be rendered by the [Dataset viewer](https://huggingface.co/docs/hub/datasets-viewer#dataset-preview). + +You can generate mesh outputs for both the [`ShapEPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/shap_e/#mindone.diffusers.ShapEPipeline) and [`ShapEImg2ImgPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/shap_e/#mindone.diffusers.ShapEImg2ImgPipeline) by specifying the `output_type` parameter as `"mesh"`: + +```py +import mindspore as ms +from mindone.diffusers import ShapEPipeline + +pipe = ShapEPipeline.from_pretrained("openai/shap-e", mindspore_dtype=ms.float16, variant="fp16") + +guidance_scale = 15.0 +prompt = "A birthday cupcake" + +images = pipe(prompt, guidance_scale=guidance_scale, num_inference_steps=64, frame_size=256, output_type="mesh")[0] +``` + +Use the [`export_to_ply`] function to save the mesh output as a `ply` file: + +!!! tip + + You can optionally save the mesh output as an `obj` file with the [`export_to_obj`] function. The ability to save the mesh output in a variety of formats makes it more flexible for downstream usage! + +```py +from mindone.diffusers.utils import export_to_ply + +ply_path = export_to_ply(images[0], "3d_cake.ply") +print(f"Saved to folder: {ply_path}") +``` + +Then you can convert the `ply` file to a `glb` file with the trimesh library: + +```py +import trimesh + +mesh = trimesh.load("3d_cake.ply") +mesh_export = mesh.export("3d_cake.glb", file_type="glb") +``` + +By default, the mesh output is focused from the bottom viewpoint but you can change the default viewpoint by applying a rotation transform: + +```py +import trimesh +import numpy as np + +mesh = trimesh.load("3d_cake.ply") +rot = trimesh.transformations.rotation_matrix(-np.pi / 2, [1, 0, 0]) +mesh = mesh.apply_transform(rot) +mesh_export = mesh.export("3d_cake.glb", file_type="glb") +``` + +Upload the mesh file to your dataset repository to visualize it with the Dataset viewer! + +
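+One way to do that upload is with the `huggingface_hub` client. This is a minimal sketch, assuming you are logged in and have already created a dataset repository; `your-username/3d-meshes` is a hypothetical repository name:
+
+```py
+from huggingface_hub import HfApi
+
+api = HfApi()
+# upload the exported glb so the Dataset viewer can render it
+api.upload_file(
+    path_or_fileobj="3d_cake.glb",
+    path_in_repo="3d_cake.glb",
+    repo_id="your-username/3d-meshes",  # hypothetical dataset repository
+    repo_type="dataset",
+)
+```
+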
+ +
diff --git a/docs/diffusers/using-diffusers/svd.md b/docs/diffusers/using-diffusers/svd.md new file mode 100644 index 0000000000..3c1838e42b --- /dev/null +++ b/docs/diffusers/using-diffusers/svd.md @@ -0,0 +1,112 @@ + + +# Stable Video Diffusion + +[Stable Video Diffusion (SVD)](https://huggingface.co/papers/2311.15127) is a powerful image-to-video generation model that can generate 2-4 second high resolution (576x1024) videos conditioned on an input image. + +This guide will show you how to use SVD to generate short videos from images. + +Before you begin, make sure you have the following libraries installed: + +```py +!pip install mindone transformers +``` + +The are two variants of this model, [SVD](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid) and [SVD-XT](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt). The SVD checkpoint is trained to generate 14 frames and the SVD-XT checkpoint is further finetuned to generate 25 frames. + +You'll use the SVD-XT checkpoint for this guide. + +!!! warning + + Due to precision issues, modifications are required to ensure StableVideoDiffusionPipeline functions properly. For further details, please refer to the [Limitation](../limitations.md). + +```python +import mindspore as ms + +from mindone.diffusers import StableVideoDiffusionPipeline +from mindone.diffusers.utils import load_image, export_to_video +import numpy as np + +pipe = StableVideoDiffusionPipeline.from_pretrained( + "stabilityai/stable-video-diffusion-img2vid-xt", mindspore_dtype=ms.float16, variant="fp16" +) + +# Load the conditioning image +image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png") +image = image.resize((1024, 576)) + +generator = np.random.Generator(np.random.PCG64(42)) +frames = pipe(image, num_frames=5, decode_chunk_size=8, generator=generator)[0] + +export_to_video(frames, "generated.mp4", fps=7) +``` + +
+
+<!-- image grid: "source image of a rocket" | "generated video from source image" -->
+
+ +## Reduce memory usage + +Video generation is very memory intensive because you're essentially generating `num_frames` all at once, similar to text-to-image generation with a high batch size. To reduce the memory requirement, there are multiple options that trade-off inference speed for lower memory requirement: + +- enable feed-forward chunking: the feed-forward layer runs in a loop instead of running a single feed-forward with a huge batch size. +- reduce `decode_chunk_size`: the VAE decodes frames in chunks instead of decoding them all together. Setting `decode_chunk_size=1` decodes one frame at a time and uses the least amount of memory (we recommend adjusting this value based on your NPU memory) but the video might have some flickering. + +```diff +- frames = pipe(image, num_frames=5, decode_chunk_size=8, generator=generator)[0][0] ++ pipe.unet.enable_forward_chunking() ++ frames = pipe(image, num_frames=5, decode_chunk_size=2, generator=generator)[0][0] +``` + +Using all these tricks together should lower the memory requirement. + +## Micro-conditioning + +Stable Diffusion Video also accepts micro-conditioning, in addition to the conditioning image, which allows more control over the generated video: + +- `fps`: the frames per second of the generated video. +- `motion_bucket_id`: the motion bucket id to use for the generated video. This can be used to control the motion of the generated video. Increasing the motion bucket id increases the motion of the generated video. +- `noise_aug_strength`: the amount of noise added to the conditioning image. The higher the values the less the video resembles the conditioning image. Increasing this value also increases the motion of the generated video. + +For example, to generate a video with more motion, use the `motion_bucket_id` and `noise_aug_strength` micro-conditioning parameters: + +```python +import mindspore as ms + +from mindone.diffusers import StableVideoDiffusionPipeline +from mindone.diffusers.utils import load_image, export_to_video +import numpy as np + +pipe = StableVideoDiffusionPipeline.from_pretrained( + "stabilityai/stable-video-diffusion-img2vid-xt", mindspore_dtype=ms.float16, variant="fp16" +) + +# Load the conditioning image +image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png") +image = image.resize((1024, 576)) + +generator = np.random.Generator(np.random.PCG64(7)) +frames = pipe(image, num_frames=5, decode_chunk_size=8, generator=generator, motion_bucket_id=180, noise_aug_strength=0.1)[0] +export_to_video(frames, "generated.mp4", fps=7) +``` + +
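+The `fps` conditioning is passed in the same call. The snippet below is a sketch that reuses the `pipe`, `image`, and `generator` objects from the example above and assumes the MindONE pipeline exposes the same `fps` argument as the Diffusers implementation; the values are purely illustrative:
+
+```python
+# condition the model on a lower frame rate (illustrative values)
+frames = pipe(
+    image,
+    num_frames=5,
+    decode_chunk_size=8,
+    generator=generator,
+    fps=10,
+    motion_bucket_id=127,
+    noise_aug_strength=0.02,
+)[0]
+export_to_video(frames, "generated_fps10.mp4", fps=10)
+```
+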
+ +
diff --git a/docs/diffusers/using-diffusers/t2i_adapter.md b/docs/diffusers/using-diffusers/t2i_adapter.md new file mode 100644 index 0000000000..0e65281c44 --- /dev/null +++ b/docs/diffusers/using-diffusers/t2i_adapter.md @@ -0,0 +1,215 @@ + + +# T2I-Adapter + +[T2I-Adapter](https://hf.co/papers/2302.08453) is a lightweight adapter for controlling and providing more accurate +structure guidance for text-to-image models. It works by learning an alignment between the internal knowledge of the +text-to-image model and an external control signal, such as edge detection or depth estimation. + +The T2I-Adapter design is simple, the condition is passed to four feature extraction blocks and three downsample +blocks. This makes it fast and easy to train different adapters for different conditions which can be plugged into the +text-to-image model. T2I-Adapter is similar to [ControlNet](controlnet.md) except it is smaller (~77M parameters) and +faster because it only runs once during the diffusion process. The downside is that performance may be slightly worse +than ControlNet. + +This guide will show you how to use T2I-Adapter with different Stable Diffusion models and how you can compose multiple +T2I-Adapters to impose more than one condition. + +!!! tip + + There are several T2I-Adapters available for different conditions, such as color palette, depth, sketch, pose, and + segmentation. Check out the [TencentARC](https://hf.co/TencentARC) repository to try them out! + +Before you begin, make sure you have the following libraries installed. + +```py +# uncomment to install the necessary libraries in Colab +#!pip install mindone +``` + +## Text-to-image + +Text-to-image models rely on a prompt to generate an image, but sometimes, text alone may not be enough to provide more +accurate structural guidance. T2I-Adapter allows you to provide an additional control image to guide the generation +process. For example, you can provide a canny image (a white outline of an image on a black background) to guide the +model to generate an image with a similar structure. + +=== "Stable Diffusion 1.5" + + Create a canny image with the [opencv-library](https://github.com/opencv/opencv-python). + + ```py + import cv2 + import numpy as np + from PIL import Image + from mindone.diffusers.utils import load_image + + image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png") + image = np.array(image) + + low_threshold = 100 + high_threshold = 200 + + image = cv2.Canny(image, low_threshold, high_threshold) + image = Image.fromarray(image) + ``` + + Now load a T2I-Adapter conditioned on [canny images](https://hf.co/TencentARC/t2iadapter_canny_sd15v2) and pass it to + the [`StableDiffusionAdapterPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/adapter/#mindone.diffusers.StableDiffusionAdapterPipeline). + + ```py + import mindspore as ms + from mindone.diffusers import StableDiffusionAdapterPipeline, T2IAdapter + import numpy as np + + adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_canny_sd15v2", mindspore_dtype=ms.float16) + pipeline = StableDiffusionAdapterPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", + adapter=adapter, + mindspore_dtype=ms.float16, + ) + ``` + + Finally, pass your prompt and control image to the pipeline. 
+ + ```py + generator = np.random.Generator(np.random.PCG64(0)) + + image = pipeline( + prompt="cinematic photo of a plush and soft midcentury style rug on a wooden floor, 35mm photograph, film, professional, 4k, highly detailed", + image=image, + generator=generator, + )[0][0] + image + ``` + +
+ +
+ +=== "Stable Diffusion XL" + + !!! warning + + ⚠️ MindONE currently does not support the full process for the following code, as MindONE does not yet support `CannyDetector` from controlnet_aux.canny. Therefore, you need to prepare the `canny image` in advance to continue the process. + + Load a canny image. + + ```py + from diffusers.utils import load_image + + image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png") + image = load_image("path/to/canny_image") + ``` + + Now load a T2I-Adapter conditioned on [canny images](https://hf.co/TencentARC/t2i-adapter-canny-sdxl-1.0) and pass it + to the [`StableDiffusionXLAdapterPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/adapter/#mindone.diffusers.StableDiffusionXLAdapterPipeline). + + ```py + import mindspore as ms + from mindone.diffusers import StableDiffusionXLAdapterPipeline, T2IAdapter, EulerAncestralDiscreteScheduler, AutoencoderKL + import numpy as np + + scheduler = EulerAncestralDiscreteScheduler.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", subfolder="scheduler") + vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", mindspore_dtype=ms.float16) + adapter = T2IAdapter.from_pretrained("TencentARC/t2i-adapter-canny-sdxl-1.0", mindspore_dtype=ms.float16) + pipeline = StableDiffusionXLAdapterPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + adapter=adapter, + vae=vae, + scheduler=scheduler, + mindspore_dtype=ms.float16, + ) + ``` + + Finally, pass your prompt and control image to the pipeline. + + ```py + generator = np.random.Generator(np.random.PCG64(0)) + + image = pipeline( + prompt="cinematic photo of a plush and soft midcentury style rug on a wooden floor, 35mm photograph, film, professional, 4k, highly detailed", + image=image, + generator=generator, + )[0][0] + image + ``` + +
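+    As a workaround for the missing `CannyDetector`, you can also prepare the canny image yourself with OpenCV, the same way as in the Stable Diffusion 1.5 example above. This is a sketch using the same HF logo image; adjust the thresholds to taste:
+
+    ```py
+    import cv2
+    import numpy as np
+    from PIL import Image
+    from mindone.diffusers.utils import load_image
+
+    image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png")
+    image = np.array(image)
+
+    # extract edges and convert back to a PIL image for the pipeline
+    image = cv2.Canny(image, 100, 200)
+    image = Image.fromarray(image)
+    ```
+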
+ +
+ +## MultiAdapter + +T2I-Adapters are also composable, allowing you to use more than one adapter to impose multiple control conditions on an +image. For example, you can use a pose map to provide structural control and a depth map for depth control. This is +enabled by the [`MultiAdapter`] class. + +Let's condition a text-to-image model with a pose and depth adapter. Create and place your depth and pose image and in a list. + +```py +from mindone.diffusers.utils import load_image + +pose_image = load_image( + "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_sample_input.png" +) +depth_image = load_image( + "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_input.png" +) +cond = [pose_image, depth_image] +prompt = ["Santa Claus walking into an office room with a beautiful city view"] +``` + +
+
+<!-- image grid: depth image | pose image -->
+
+ +Load the corresponding pose and depth adapters as a list in the [`MultiAdapter`] class. + +```py +import mindspore as ms +from mindone.diffusers import StableDiffusionAdapterPipeline, MultiAdapter, T2IAdapter + +adapters = MultiAdapter( + [ + T2IAdapter.from_pretrained("TencentARC/t2iadapter_keypose_sd14v1"), + T2IAdapter.from_pretrained("TencentARC/t2iadapter_depth_sd14v1"), + ] +) +adapters = adapters.to(ms.float16) +``` + +Finally, load a [`StableDiffusionAdapterPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/adapter/#mindone.diffusers.StableDiffusionAdapterPipeline) with the adapters, and pass your prompt and conditioned images to +it. Use the [`adapter_conditioning_scale`] to adjust the weight of each adapter on the image. + +```py +pipeline = StableDiffusionAdapterPipeline.from_pretrained( + "CompVis/stable-diffusion-v1-4", + mindspore_dtype=ms.float16, + adapter=adapters, +) + +image = pipeline(prompt, cond, adapter_conditioning_scale=[0.7, 0.7])[0][0] +image +``` + +
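+The scale values follow the order of the adapters in `MultiAdapter`, so the first weight applies to the pose adapter and the second to the depth adapter here. For example, to let the pose condition dominate, you could weight it higher than the depth condition (illustrative values):
+
+```py
+# weight the pose adapter more heavily than the depth adapter
+image = pipeline(prompt, cond, adapter_conditioning_scale=[0.8, 0.5])[0][0]
+image
+```
+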
+ +
diff --git a/docs/diffusers/using-diffusers/text-img2vid.md b/docs/diffusers/using-diffusers/text-img2vid.md new file mode 100644 index 0000000000..5a2e7d23cb --- /dev/null +++ b/docs/diffusers/using-diffusers/text-img2vid.md @@ -0,0 +1,306 @@ + + +# Text or image-to-video + +Driven by the success of text-to-image diffusion models, generative video models are able to generate short clips of video from a text prompt or an initial image. These models extend a pretrained diffusion model to generate videos by adding some type of temporal and/or spatial convolution layer to the architecture. A mixed dataset of images and videos are used to train the model which learns to output a series of video frames based on the text or image conditioning. + +This guide will show you how to generate videos, how to configure video model parameters, and how to control video generation. + +## Popular models + +!!! tip + + Discover other cool and trending video generation models on the Hub [here](https://huggingface.co/models?pipeline_tag=text-to-video&sort=trending)! + +[Stable Video Diffusions (SVD)](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid), [I2VGen-XL](https://huggingface.co/ali-vilab/i2vgen-xl/) and [AnimateDiff](https://huggingface.co/guoyww/animatediff) are popular models used for video diffusion. Each model is distinct. For example, AnimateDiff inserts a motion modeling module into a frozen text-to-image model to generate personalized animated images, whereas SVD is entirely pretrained from scratch with a three-stage training process to generate short high-quality videos. + +### Stable Video Diffusion + +[SVD](../api/pipelines/svd.md) is based on the Stable Diffusion 2.1 model and it is trained on images, then low-resolution videos, and finally a smaller dataset of high-resolution videos. This model generates a short 2-4 second video from an initial image. You can learn more details about model, like micro-conditioning, in the [Stable Video Diffusion](../using-diffusers/svd.md) guide. + +Begin by loading the [`StableVideoDiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/svd/#mindone.diffusers.StableVideoDiffusionPipeline) and passing an initial image to generate a video from. + +```py +import mindspore as ms +from mindone.diffusers import StableVideoDiffusionPipeline +from mindone.diffusers.utils import load_image, export_to_video +import numpy as np + +pipeline = StableVideoDiffusionPipeline.from_pretrained( + "stabilityai/stable-video-diffusion-img2vid-xt", mindspore_dtype=ms.float16, variant="fp16" +) + +image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png") +image = image.resize((1024, 576)) + +generator = np.random.Generator(np.random.PCG64(42)) +frames = pipeline(image, decode_chunk_size=8, generator=generator, num_frames=5)[0] +export_to_video(frames, "generated.mp4", fps=7) +``` + +
+
+<!-- image grid: initial image | generated video -->
+
+ +### I2VGen-XL + +[I2VGen-XL](../api/pipelines/i2vgenxl.md) is a diffusion model that can generate higher resolution videos than SVD and it is also capable of accepting text prompts in addition to images. The model is trained with two hierarchical encoders (detail and global encoder) to better capture low and high-level details in images. These learned details are used to train a video diffusion model which refines the video resolution and details in the generated video. + +You can use I2VGen-XL by loading the [`I2VGenXLPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/i2vgenxl/#mindone.diffusers.I2VGenXLPipeline), and passing a text and image prompt to generate a video. + +```py +import mindspore as ms +from mindone.diffusers import I2VGenXLPipeline +from mindone.diffusers.utils import export_to_gif, load_image +import numpy as np + +pipeline = I2VGenXLPipeline.from_pretrained("ali-vilab/i2vgen-xl", mindspore_dtype=ms.float16, variant="fp16") + +image_url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/i2vgen_xl_images/img_0009.png" +image = load_image(image_url).convert("RGB") +image = image.resize((image.width // 2, image.height // 2)) + +prompt = "Papers were floating in the air on a table in the library" +negative_prompt = "Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms" +generator = np.random.Generator(np.random.PCG64(8888)) + +frames = pipeline( + prompt=prompt, + image=image, + height=image.height, + width=image.width, + num_inference_steps=50, + negative_prompt=negative_prompt, + guidance_scale=9.0, + generator=generator +)[0][0] +export_to_gif(frames, "i2v.gif") +``` + +
+
+<!-- image grid: initial image | generated video -->
+
+ +### AnimateDiff + +[AnimateDiff](../api/pipelines/animatediff.md) is an adapter model that inserts a motion module into a pretrained diffusion model to animate an image. The adapter is trained on video clips to learn motion which is used to condition the generation process to create a video. It is faster and easier to only train the adapter and it can be loaded into most diffusion models, effectively turning them into "video models". + +Start by loading a [`MotionAdapter`]. + +```py +import mindspore as ms +from mindone.diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter +from mindone.diffusers.utils import export_to_gif +import numpy as np + +adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", mindspore_dtype=ms.float16) +``` + +Then load a finetuned Stable Diffusion model with the [`AnimateDiffPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/animatediff/#mindone.diffusers.AnimateDiffPipeline). + +```py +pipeline = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=adapter, mindspore_dtype=ms.float16) +scheduler = DDIMScheduler.from_pretrained( + "emilianJR/epiCRealism", + subfolder="scheduler", + clip_sample=False, + timestep_spacing="linspace", + beta_schedule="linear", + steps_offset=1, +) +pipeline.scheduler = scheduler +``` + +Create a prompt and generate the video. + +```py +output = pipeline( + prompt="A space rocket with trails of smoke behind it launching into space from the desert, 4k, high resolution", + negative_prompt="bad quality, worse quality, low resolution", + num_frames=16, + guidance_scale=7.5, + num_inference_steps=50, + generator=np.random.Generator(np.random.PCG64(49)), +) +frames = output[0][0] +export_to_gif(frames, "animation.gif") +``` + +
+ +
+ +## Configure model parameters + +There are a few important parameters you can configure in the pipeline that'll affect the video generation process and quality. Let's take a closer look at what these parameters do and how changing them affects the output. + +### Number of frames + +The `num_frames` parameter determines how many video frames are generated per second. A frame is an image that is played in a sequence of other frames to create motion or a video. This affects video length because the pipeline generates a certain number of frames per second (check a pipeline's API reference for the default value). To increase the video duration, you'll need to increase the `num_frames` parameter. + +```py +import mindspore as ms +from mindone.diffusers import I2VGenXLPipeline +from mindone.diffusers.utils import export_to_gif, load_image +import numpy as np + +pipeline = I2VGenXLPipeline.from_pretrained("ali-vilab/i2vgen-xl", mindspore_dtype=ms.float16, variant="fp16") + +image_url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/i2vgen_xl_images/img_0009.png" +image = load_image(image_url).convert("RGB") +image = image.resize((image.width // 2, image.height // 2)) + +prompt = "Papers were floating in the air on a table in the library" +negative_prompt = "Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms" +generator = np.random.Generator(np.random.PCG64(8888)) + +frames = pipeline( + prompt=prompt, + image=image, + height=image.height, + width=image.width, + num_inference_steps=50, + negative_prompt=negative_prompt, + guidance_scale=9.0, + generator=generator, + num_frames=25, +)[0][0] +export_to_gif(frames, "i2v.gif") +``` + +
+
+<!-- image grid: num_frames=14 | num_frames=25 -->
+
+ +### Guidance scale + +The `guidance_scale` parameter controls how closely aligned the generated video and text prompt or initial image is. A higher `guidance_scale` value means your generated video is more aligned with the text prompt or initial image, while a lower `guidance_scale` value means your generated video is less aligned which could give the model more "creativity" to interpret the conditioning input. + +!!! tip + + SVD uses the `min_guidance_scale` and `max_guidance_scale` parameters for applying guidance to the first and last frames respectively. + +```py +import mindspore as ms +from mindone.diffusers import I2VGenXLPipeline +from mindone.diffusers.utils import export_to_gif, load_image +import numpy as np + +pipeline = I2VGenXLPipeline.from_pretrained("ali-vilab/i2vgen-xl", mindspore_dtype=ms.float16, variant="fp16") + +image_url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/i2vgen_xl_images/img_0009.png" +image = load_image(image_url).convert("RGB") +image = image.resize((image.width // 2, image.height // 2)) + +prompt = "Papers were floating in the air on a table in the library" +negative_prompt = "Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms" +generator = np.random.Generator(np.random.PCG64(0)) + +frames = pipeline( + prompt=prompt, + image=image, + height=image.height, + width=image.width, + num_inference_steps=50, + negative_prompt=negative_prompt, + guidance_scale=1.0, + generator=generator +)[0][0] +export_to_gif(frames, "i2v.gif") +``` + +
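+As noted in the tip above, SVD exposes `min_guidance_scale` and `max_guidance_scale` instead of a single value. This sketch reuses the `pipeline`, `image`, and `generator` objects from the Stable Video Diffusion example earlier in this guide and assumes the MindONE pipeline mirrors the Diffusers arguments; the values are illustrative:
+
+```python
+# guidance is interpolated from min_guidance_scale (first frame) to max_guidance_scale (last frame)
+frames = pipeline(
+    image,
+    decode_chunk_size=8,
+    generator=generator,
+    num_frames=5,
+    min_guidance_scale=1.0,
+    max_guidance_scale=3.0,
+)[0]
+export_to_video(frames, "generated.mp4", fps=7)
+```
+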
+
+<!-- image grid: guidance_scale=9.0 | guidance_scale=1.0 -->
+
+ +### Negative prompt + +A negative prompt deters the model from generating things you don’t want it to. This parameter is commonly used to improve overall generation quality by removing poor or bad features such as “low resolution” or “bad details”. + +```py +import mindspore as ms +from mindone.diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter +from mindone.diffusers.utils import export_to_gif +import numpy as np + +adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", mindspore_dtype=ms.float16) + +pipeline = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=adapter, mindspore_dtype=ms.float16) +scheduler = DDIMScheduler.from_pretrained( + "emilianJR/epiCRealism", + subfolder="scheduler", + clip_sample=False, + timestep_spacing="linspace", + beta_schedule="linear", + steps_offset=1, +) +pipeline.scheduler = scheduler + +output = pipeline( + prompt="360 camera shot of a sushi roll in a restaurant", + negative_prompt="Distorted, discontinuous, ugly, blurry, low resolution, motionless, static", + num_frames=16, + guidance_scale=7.5, + num_inference_steps=50, + generator=np.random.Generator(np.random.PCG64(0)), +) +frames = output[0][0] +export_to_gif(frames, "animation.gif") +``` + +
+
+<!-- image grid: no negative prompt | negative prompt applied -->
+
+ +### Model-specific parameters + +There are some pipeline parameters that are unique to each model such as adjusting the motion in a video or adding noise to the initial image. + +Stable Video Diffusion provides additional micro-conditioning for the frame rate with the `fps` parameter and for motion with the `motion_bucket_id` parameter. Together, these parameters allow for adjusting the amount of motion in the generated video. + +There is also a `noise_aug_strength` parameter that increases the amount of noise added to the initial image. Varying this parameter affects how similar the generated video and initial image are. A higher `noise_aug_strength` also increases the amount of motion. To learn more, read the [Micro-conditioning](../using-diffusers/svd.md#micro-conditioning) guide. diff --git a/docs/diffusers/using-diffusers/textual_inversion_inference.md b/docs/diffusers/using-diffusers/textual_inversion_inference.md new file mode 100644 index 0000000000..a98ac6b5ce --- /dev/null +++ b/docs/diffusers/using-diffusers/textual_inversion_inference.md @@ -0,0 +1,112 @@ + + +# Textual inversion + +The [`StableDiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/stable_diffusion/text2img/#mindone.diffusers.StableDiffusionPipeline) supports textual inversion, a technique that enables a model like Stable Diffusion to learn a new concept from just a few sample images. This gives you more control over the generated images and allows you to tailor the model towards specific concepts. You can get started quickly with a collection of community created concepts in the [Stable Diffusion Conceptualizer](https://huggingface.co/spaces/sd-concepts-library/stable-diffusion-conceptualizer). + +This guide will show you how to run inference with textual inversion using a pre-learned concept from the Stable Diffusion Conceptualizer. If you're interested in teaching a model new concepts with textual inversion, take a look at the [Textual Inversion](../training/text_inversion.md) training guide. 
+
+Import the necessary libraries:
+
+```py
+import mindspore as ms
+from mindone.diffusers import StableDiffusionPipeline
+from mindone.diffusers.utils import make_image_grid
+```
+
+## Stable Diffusion 1 and 2
+
+Pick a Stable Diffusion checkpoint and a pre-learned concept from the [Stable Diffusion Conceptualizer](https://huggingface.co/spaces/sd-concepts-library/stable-diffusion-conceptualizer):
+
+```py
+pretrained_model_name_or_path = "stable-diffusion-v1-5/stable-diffusion-v1-5"
+repo_id_embeds = "sd-concepts-library/cat-toy"
+```
+
+Now you can load a pipeline, and pass the pre-learned concept to it:
+
+```py
+pipeline = StableDiffusionPipeline.from_pretrained(
+    pretrained_model_name_or_path, mindspore_dtype=ms.float16, use_safetensors=True
+)
+
+pipeline.load_textual_inversion(repo_id_embeds)
+```
+
+Create a prompt with the pre-learned concept by using the special placeholder token `<cat-toy>`, and choose the number of samples and rows of images you'd like to generate:
+
+```py
+prompt = "a grafitti in a favela wall with a <cat-toy> on it"
+
+num_samples_per_row = 2
+num_rows = 2
+```
+
+Then run the pipeline (feel free to adjust the parameters like `num_inference_steps` and `guidance_scale` to see how they affect image quality), save the generated images, and visualize them with the `make_image_grid` helper function you imported at the beginning:
+
+```py
+all_images = []
+for _ in range(num_rows):
+    images = pipeline(prompt, num_images_per_prompt=num_samples_per_row, num_inference_steps=50, guidance_scale=7.5)[0]
+    all_images.extend(images)
+
+grid = make_image_grid(all_images, num_rows, num_samples_per_row)
+grid
+```
+
+ +
+ +## Stable Diffusion XL + +Stable Diffusion XL (SDXL) can also use textual inversion vectors for inference. In contrast to Stable Diffusion 1 and 2, SDXL has two text encoders so you'll need two textual inversion embeddings - one for each text encoder model. + +Let's download the SDXL textual inversion embeddings and have a closer look at it's structure: + +```py +from huggingface_hub import hf_hub_download +from mindone.safetensors.mindspore import load_file + +file = hf_hub_download("dn118/unaestheticXL", filename="unaestheticXLv31.safetensors") +state_dict = load_file(file) +state_dict +``` + +``` +{'clip_g': Parameter (name=clip_g, shape=(8, 1280), dtype=Float16, requires_grad=True) + 'clip_l': Parameter (name=clip_l, shape=(8, 768), dtype=Float16, requires_grad=True)} +``` + +There are two tensors, `"clip_g"` and `"clip_l"`. +`"clip_g"` corresponds to the bigger text encoder in SDXL and refers to +`pipe.text_encoder_2` and `"clip_l"` refers to `pipe.text_encoder`. + +Now you can load each tensor separately by passing them along with the correct text encoder and tokenizer +to [`load_textual_inversion`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/loaders/textual_inversion/#mindone.diffusers.loaders.textual_inversion.TextualInversionLoaderMixin.load_textual_inversion): + +```py +from mindone.diffusers import StableDiffusionXLPipeline +import mindspore as ms +import numpy as np + +pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", mindspore_dtype=ms.float16) + +pipe.load_textual_inversion(state_dict["clip_g"], token="unaestheticXLv31", text_encoder=pipe.text_encoder_2, tokenizer=pipe.tokenizer_2) +pipe.load_textual_inversion(state_dict["clip_l"], token="unaestheticXLv31", text_encoder=pipe.text_encoder, tokenizer=pipe.tokenizer) + +# the embedding should be used as a negative embedding, so we pass it as a negative prompt +generator = np.random.Generator(np.random.PCG64(33)) +image = pipe("a woman standing in front of a mountain", negative_prompt="unaestheticXLv31", generator=generator)[0][0] +image +``` diff --git a/docs/diffusers/using-diffusers/unconditional_image_generation.md b/docs/diffusers/using-diffusers/unconditional_image_generation.md new file mode 100644 index 0000000000..931d4a3bd2 --- /dev/null +++ b/docs/diffusers/using-diffusers/unconditional_image_generation.md @@ -0,0 +1,42 @@ + + +# Unconditional image generation + +Unconditional image generation generates images that look like a random sample from the training data the model was trained on because the denoising process is not guided by any additional context like text or image. + +To get started, use the [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline) to load the [anton-l/ddpm-butterflies-128](https://huggingface.co/anton-l/ddpm-butterflies-128) checkpoint to generate images of butterflies. The [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline) downloads and caches all the model components required to generate an image. + +```py +from mindone.diffusers import DiffusionPipeline + +generator = DiffusionPipeline.from_pretrained("anton-l/ddpm-butterflies-128") +image = generator()[0][0] +image +``` + +!!! tip + + Want to generate images of something else? 
Take a look at the training [guide](../training/unconditional_training.md) to learn how to train a model to generate your own images. + +The output image is a [`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html?highlight=image#the-image-class) object that can be saved: + +```py +image.save("generated_image.png") +``` + +You can also try experimenting with the `num_inference_steps` parameter, which controls the number of denoising steps. More denoising steps typically produce higher quality images, but it'll take longer to generate. Feel free to play around with this parameter to see how it affects the image quality. + +```py +image = generator(num_inference_steps=100)[0][0] +image +``` diff --git a/docs/diffusers/using-diffusers/write_own_pipeline.md b/docs/diffusers/using-diffusers/write_own_pipeline.md index b4405420b5..eb1670cecb 100644 --- a/docs/diffusers/using-diffusers/write_own_pipeline.md +++ b/docs/diffusers/using-diffusers/write_own_pipeline.md @@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License. # Understanding pipelines, models and schedulers -🧨 Diffusers is designed to be a user-friendly and flexible toolbox for building diffusion systems tailored to your use-case. At the core of the toolbox are models and schedulers. While the [`DiffusionPipeline`] bundles these components together for convenience, you can also unbundle the pipeline and use the models and schedulers separately to create new diffusion systems. +🧨 Diffusers is designed to be a user-friendly and flexible toolbox for building diffusion systems tailored to your use-case. At the core of the toolbox are models and schedulers. While the [`DiffusionPipeline`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview/#mindone.diffusers.DiffusionPipeline) bundles these components together for convenience, you can also unbundle the pipeline and use the models and schedulers separately to create new diffusion systems. In this tutorial, you'll learn how to use models and schedulers to assemble a diffusion system for inference, starting with a basic pipeline and then progressing to the Stable Diffusion pipeline. @@ -20,82 +20,83 @@ In this tutorial, you'll learn how to use models and schedulers to assemble a di A pipeline is a quick and easy way to run a model for inference, requiring no more than four lines of code to generate an image: -```pycon ->>> from mindone.diffusers import DDPMPipeline +```python +from mindone.diffusers import DDPMPipeline ->>> ddpm = DDPMPipeline.from_pretrained("google/ddpm-cat-256", use_safetensors=True) ->>> image = ddpm(num_inference_steps=25)[0][0] ->>> image +ddpm = DDPMPipeline.from_pretrained("google/ddpm-cat-256", use_safetensors=True) +image = ddpm(num_inference_steps=1000)[0][0] +image ``` -
- Image of cat created from DDPMPipeline +
+ Image of cat created from DDPMPipeline
That was super easy, but how did the pipeline do that? Let's breakdown the pipeline and take a look at what's happening under the hood. -In the example above, the pipeline contains a [`UNet2DModel`] model and a [`DDPMScheduler`]. The pipeline denoises an image by taking random noise the size of the desired output and passing it through the model several times. At each timestep, the model predicts the *noise residual* and the scheduler uses it to predict a less noisy image. The pipeline repeats this process until it reaches the end of the specified number of inference steps. +In the example above, the pipeline contains a [`UNet2DModel`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/unet2d/#mindone.diffusers.UNet2DModel) model and a [`DDPMScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/ddpm/#mindone.diffusers.DDPMScheduler). The pipeline denoises an image by taking random noise the size of the desired output and passing it through the model several times. At each timestep, the model predicts the *noise residual* and the scheduler uses it to predict a less noisy image. The pipeline repeats this process until it reaches the end of the specified number of inference steps. To recreate the pipeline with the model and scheduler separately, let's write our own denoising process. 1. Load the model and scheduler: - ```pycon - >>> from mindone.diffusers import DDPMScheduler, UNet2DModel +```python +from mindone.diffusers import DDPMScheduler, UNet2DModel - >>> scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256") - >>> model = UNet2DModel.from_pretrained("google/ddpm-cat-256", use_safetensors=True) - ``` +scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256") +model = UNet2DModel.from_pretrained("google/ddpm-cat-256", use_safetensors=True) +``` 2. Set the number of timesteps to run the denoising process for: - ```pycon - >>> scheduler.set_timesteps(50) - ``` +```python +scheduler.set_timesteps(50) +``` 3. Setting the scheduler timesteps creates a tensor with evenly spaced elements in it, 50 in this example. Each element corresponds to a timestep at which the model denoises an image. When you create the denoising loop later, you'll iterate over this tensor to denoise an image: - ```pycon - >>> scheduler.timesteps - tensor([980, 960, 940, 920, 900, 880, 860, 840, 820, 800, 780, 760, 740, 720, - 700, 680, 660, 640, 620, 600, 580, 560, 540, 520, 500, 480, 460, 440, - 420, 400, 380, 360, 340, 320, 300, 280, 260, 240, 220, 200, 180, 160, - 140, 120, 100, 80, 60, 40, 20, 0]) - ``` +```python +scheduler.timesteps +Tensor(shape=[50], dtype=Int64, value=[980, 960, 940, 920, 900, 880, 860, + 840, 820, 800, 780, 760, 740, 720, 700, 680, 660, 640, 620, 600, 580, + 560, 540, 520, 500, 480, 460, 440, 420, 400, 380, 360, 340, 320, 300, + 280, 260, 240, 220, 200, 180, 160, 140, 120, 100, 80, 60, 40, 20, + 0]) +``` 4. Create some random noise with the same shape as the desired output: - ```pycon - >>> import mindspore +```python +import mindspore - >>> sample_size = model.config.sample_size - >>> noise = mindspore.ops.randn((1, 3, sample_size, sample_size)) - ``` +sample_size = model.config.sample_size +noise = mindspore.ops.randn((1, 3, sample_size, sample_size)) +``` -5. Now write a loop to iterate over the timesteps. At each timestep, the model does a [`UNet2DModel.forward`] pass and returns the noisy residual. 
The scheduler's [`~DDPMScheduler.step`] method takes the noisy residual, timestep, and input and it predicts the image at the previous timestep. This output becomes the next input to the model in the denoising loop, and it'll repeat until it reaches the end of the `timesteps` array. +5. Now write a loop to iterate over the timesteps. At each timestep, the model does a [`UNet2DModel.construct`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/unet2d/#mindone.diffusers.UNet2DModel.construct) pass and returns the noisy residual. The scheduler's [`step`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/ddpm/#mindone.diffusers.DDPMScheduler.step) method takes the noisy residual, timestep, and input and it predicts the image at the previous timestep. This output becomes the next input to the model in the denoising loop, and it'll repeat until it reaches the end of the `timesteps` array. - ```pycon - >>> input = noise +```python +input = noise - >>> for t in scheduler.timesteps: - ... noisy_residual = model(input, t)[0] - ... previous_noisy_sample = scheduler.step(noisy_residual, t, input)[0] - ... input = previous_noisy_sample - ``` +for t in scheduler.timesteps: +noisy_residual = model(input, t)[0] +previous_noisy_sample = scheduler.step(noisy_residual, t, input)[0] +input = previous_noisy_sample +``` This is the entire denoising process, and you can use this same pattern to write any diffusion system. 6. The last step is to convert the denoised output into an image: - ```pycon - >>> from PIL import Image - >>> import numpy as np +```python +from PIL import Image +import numpy as np - >>> image = (input / 2 + 0.5).clamp(0, 1).squeeze() - >>> image = (image.permute(1, 2, 0) * 255).round().to(mindspore.uint8).numpy() - >>> image = Image.fromarray(image) - >>> image - ``` +image = (input / 2 + 0.5).clamp(0, 1).squeeze() +image = (image.permute(1, 2, 0) * 255).round().to(mindspore.uint8).numpy() +image = Image.fromarray(image) +image +``` In the next section, you'll put your skills to the test and breakdown the more complex Stable Diffusion pipeline. The steps are more or less the same. You'll initialize the necessary components, and set the number of timesteps to create a `timestep` array. The `timestep` array is used in the denoising loop, and for each element in this array, the model predicts a less noisy image. The denoising loop iterates over the `timestep`'s, and at each timestep, it outputs a noisy residual and the scheduler uses it to predict a less noisy image at the previous timestep. This process is repeated until you reach the end of the `timestep` array. @@ -111,31 +112,31 @@ As you can see, this is already more complex than the DDPM pipeline which only c 💡 Read the [How does Stable Diffusion work?](https://huggingface.co/blog/stable_diffusion#how-does-stable-diffusion-work) blog for more details about how the VAE, UNet, and text encoder models work. -Now that you know what you need for the Stable Diffusion pipeline, load all these components with the [`~ModelMixin.from_pretrained`] method. 
You can find them in the pretrained [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) checkpoint, and each component is stored in a separate subfolder: - -```pycon ->>> from PIL import Image ->>> import mindspore ->>> from transformers import CLIPTokenizer ->>> from mindone.transformers import CLIPTextModel ->>> from mindone.diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler - ->>> vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae", use_safetensors=True) ->>> tokenizer = CLIPTokenizer.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="tokenizer") ->>> text_encoder = CLIPTextModel.from_pretrained( -... "CompVis/stable-diffusion-v1-4", subfolder="text_encoder", use_safetensors=True -... ) ->>> unet = UNet2DConditionModel.from_pretrained( -... "CompVis/stable-diffusion-v1-4", subfolder="unet", use_safetensors=True -... ) +Now that you know what you need for the Stable Diffusion pipeline, load all these components with the [`from_pretrained`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/overview/#mindone.diffusers.ModelMixin.from_pretrained) method. You can find them in the pretrained [`stable-diffusion-v1-5/stable-diffusion-v1-5`](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) checkpoint, and each component is stored in a separate subfolder: + +```python +from PIL import Image +import mindspore +from transformers import CLIPTokenizer +from mindone.transformers import CLIPTextModel +from mindone.diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler + +vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae", use_safetensors=True) +tokenizer = CLIPTokenizer.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="tokenizer") +text_encoder = CLIPTextModel.from_pretrained( + "CompVis/stable-diffusion-v1-4", subfolder="text_encoder", use_safetensors=True +) +unet = UNet2DConditionModel.from_pretrained( + "CompVis/stable-diffusion-v1-4", subfolder="unet", use_safetensors=True +) ``` -Instead of the default [`PNDMScheduler`], exchange it for the [`UniPCMultistepScheduler`] to see how easy it is to plug a different scheduler in: +Instead of the default [`PNDMScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/pndm/#mindone.diffusers.PNDMScheduler), exchange it for the [`UniPCMultistepScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/unipc/#mindone.diffusers.UniPCMultistepScheduler) to see how easy it is to plug a different scheduler in: -```pycon ->>> from mindone.diffusers import UniPCMultistepScheduler +```python +from mindone.diffusers import UniPCMultistepScheduler ->>> scheduler = UniPCMultistepScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler") +scheduler = UniPCMultistepScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler") ``` ### Create text embeddings @@ -148,38 +149,38 @@ The next step is to tokenize the text to generate embeddings. The text is used t Feel free to choose any prompt you like if you want to generate something else! 
-```pycon ->>> prompt = ["a photograph of an astronaut riding a horse"] ->>> height = 512 # default height of Stable Diffusion ->>> width = 512 # default width of Stable Diffusion ->>> num_inference_steps = 25 # Number of denoising steps ->>> guidance_scale = 7.5 # Scale for classifier-free guidance ->>> generator = np.random.Generator(np.random.PCG64(seed=0)) # Seed generator to create the initial latent noise ->>> batch_size = len(prompt) +```python +prompt = ["a photograph of an astronaut riding a horse"] +height = 512 # default height of Stable Diffusion +width = 512 # default width of Stable Diffusion +num_inference_steps = 25 # Number of denoising steps +guidance_scale = 7.5 # Scale for classifier-free guidance +generator = np.random.Generator(np.random.PCG64(seed=0)) # Seed generator to create the initial latent noise +batch_size = len(prompt) ``` Tokenize the text and generate the embeddings from the prompt: -```pycon ->>> text_input = tokenizer( -... prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="np" -... ) +```python +text_input = tokenizer( + prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="np" +) ->>> text_embeddings = text_encoder(mindspore.Tensor(text_input.input_ids))[0] +text_embeddings = text_encoder(mindspore.Tensor(text_input.input_ids))[0] ``` You'll also need to generate the *unconditional text embeddings* which are the embeddings for the padding token. These need to have the same shape (`batch_size` and `seq_length`) as the conditional `text_embeddings`: -```pycon ->>> max_length = text_input.input_ids.shape[-1] ->>> uncond_input = tokenizer([""] * batch_size, padding="max_length", max_length=max_length, return_tensors="np") ->>> uncond_embeddings = text_encoder(mindspore.Tensor(uncond_input.input_ids))[0] +```python +max_length = text_input.input_ids.shape[-1] +uncond_input = tokenizer([""] * batch_size, padding="max_length", max_length=max_length, return_tensors="np") +uncond_embeddings = text_encoder(mindspore.Tensor(uncond_input.input_ids))[0] ``` Let's concatenate the conditional and unconditional embeddings into a batch to avoid doing two forward passes: -```pycon ->>> text_embeddings = mindspore.ops.cat([uncond_embeddings, text_embeddings]) +```python +text_embeddings = mindspore.ops.cat([uncond_embeddings, text_embeddings]) ``` ### Create random noise @@ -190,22 +191,22 @@ Next, generate some initial random noise as a starting point for the diffusion p 💡 The height and width are divided by 8 because the `vae` model has 3 down-sampling layers. You can check by running the following: - ```pycon - >>> 2 ** (len(vae.config.block_out_channels) - 1) == 8 + ```python + print(2 ** (len(vae.config.block_out_channels) - 1) == 8) ``` -```pycon ->>> latents = mindspore.ops.randn( -... (batch_size, unet.config.in_channels, height // 8, width // 8), -... 
) +```python +latents = mindspore.ops.randn( + (batch_size, unet.config.in_channels, height // 8, width // 8), +) ``` ### Denoise the image -Start by scaling the input with the initial noise distribution, *sigma*, the noise scale value, which is required for improved schedulers like [`UniPCMultistepScheduler`]: +Start by scaling the input with the initial noise distribution, *sigma*, the noise scale value, which is required for improved schedulers like [`UniPCMultistepScheduler`](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/unipc/#mindone.diffusers.UniPCMultistepScheduler): -```pycon ->>> latents = latents * scheduler.init_noise_sigma +```python +latents = latents * scheduler.init_noise_sigma ``` The last step is to create the denoising loop that'll progressively transform the pure noise in `latents` to an image described by your prompt. Remember, the denoising loop needs to do three things: @@ -214,49 +215,49 @@ The last step is to create the denoising loop that'll progressively transform th 2. Iterate over the timesteps. 3. At each timestep, call the UNet model to predict the noise residual and pass it to the scheduler to compute the previous noisy sample. -```pycon ->>> from tqdm.auto import tqdm - ->>> scheduler.set_timesteps(num_inference_steps) - ->>> for t in tqdm(scheduler.timesteps): -... # expand the latents if we are doing classifier-free guidance to avoid doing two forward passes. -... latent_model_input = mindspore.ops.cat([latents] * 2) -... -... latent_model_input = scheduler.scale_model_input(latent_model_input, timestep=t) -... -... # predict the noise residual -... noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings)[0] -... -... # perform guidance -... noise_pred_uncond, noise_pred_text = noise_pred.chunk(2) -... noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond) -... -... # compute the previous noisy sample x_t -> x_t-1 -... latents = scheduler.step(noise_pred, t, latents)[0] +```python +from tqdm.auto import tqdm + +scheduler.set_timesteps(num_inference_steps) + +for t in tqdm(scheduler.timesteps): + # expand the latents if we are doing classifier-free guidance to avoid doing two forward passes. + latent_model_input = mindspore.ops.cat([latents] * 2) + + latent_model_input = scheduler.scale_model_input(latent_model_input, timestep=t) + + # predict the noise residual + noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings)[0] + + # perform guidance + noise_pred_uncond, noise_pred_text = noise_pred.chunk(2) + noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond) + + # compute the previous noisy sample x_t -> x_t-1 + latents = scheduler.step(noise_pred, t, latents)[0] ``` ### Decode the image The final step is to use the `vae` to decode the latent representation into an image and get the decoded output with `sample`: -```pycon ->>> # scale and decode the image latents with vae ->>> latents = 1 / 0.18215 * latents ->>> image = vae.decode(latents)[0] +```python +# scale and decode the image latents with vae +latents = 1 / 0.18215 * latents +image = vae.decode(latents)[0] ``` Lastly, convert the image to a `PIL.Image` to see your generated image! 
-```pycon ->>> image = (image / 2 + 0.5).clamp(0, 1).squeeze() ->>> image = (image.permute(1, 2, 0) * 255).to(mindspore.uint8).numpy() ->>> image = Image.fromarray(image) ->>> image +```python +image = (image / 2 + 0.5).clamp(0, 1).squeeze() +image = (image.permute(1, 2, 0) * 255).to(mindspore.uint8).numpy() +image = Image.fromarray(image) +image ``` -
- +
+
## Next steps @@ -267,5 +268,5 @@ This is really what 🧨 Diffusers is designed for: to make it intuitive and eas For your next steps, feel free to: -* Learn how to [build and contribute a pipeline](../using-diffusers/contribute_pipeline) to 🧨 Diffusers. We can't wait and see what you'll come up with! -* Explore [existing pipelines](../api/pipelines/overview) in the library, and see if you can deconstruct and build a pipeline from scratch using the models and schedulers separately. +* Learn how to [build and contribute a pipeline](../using-diffusers/contribute_pipeline.md) to 🧨 Diffusers. We can't wait and see what you'll come up with! +* Explore [existing pipelines](../api/pipelines/overview.md) in the library, and see if you can deconstruct and build a pipeline from scratch using the models and schedulers separately. diff --git a/docs/stylesheets/extra.css b/docs/stylesheets/extra.css index 4d6fe3e603..4036792521 100644 --- a/docs/stylesheets/extra.css +++ b/docs/stylesheets/extra.css @@ -17,3 +17,7 @@ --md-footer-bg-color: hsla(0, 0%, 100%, 0.87); --md-footer-bg-color--dark: hsla(0, 0%, 100%, 0.32); } + +img { + border-radius: 15px; +} diff --git a/mindone/diffusers/README.md b/mindone/diffusers/README.md index 4b63f10a6f..6461bfa137 100644 --- a/mindone/diffusers/README.md +++ b/mindone/diffusers/README.md @@ -29,13 +29,13 @@ limitations under the License. > [!WARNING] > Due to differences in framework, some APIs will not be identical to [huggingface/diffusers](https://github.com/huggingface/diffusers) in the foreseeable future, see [Limitations](#Limitations) for details. -🤗 Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. Whether you're looking for a simple inference solution or training your own diffusion models, 🤗 Diffusers is a modular toolbox that supports both. Our library is designed with a focus on [usability over performance](https://huggingface.co/docs/diffusers/conceptual/philosophy#usability-over-performance), [simple over easy](https://huggingface.co/docs/diffusers/conceptual/philosophy#simple-over-easy), and [customizability over abstractions](https://huggingface.co/docs/diffusers/conceptual/philosophy#tweakable-contributorfriendly-over-abstraction). +🤗 Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. Whether you're looking for a simple inference solution or training your own diffusion models, 🤗 Diffusers is a modular toolbox that supports both. Our library is designed with a focus on [usability over performance](https://mindspore-lab.github.io/mindone/latest/diffusers/conceptual/philosophy/#usability-over-performance), [simple over easy](https://mindspore-lab.github.io/mindone/latest/diffusers/conceptual/philosophy/#simple-over-easy), and [customizability over abstractions](https://mindspore-lab.github.io/mindone/latest/diffusers/conceptual/philosophy/#tweakable-contributor-friendly-over-abstraction). 🤗 Diffusers offers three core components: -- State-of-the-art [diffusion pipelines](https://huggingface.co/docs/diffusers/api/pipelines/overview) that can be run in inference with just a few lines of code. -- Interchangeable noise [schedulers](https://huggingface.co/docs/diffusers/api/schedulers/overview) for different diffusion speeds and output quality. 
-- Pretrained [models](https://huggingface.co/docs/diffusers/api/models/overview) that can be used as building blocks, and combined with schedulers, for creating your own end-to-end diffusion systems. +- State-of-the-art [diffusion pipelines](https://mindspore-lab.github.io/mindone/latest/diffusers/api/pipelines/overview) that can be run in inference with just a few lines of code. +- Interchangeable noise [schedulers](https://mindspore-lab.github.io/mindone/latest/diffusers/api/schedulers/overview) for different diffusion speeds and output quality. +- Pretrained [models](https://mindspore-lab.github.io/mindone/latest/diffusers/api/models/overview) that can be used as building blocks, and combined with schedulers, for creating your own end-to-end diffusion systems. ## Quickstart @@ -83,7 +83,7 @@ image = Image.fromarray((image * 255).round().astype("uint8")) image ``` -Check out the [Quickstart](https://huggingface.co/docs/diffusers/quicktour) to launch your diffusion journey today! +Check out the [Quickstart](https://mindspore-lab.github.io/mindone/latest/diffusers/quicktour/) to launch your diffusion journey today! ## Roadmap