mindspore-lab · SamitHuang · Sep 27, 2024 · Aug 1, 2024
@@ -148,6 +148,7 @@ Your contributions are welcome.
 * [Data Processing](#data-processing)
 * [Training](#training)
 * [Evaluation](#evaluation)
+* [VAE Training & Evaluation](#vae-training--evaluation)
 * [Contribution](#contribution)
 * [Acknowledgement](#acknowledgement)
 
@@ -284,6 +285,7 @@ parameters is 724M. More information about training can be found in HPC-AI Tech'
 </details>
 
 
+
 ## Inference
 
 ### Open-Sora 1.2 and 1.1 Command Line Inference
@@ -759,7 +761,80 @@ Here are some generation results after fine-tuning STDiT on a subset of WebVid d
 #### Quality Evaluation
 For quality evaluation, please refer to the original HPC-AI Tech [evaluation doc](https://github.com/hpcaitech/Open-Sora/blob/main/eval/README.md) for video generation quality evaluation.
 
-</details>
+
+## VAE Training & Evaluation
+
+A 3D-VAE pipeline consisting of a spatial VAE followed by a temporal VAE is trained in OpenSora v1.2. For more details, refer to the [VAE Documentation](https://github.com/hpcaitech/Open-Sora/blob/main/docs/vae.md).
+
+### Prepare Pretrained Weights
+
+- Download the pretrained VAE-2D checkpoint from [PixArt-alpha/pixart_sigma_sdxlvae_T5_diffusers](https://huggingface.co/PixArt-alpha/pixart_sigma_sdxlvae_T5_diffusers/tree/main/vae) if you aim to train VAE-3D from spatial VAE initialization.
+
+    Convert it to a MindSpore checkpoint:
+    ```
+    python tools/convert_vae1.2.py --src /path/to/pixart_sigma_sdxlvae_T5_diffusers/vae/diffusion_pytorch_model.safetensors --target models/sdxl_vae.ckpt --from_vae2d
+    ```
+
+- Download the pretrained VAE-3D checkpoint from [hpcai-tech/OpenSora-VAE-v1.2](https://huggingface.co/hpcai-tech/OpenSora-VAE-v1.2/tree/main) if you aim to train VAE-3D from the VAE-3D model pre-trained with 3 stages.
+
+    Convert it to a MindSpore checkpoint:
+    ```
+    python tools/convert_vae1.2.py --src /path/OpenSora-VAE-v1.2/models.safetensors --target models/OpenSora-VAE-v1.2/sdxl_vae.ckpt
+    ```
+
+- Download the LPIPS MindSpore checkpoint from [here](https://download-mindspore.osinfra.cn/toolkits/mindone/autoencoders/lpips_vgg-426bf45c.ckpt) and put it under `models/`.
+
+
+### Data Preprocessing
+Before VAE-3D training, prepare a CSV annotation file for the training videos. The CSV file lists the path of each video relative to the root `video_folder`. For example:
+```
+video
+dance/vid001.mp4
+dance/vid002.mp4
+...
+```
+
+Taking UCF-101 as an example, download the [UCF-101](https://www.crcv.ucf.edu/data/UCF101.php) dataset and extract it to the `datasets/UCF-101` folder. You can generate the CSV annotation files by running `python tools/annotate_vae_ucf101.py`. This produces two CSV files, `datasets/ucf101_train.csv` and `datasets/ucf101_test.csv`, for training and testing respectively.
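
For your own data, the annotation format described above can be produced with a few lines of Python. The following is a minimal sketch, not the logic of `tools/annotate_vae_ucf101.py`: the helper name `build_video_csv` and the `extensions` parameter are hypothetical, and it only assumes the single `video` column shown in the example.

```python
import csv
import os


def build_video_csv(video_folder, csv_path, extensions=(".mp4", ".avi")):
    """Write a CSV with one `video` column of paths relative to video_folder.

    Hypothetical helper for illustration; returns the list of written paths.
    """
    rel_paths = []
    for dirpath, _dirnames, filenames in os.walk(video_folder):
        for name in filenames:
            if name.lower().endswith(extensions):
                full_path = os.path.join(dirpath, name)
                rel_path = os.path.relpath(full_path, video_folder)
                # Use forward slashes so the CSV looks the same on every OS.
                rel_paths.append(rel_path.replace(os.sep, "/"))
    rel_paths.sort()  # deterministic order across runs

    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["video"])  # header expected by the annotation format
        writer.writerows([p] for p in rel_paths)
    return rel_paths
```

Calling `build_video_csv("datasets/my_videos", "datasets/my_videos.csv")` (example paths) would then give a file that can be passed as `--csv_path`, with `datasets/my_videos` as `--video_folder`.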
+
+
+### Training
+```bash
+# stage 1 training, 8 NPUs
+msrun --worker_num=8 --local_worker_num=8 \
+python scripts/train_vae.py --config configs/vae/train/stage1.yaml --use_parallel=True --csv_path datasets/ucf101_train.csv --video_folder datasets/UCF-101
+
+# stage 2 training, 8 NPUs
+msrun --worker_num=8 --local_worker_num=8 \
+python scripts/train_vae.py --config configs/vae/train/stage2.yaml --use_parallel=True --csv_path datasets/ucf101_train.csv --video_folder datasets/UCF-101
+
+# stage 3 training, 8 NPUs
+msrun --worker_num=8 --local_worker_num=8 \
+python scripts/train_vae.py --config configs/vae/train/stage3.yaml --use_parallel=True --csv_path datasets/ucf101_train.csv --video_folder datasets/UCF-101
+```
+
+You can change the `csv_path` and `video_folder` to train on your own data.
+
+### Performance Evaluation
+To evaluate VAE performance, run VAE inference to generate the reconstructed videos and then compute the scores on them:
+
+```bash
+# video generation and evaluation
+python scripts/inference_vae.py --ckpt_path /path/to/your_vae_ckpt --image_size 256 --num_frames=17 --csv_path datasets/ucf101_test.csv --video_folder datasets/UCF-101
+```
+
+You can change the `csv_path` and `video_folder` to evaluate on your own data.
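
For reference, PSNR (one of the reported scores) follows the standard definition `10 * log10(MAX^2 / MSE)`. The sketch below is plain Python over flat pixel sequences and is only meant to show what the number measures. It is not the implementation used by `scripts/inference_vae.py`, and the function name `psnr` and the 8-bit `max_value` default are assumptions.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(reconstruction) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and have the same length")
    # Mean squared error between original and reconstructed pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: no reconstruction error
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A higher PSNR means the reconstruction is closer to the input. For videos, a common choice is to compute it per frame and average over frames and clips.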
+
+Here, we report the training performance and evaluation results on the UCF-101 dataset.
+
+| Model       | Context      | jit_level | Precision | BS | NPUs | Resolution (frames x H x W) | Train Time (s/step) |    PSNR   |   SSIM  |
+|:------------|:-------------|:--------|:---------:|:--:|:----:|:----------------------:|:-----------------:|:-----------------:|:-----------------:|
+| VAE-3D | D910\*-[CANN C18(0705)](https://repo.mindspore.cn/ascend/ascend910/20240705/)-[MS2.3](https://www.mindspore.cn/install) |    O1  |    BF16   |  1 |  1   |       stage1-17x256x256       |       0.59         |    n.a.      |    n.a.    |
+| VAE-3D | D910\*-[CANN C18(0705)](https://repo.mindspore.cn/ascend/ascend910/20240705/)-[MS2.3](https://www.mindspore.cn/install) |    O1  |    BF16   |  1 |  1   |       stage2-17x256x256      |        0.56         |    n.a.      |    n.a.    |
+| VAE-3D | D910\*-[CANN C18(0705)](https://repo.mindspore.cn/ascend/ascend910/20240705/)-[MS2.3](https://www.mindspore.cn/install) |    O1  |    BF16   |  1 |  8   |       stage3-17x256x256      |       0.97        |    29.29      |    0.88    |
+> Context: {G:GPU, D:Ascend}{chip type}-{CANN version}-{MindSpore version}.
+
+Note that we train with a mixed video and image strategy, i.e. `--mixed_strategy=mixed_video_image`, for stage 3 instead of a random number of frames (`mixed_video_random`). Random frame training will be supported in the future.
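
To make the difference between the two strategies concrete, here is a minimal sketch of how a clip could be sub-sampled under each one. The semantics are an assumption based on the option names and on `mixed_image_ratio` in the configs, not a copy of the training script, and `apply_mixed_strategy` is a hypothetical helper.

```python
import random


def apply_mixed_strategy(frames, strategy, image_ratio=0.2, rng=random):
    """Return the frames one training step would see (assumed semantics).

    - "mixed_video_image": with probability image_ratio keep only the first
      frame (an image sample), otherwise keep the whole clip.
    - "mixed_video_random": keep a random-length prefix of the clip.
    """
    if strategy == "mixed_video_image":
        if rng.random() < image_ratio:
            return frames[:1]
        return frames
    if strategy == "mixed_video_random":
        length = rng.randint(1, len(frames))  # inclusive on both ends
        return frames[:length]
    return frames  # unknown strategy: leave the clip unchanged
```

Under this reading, `mixed_video_image` only ever produces two input lengths (1 or the full clip), while `mixed_video_random` can produce any length, which fits the config TODO about waiting for dynamic shape adaptation.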
+
 
 ## Training and Inference Using the FiT-Like Pipeline
 

@@ -0,0 +1,47 @@
+# model
+model_type: "OpenSoraVAE_V1_2"
+freeze_vae_2d: True
+pretrained_model_path: "models/sdxl_vae.ckpt"
+
+# loss
+perceptual_loss_weight: 0.1
+kl_loss_weight: 1.e-6
+use_real_rec_loss: False
+use_z_rec_loss: True
+use_image_identity_loss: True
+mixed_strategy: "mixed_video_image"
+mixed_image_ratio: 0.2
+
+# data
+dataset_name: "video"
+csv_path: "../videocomposer/datasets/webvid5_copy.csv"
+video_folder: "../videocomposer/datasets/webvid5"
+frame_stride: 1
+num_frames: 17
+image_size: 256
+
+micro_frame_size: null
+micro_batch_size: null
+
+# training recipe
+seed: 42
+use_discriminator: False
+dtype: "fp16"
+batch_size: 1
+clip_grad: True
+max_grad_norm: 1.0
+start_learning_rate: 1.e-5
+scale_lr: False
+use_recompute: False
+
+epochs: 2000
+ckpt_save_interval: 100
+init_loss_scale: 1.
+
+scheduler: "constant"
+use_ema: False
+
+output_path: "outputs/causal_vae"
+
+# ms setting
+jit_level: O0
@@ -0,0 +1,48 @@
+# model
+model_type: "OpenSoraVAE_V1_2"
+freeze_vae_2d: False
+pretrained_model_path: "outputs/vae_stage1.ckpt"
+
+# loss
+perceptual_loss_weight: 0.1
+kl_loss_weight: 1.e-6
+use_real_rec_loss: False
+use_z_rec_loss: True
+use_image_identity_loss: False
+mixed_strategy: "mixed_video_image"
+mixed_image_ratio: 0.2
+
+# data
+dataset_name: "video"
+csv_path: "../videocomposer/datasets/webvid5_copy.csv"
+video_folder: "../videocomposer/datasets/webvid5"
+frame_stride: 1
+num_frames: 17
+image_size: 256
+
+micro_frame_size: null
+micro_batch_size: null
+# flip: True
+
+# training recipe
+seed: 42
+use_discriminator: False
+dtype: "bf16"
+batch_size: 1
+clip_grad: True
+max_grad_norm: 1.0
+start_learning_rate: 1.e-5
+scale_lr: False
+use_recompute: True
+
+epochs: 500
+ckpt_save_interval: 100
+init_loss_scale: 1.
+
+scheduler: "constant"
+use_ema: False
+
+output_path: "outputs/vae_stage2"
+
+# ms setting
+jit_level: O0
@@ -0,0 +1,49 @@
+# model
+model_type: "OpenSoraVAE_V1_2"
+freeze_vae_2d: False
+pretrained_model_path: "outputs/vae_stage2.ckpt"
+
+# loss
+perceptual_loss_weight: 0.1
+kl_loss_weight: 1.e-6
+use_real_rec_loss: True
+use_z_rec_loss: False
+use_image_identity_loss: False
+mixed_strategy: "mixed_video_image"  # TODO: use mixed_video_random after dynamic shape adaptation
+mixed_image_ratio: 0.2
+
+# data
+dataset_name: "video"
+csv_path: "../videocomposer/datasets/webvid5_copy.csv"
+video_folder: "../videocomposer/datasets/webvid5"
+frame_stride: 1
+num_frames: 33  # TODO: set 33 after dynamic shape adaptation and posterior concat fixed
+image_size: 256
+
+micro_frame_size: 17
+micro_batch_size: 4
+# flip: True
+
+# training recipe
+seed: 42
+use_discriminator: False
+dtype: "fp16"
+batch_size: 1
+clip_grad: True
+max_grad_norm: 1.0
+start_learning_rate: 1.e-5
+scale_lr: False
+weight_decay: 0.
+use_recompute: True
+
+epochs: 400
+ckpt_save_interval: 100
+init_loss_scale: 1.
+
+scheduler: "constant"
+use_ema: False
+
+output_path: "outputs/vae_stage3"
+
+# ms setting
+jit_level: O0
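
The stage 3 config above sets `micro_frame_size: 17` with `num_frames: 33`, whereas stages 1 and 2 leave it as `null`. One plausible reading, stated here as an assumption rather than the model's confirmed behavior, is that long clips are passed through the temporal VAE in chunks of at most `micro_frame_size` frames. The hypothetical helper below only illustrates the resulting chunk boundaries.

```python
def split_into_micro_chunks(num_frames, micro_frame_size):
    """Return (start, end) frame index pairs that cover num_frames in order.

    Illustrative only: a None micro_frame_size means the clip is kept whole.
    """
    if micro_frame_size is None:
        return [(0, num_frames)]
    if micro_frame_size <= 0:
        raise ValueError("micro_frame_size must be positive")
    return [
        (start, min(start + micro_frame_size, num_frames))
        for start in range(0, num_frames, micro_frame_size)
    ]
```

With the stage 3 values, a 33-frame clip would be covered by two chunks of 17 and 16 frames, while the 17-frame clips of stages 1 and 2 stay in one piece.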