[Updated 2024/08/08] Code released.

[Originally planned for release in July 2024]

PyTorch implementation of Cross-view Masked Diffusion Transformers for Person Image Synthesis (ICML 2024).

Authors: Trung X. Pham, Zhang Kang, and Chang D. Yoo.

## Introduction

X-MDPT (Cross-view Masked Diffusion Prediction Transformers) is the first diffusion transformer-based framework for pose-guided human image generation. X-MDPT demonstrates exceptional scalability and performance, with FID, SSIM, and LPIPS improving consistently as model size increases. Despite its straightforward design, the framework outperforms state-of-the-art approaches on the DeepFashion dataset while excelling in training efficiency and inference speed. The compact 33MB model achieves an FID of 7.42, surpassing the previous most efficient U-Net latent diffusion approach, PoCoLD (FID 8.07, 396MB), with $11\times$ fewer parameters. The largest model surpasses the state-of-the-art pixel-based diffusion model PIDM using two-thirds of the parameters and achieves $5.43\times$ faster inference.

## Efficiency Advantages

*(Figure: efficiency advantages of X-MDPT.)*

## Comparison with State-of-the-Art Methods

*(Figure: comparison with state-of-the-art methods.)*

## Consistent Targets

*(Figure: consistent targets produced by X-MDPT.)*

## Setup Environment

We tested with PyTorch 1.12 + CUDA 11.6 inside a Docker container.

```bash
conda create -n xmdpt python=3.8
conda activate xmdpt
pip install -r requirements.txt
```
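After installation, a quick sanity check (not part of the repo) can confirm that PyTorch and the GPU are visible before launching training or inference:

```python
# Optional environment check: confirm the PyTorch build and GPU visibility.
import torch

print("torch version:", torch.__version__)           # tested setup uses 1.12.x
print("cuda available:", torch.cuda.is_available())  # expect True with CUDA 11.6
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```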

## Prepare Dataset

Download the DeepFashion dataset and process it into LMDB format for easy training and inference. Refer to PIDM (CVPR 2023) for building this LMDB (a small sanity-check sketch is given after the tree). The data structure should be as follows:

```
datasets/
|-- [  38]  deepfashion
|   |-- [6.4M]  train_pairs.txt
|   |-- [2.1M]  train.lst
|   |-- [817K]  test_pairs.txt
|   |-- [182K]  test.lst
|   |-- [4.0K]  256-256
|   |   |-- [8.0K]  lock.mdb
|   |   `-- [2.4G]  data.mdb
|   |-- [8.7M]  pose.rar
|   |-- [4.0K]  512-512
|   |   |-- [8.0K]  lock.mdb
|   |   `-- [8.4G]  data.mdb
|   `-- [4.0K]  pose
|       |-- [4.0K]  WOMEN
|       |   |-- [ 12K]  Shorts
|       |   |   |-- [4.0K]  id_00007890
|       |   |   |   `-- [ 900]  04_4_full.txt
|       |-- [4.0K]  MEN
...
```
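As a quick sanity check (not part of the repo, and assuming the `lmdb` Python package from the requirements), you can confirm that the LMDB opens and count its records:

```python
# Optional check: open the 256x256 LMDB read-only and report its record count.
import lmdb

path = "datasets/deepfashion/256-256"  # directory holding data.mdb / lock.mdb
env = lmdb.open(path, readonly=True, lock=False, readahead=False, meminit=False)
with env.begin(write=False) as txn:
    print("entries:", txn.stat()["entries"])
    # Peek at a few keys to see how records are indexed.
    for i, (key, _) in enumerate(txn.cursor()):
        print(key)
        if i >= 4:
            break
env.close()
```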

## Training

```bash
CUDA_VISIBLE_DEVICES=0 bash run_train.sh
```

By default, checkpoints are saved every 10k steps; you can use them for inference as described below.
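For reference, the save-every-10k-steps behavior follows the usual PyTorch checkpointing pattern sketched below; this is a generic illustration with hypothetical names, not the repo's training loop.

```python
# Generic illustration of periodic checkpointing (hypothetical names, not the
# repo's code): save model and optimizer state every 10k steps.
import torch

SAVE_EVERY = 10_000

def maybe_save_checkpoint(model, optimizer, step, out_dir="checkpoints"):
    if step > 0 and step % SAVE_EVERY == 0:
        torch.save(
            {"step": step,
             "model": model.state_dict(),
             "optimizer": optimizer.state_dict()},
            f"{out_dir}/xmdpt_step_{step:07d}.pt",
        )
```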

## Inference

Download all checkpoints and the VAE (only the decoder is fine-tuned), and place them in the paths expected by the default infer_xmdpt.py.

For the DeepFashion test set, run:

```bash
CUDA_VISIBLE_DEVICES=0 python infer_xmdpt.py
```

The output image samples will be saved as in test_img of this repo.
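To eyeball many samples at once, a small helper like the one below can tile the generated images into a single grid. It is not part of the repo; the test_img path and the .png extension are assumptions, so adjust them to match the actual output format.

```python
# Optional helper: tile the first 16 generated samples into one grid image.
from pathlib import Path

import torch
from PIL import Image
from torchvision import transforms
from torchvision.utils import make_grid, save_image

to_tensor = transforms.ToTensor()
paths = sorted(Path("test_img").glob("*.png"))[:16]  # adjust glob if samples are .jpg
imgs = [to_tensor(Image.open(p).convert("RGB")) for p in paths]
grid = make_grid(torch.stack(imgs), nrow=4)
save_image(grid, "samples_grid.png")
```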

For an arbitrary image, run the following (not yet implemented):

```bash
CUDA_VISIBLE_DEVICES=0 python infer_xmdpt.py --image_path test.png
```

## Pretrained Models

All of our models were trained and tested on a single A100 (80GB) GPU.

| Model | Step | Resolution | FID | Params | Inference Time | Link |
|---|---|---|---|---|---|---|
| X-MDPT-S | 300k | 256x256 | 7.42 | 33.5M | 1.1s | Link |
| X-MDPT-B | 300k | 256x256 | 6.72 | 131.9M | 1.3s | Link |
| X-MDPT-L | 300k | 256x256 | 6.60 | 460.2M | 3.1s | Link |
| VAE | - | - | - | - | - | Link |
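After downloading, a quick inspection (not part of the repo; the filename is hypothetical, and this assumes the checkpoint is a state dict or a dict containing one) can verify that the parameter count roughly matches the table:

```python
# Optional check: load a downloaded checkpoint on CPU and count its parameters.
import torch

ckpt = torch.load("xmdpt_s_300k.pt", map_location="cpu")  # hypothetical filename
state = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt
num_params = sum(t.numel() for t in state.values() if torch.is_tensor(t))
print(f"parameters: {num_params / 1e6:.1f}M")  # X-MDPT-S should be ~33.5M
```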

## Expected Outputs

*(Figure: expected output samples.)*

## Citation

If X-MDPT is useful or relevant to your research, please kindly recognize our contributions by citing our paper:

```bibtex
@inproceedings{pham2024crossview,
  title={Cross-view Masked Diffusion Transformers for Person Image Synthesis},
  author={Trung X. Pham and Kang Zhang and Chang D. Yoo},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024},
  url={https://openreview.net/forum?id=jEoIkNkqyc}
}
```

## Acknowledgements

This work was supported by the Institute for Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korean government (MSIT) (No. 2021-0-01381, Development of Causal AI through Video Understanding and Reinforcement Learning, and Its Applications to Real Environments) and (No. 2022-0-00184, Development and Study of AI Technologies to Inexpensively Conform to Evolving Policy on Ethics).

## Helpful Repos

Thanks to the nice works MDT (ICCV 2023) and PIDM (CVPR 2023) for publishing their code.