MMDiT implementation and text-to-image training with rectified flows #155
Conversation
Left a few comments about structure. Basically, I'd like to see more of the transformer logic confined to the transformer, with the ComposerModel getting as close to "create the 3 models and call them" as possible. Overall, it's awesome, and I conditionally approve since it's non-breaking and has successful test runs.
This PR contains an implementation of the MMDiT model from the SD3 paper, along with a model class for using it to train text-to-image models. To support this, a generic model inference class is also included.
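As context for the MMDiT design: its defining feature is that image and text tokens keep separate projection weights per modality, but attention runs over the concatenated joint sequence. The sketch below is not the PR's actual code; it is a minimal single-head numpy illustration, and the function name and argument layout are assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(img, txt, img_proj, txt_proj):
    """One single-head MMDiT-style joint attention step (illustrative).

    Each modality has its own Q/K/V projections (img_proj and txt_proj
    are (Wq, Wk, Wv) tuples), but attention runs over the concatenated
    image+text sequence so the two streams exchange information.
    """
    qi, ki, vi = (img @ w for w in img_proj)
    qt, kt, vt = (txt @ w for w in txt_proj)
    q = np.concatenate([qi, qt])
    k = np.concatenate([ki, kt])
    v = np.concatenate([vi, vt])
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product attention
    out = softmax(scores) @ v
    # Split the joint sequence back into per-modality streams.
    return out[: len(img)], out[len(img):]
```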
Major additions:
- `diffusion/inference/inference_model.py` has a `ModelInference` class for inference with arbitrary models from `models.py`
- `diffusion/models/models.py` includes a `text_to_image_transformer` model for SD3-style MMDiT
- `diffusion/models/t2i_transformer.py` has the `ComposerModel` class for the MMDiT text-to-image model
- `diffusion/models/transformer.py` has the layers/blocks for the MMDiT model
- `diffusion/train.py` includes a new function to configure the optimizer for the new text-to-image model
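For readers unfamiliar with the rectified-flow objective the PR title refers to: training interpolates linearly between clean latents and noise, and the model regresses the constant velocity along that straight-line path. This is a hedged numpy sketch of one common convention (as in the SD3 paper), not the PR's actual implementation; the function name and sign convention are assumptions.

```python
import numpy as np

def rectified_flow_targets(x0, noise, t):
    """Build rectified-flow training inputs and targets (illustrative).

    x0: clean latents, noise: Gaussian sample, t: per-example times in [0, 1].
    Returns the noised input x_t on the straight-line path and the
    constant velocity target the model is trained to predict.
    """
    # Reshape t so it broadcasts over all non-batch dimensions.
    t = np.reshape(t, (-1,) + (1,) * (x0.ndim - 1))
    x_t = (1.0 - t) * x0 + t * noise  # linear interpolation, no SDE schedule
    target = noise - x0               # d(x_t)/dt is constant along the path
    return x_t, target
```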