A simple transformer-based autoencoder model.
- Encoder and decoder are both vanilla ViT models (a minimal sketch of the layout follows this list).
- The skeleton of the code is recycled from Facebook's MAE repository with several simplifications.
- Work in progress.
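
The overall layout is just patch embedding → ViT encoder → per-token linear bottleneck → ViT decoder → per-patch pixel prediction. Below is a minimal PyTorch sketch of that layout, not the code in this repo: it uses `nn.TransformerEncoderLayer` as a stand-in for the timm ViT blocks the MAE codebase builds on, and every name and hyperparameter (`SimpleViTAutoencoder`, `patch_size=64`, `latent_dim=64`, ...) is illustrative. Using large patches to get a small latent grid is likewise an assumption for the sketch, not necessarily how this repo does it.

```python
# Hypothetical sketch, not the repo's actual code: layer names, sizes, and the use of
# nn.TransformerEncoderLayer (instead of timm ViT blocks as in MAE) are illustrative.
import torch
import torch.nn as nn


def vit_blocks(dim: int, depth: int, heads: int) -> nn.Module:
    # A stack of pre-norm transformer blocks, standing in for timm ViT blocks.
    layer = nn.TransformerEncoderLayer(
        d_model=dim, nhead=heads, dim_feedforward=4 * dim,
        activation="gelu", batch_first=True, norm_first=True,
    )
    return nn.TransformerEncoder(layer, num_layers=depth)


class SimpleViTAutoencoder(nn.Module):
    # Plain ViT encoder -> per-token linear bottleneck -> plain ViT decoder -> pixels.
    def __init__(self, img_size=256, patch_size=64, dim=768, latent_dim=64, depth=6, heads=12):
        super().__init__()
        self.patch_size = patch_size
        self.grid = img_size // patch_size            # 256 / 64 -> 4x4 = 16 latent tokens
        n_tokens = self.grid ** 2

        # Patchify + embed: a Conv2d with kernel == stride == patch size.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.enc_pos = nn.Parameter(torch.zeros(1, n_tokens, dim))
        self.encoder = vit_blocks(dim, depth, heads)
        self.to_latent = nn.Linear(dim, latent_dim)   # compress each token's channels

        self.from_latent = nn.Linear(latent_dim, dim)
        self.dec_pos = nn.Parameter(torch.zeros(1, n_tokens, dim))
        self.decoder = vit_blocks(dim, depth, heads)
        self.to_pixels = nn.Linear(dim, patch_size * patch_size * 3)

    def encode(self, x):                              # x: (B, 3, H, W)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)     # (B, N, dim)
        return self.to_latent(self.encoder(tokens + self.enc_pos))  # (B, N, latent_dim)

    def decode(self, z):                              # z: (B, N, latent_dim)
        tokens = self.decoder(self.from_latent(z) + self.dec_pos)
        patches = self.to_pixels(tokens)              # (B, N, p*p*3)
        b, p, g = patches.shape[0], self.patch_size, self.grid
        # Un-patchify the per-token pixel predictions back into an image.
        patches = patches.reshape(b, g, g, p, p, 3).permute(0, 5, 1, 3, 2, 4)
        return patches.reshape(b, 3, g * p, g * p)

    def forward(self, x):
        return self.decode(self.encode(x))
```

With these illustrative defaults, a forward pass maps a `(B, 3, 256, 256)` batch to a 16-token latent and back to pixels.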
Motivations:
- Better representational alignment with transformer models used in downstream tasks, e.g. diffusion transformers.
- Trading off embedding dimensionality for a much smaller spatial grid, e.g. training diffusion transformers on a 4x4 grid = 16 spatial tokens (in principle convnet-based autoencoders can do this too, but it is more natural and convenient with transformers). Self-attention cost scales quadratically with the number of spatial tokens but only linearly with their dimensionality, so this trade-off yields more compute-efficient models (see the back-of-the-envelope sketch after this list). It also opens the door to training models on massively larger images/videos.
- Current "first stage models" used for image/video compression are too complicated, e.g. using adversarial losses (among others). I'd like to simplify this process by showing simple plain autoencoders are performant as first stage models.