A simple transformer-based autoencoder model.
- Encoder and decoder are both vanilla ViT models (a minimal sketch of the layout follows this list).
- The skeleton of the code is recycled from Facebook's MAE repository with several simplifications.
- Work in progress.
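
The overall layout is just patch embedding → ViT encoder → per-token linear bottleneck → ViT decoder → per-patch pixel prediction. Below is a minimal PyTorch sketch of that layout, not the code in this repo: it uses `nn.TransformerEncoderLayer` as a stand-in for the timm ViT blocks the MAE codebase builds on, and every name and hyperparameter (`SimpleViTAutoencoder`, `patch_size=64`, `latent_dim=64`, ...) is illustrative. Using large patches to get a small latent grid is likewise an assumption for the sketch, not necessarily how this repo does it.

```python
# Hypothetical sketch, not the repo's actual code: layer names, sizes, and the use of
# nn.TransformerEncoderLayer (instead of timm ViT blocks as in MAE) are illustrative.
import torch
import torch.nn as nn


def vit_blocks(dim: int, depth: int, heads: int) -> nn.Module:
    # A stack of pre-norm transformer blocks, standing in for timm ViT blocks.
    layer = nn.TransformerEncoderLayer(
        d_model=dim, nhead=heads, dim_feedforward=4 * dim,
        activation="gelu", batch_first=True, norm_first=True,
    )
    return nn.TransformerEncoder(layer, num_layers=depth)


class SimpleViTAutoencoder(nn.Module):
    # Plain ViT encoder -> per-token linear bottleneck -> plain ViT decoder -> pixels.
    def __init__(self, img_size=256, patch_size=64, dim=768, latent_dim=64, depth=6, heads=12):
        super().__init__()
        self.patch_size = patch_size
        self.grid = img_size // patch_size            # 256 / 64 -> 4x4 = 16 latent tokens
        n_tokens = self.grid ** 2

        # Patchify + embed: a Conv2d with kernel == stride == patch size.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.enc_pos = nn.Parameter(torch.zeros(1, n_tokens, dim))
        self.encoder = vit_blocks(dim, depth, heads)
        self.to_latent = nn.Linear(dim, latent_dim)   # compress each token's channels

        self.from_latent = nn.Linear(latent_dim, dim)
        self.dec_pos = nn.Parameter(torch.zeros(1, n_tokens, dim))
        self.decoder = vit_blocks(dim, depth, heads)
        self.to_pixels = nn.Linear(dim, patch_size * patch_size * 3)

    def encode(self, x):                              # x: (B, 3, H, W)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)     # (B, N, dim)
        return self.to_latent(self.encoder(tokens + self.enc_pos))  # (B, N, latent_dim)

    def decode(self, z):                              # z: (B, N, latent_dim)
        tokens = self.decoder(self.from_latent(z) + self.dec_pos)
        patches = self.to_pixels(tokens)              # (B, N, p*p*3)
        b, p, g = patches.shape[0], self.patch_size, self.grid
        # Un-patchify the per-token pixel predictions back into an image.
        patches = patches.reshape(b, g, g, p, p, 3).permute(0, 5, 1, 3, 2, 4)
        return patches.reshape(b, 3, g * p, g * p)

    def forward(self, x):
        return self.decode(self.encode(x))
```

With these illustrative defaults, a forward pass maps a `(B, 3, 256, 256)` batch to a 16-token latent and back to pixels.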
Motivations:
- Better representational alignment with transformer models used in downstream tasks, e.g. diffusion transformers.
- Trading off embedding dimensionality for a much smaller spatial grid, e.g. training diffusion transformers on a 4x4 grid = 16 spatial tokens (in principle convnet-based autoencoders can do this too, but it is more natural and convenient with transformers). Self-attention cost scales quadratically with the number of spatial tokens but only linearly with their dimensionality, so this trade-off yields more compute-efficient models (see the back-of-the-envelope sketch after this list). It also opens the door to training models on massively larger images/videos.
- Current "first stage models" used for image/video compression are too complicated, e.g. using adversarial losses (among others). I'd like to simplify this process by showing simple plain autoencoders are performant as first stage models.