This is a simplified from-scratch PyTorch implementation of the Vision Transformer (ViT) with detailed steps (refer to `model.py`).
- The default network is a scaled-down version of the original ViT architecture from the ViT Paper.
- Has only 200k-800k parameters, depending on the embedding dimension (the original ViT-Base has 86 million).
- Tested on MNIST, FashionMNIST, SVHN, CIFAR10, and CIFAR100 datasets.
- Uses a smaller patch size of 4 (a minimal patch-embedding sketch is shown after this list).
- Can be scaled to larger datasets by increasing the model size (embedding dimension, depth) and the patch size.
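For reference, here is a minimal sketch of the patch-embedding step a ViT of this size typically uses. The class name `PatchEmbedding` and its constructor arguments are illustrative and may not match `model.py` exactly.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each to embed_dim.
    Illustrative sketch; names and arguments may differ from model.py."""
    def __init__(self, in_channels=1, patch_size=4, embed_dim=64):
        super().__init__()
        # A conv with kernel_size == stride == patch_size is equivalent to
        # flattening each patch and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, embed_dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)       # (B, num_patches, embed_dim)
        return x

# Example: a 28x28 MNIST image with patch size 4 -> 7*7 = 49 tokens of size 64
tokens = PatchEmbedding(1, 4, 64)(torch.randn(2, 1, 28, 28))
print(tokens.shape)  # torch.Size([2, 49, 64])
```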
Run commands (also available in `scripts.sh`):
| Dataset | Run command | Test Acc (%) |
|---|---|---|
| MNIST | `python main.py --dataset mnist --epochs 100` | 99.5 |
| Fashion MNIST | `python main.py --dataset fmnist` | 92.3 |
| SVHN | `python main.py --dataset svhn --n_channels 3 --image_size 32 --embed_dim 128` | 96.2 |
| CIFAR10 | `python main.py --dataset cifar10 --n_channels 3 --image_size 32 --embed_dim 128` | 86.3 (82.5 w/o RandAug) |
| CIFAR100 | `python main.py --dataset cifar100 --n_channels 3 --image_size 32 --embed_dim 128` | 59.6 (55.8 w/o RandAug) |
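The gap between the CIFAR numbers with and without RandAug comes from RandAugment during training. Below is a minimal sketch of a CIFAR training transform with torchvision's `RandAugment`; the exact augmentation parameters and normalization statistics used by `main.py` may differ.

```python
import torchvision.transforms as T

# Illustrative CIFAR training pipeline; num_ops, magnitude, and the crop
# padding are assumptions, not necessarily what main.py uses.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.RandAugment(num_ops=2, magnitude=9),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
```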
| Config | MNIST and FMNIST | SVHN and CIFAR |
|---|---|---|
| Input Size | 1 × 28 × 28 | 3 × 32 × 32 |
| Patch Size | 4 | 4 |
| Sequence Length | 7 × 7 = 49 | 8 × 8 = 64 |
| Embedding Size | 64 | 128 |
| Parameters | 210k | 820k |
| Number of Layers | 6 | 6 |
| Number of Heads | 4 | 4 |
| Forward Multiplier | 2 | 2 |
| Dropout | 0.1 | 0.1 |
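As a quick sanity check for the sequence lengths in the table: splitting an image into non-overlapping patches yields (image_size / patch_size)² tokens. The helper below is purely illustrative.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens when an image is split into non-overlapping patches."""
    assert image_size % patch_size == 0, "image size must be divisible by patch size"
    return (image_size // patch_size) ** 2

print(num_patch_tokens(28, 4))  # 49  (MNIST / FashionMNIST)
print(num_patch_tokens(32, 4))  # 64  (SVHN / CIFAR)
```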