This assignment trains a model to classify the images of CIFAR-10. All models in this project were built with PyTorch.
Please refer to the following report link for a detailed description of the experimental results.
Operating System: Ubuntu 20.04.3 LTS
CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
I use Anaconda and pip to build the execution environment.
Either of the following two options can be used to build the execution environment.
Option 1: create the environment from environment.yml
conda env create -f environment.yml
Option 2: create the environment manually
conda create --name cifar python=3.8
conda activate cifar
conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch
conda install matplotlib pandas scikit-learn -y
pip install tqdm
For training and testing, place the files and folders as shown in the directory tree below.
The model weights can be downloaded from the following link; put them in the checkpoint directory.
The data can be downloaded from the following link; place the files in the repository as shown in the directory tree below.
├─ environment.yml
├─ history_csv
├─ checkpoint
├─ x_train.npy
├─ x_test.npy
├─ y_train.npy
├─ y_test.npy
image_size = 224
number_worker = 4
batch_size = 64
epochs = 10
lr = 2e-5
optimizer = AdamW
loss function = CrossEntropy
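As a rough illustration, the settings above might be wired into PyTorch as follows. This is a minimal sketch, not the project's training script: the `nn.Linear` model is a hypothetical stand-in for the real classifier, and `build_training_objects` is an illustrative helper name.

```python
import torch
import torch.nn as nn

# Hyperparameters copied from the list above
image_size = 224
number_worker = 4
batch_size = 64
epochs = 10
lr = 2e-5


def build_training_objects(model):
    """Return the optimizer and loss function named in the settings (hypothetical helper)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    return optimizer, criterion


# Stand-in model (assumption): a single linear layer instead of the real classifier
model = nn.Linear(8, 10)
optimizer, criterion = build_training_objects(model)

# One optimization step on random data to show how the pieces fit together
inputs = torch.randn(batch_size, 8)
targets = torch.randint(0, 10, (batch_size,))
loss = criterion(model(inputs), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In a real run, the step at the bottom would sit inside a loop over `epochs` and over the batches produced by the data loader.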
Data preprocessing consists of two parts. The first scales the pixel values from [0, 255] to [0, 1]. The second resizes each image to 224 x 224.
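The two preprocessing steps can be sketched as below. This is a minimal illustration under assumptions: the input is taken to be a `uint8` array of shape `(N, 32, 32, 3)`, bilinear interpolation is assumed for the resize (the method is not stated here), and `preprocess` is a hypothetical helper name.

```python
import numpy as np
import torch
import torch.nn.functional as F


def preprocess(images, image_size=224):
    """Scale uint8 pixels to [0, 1] and resize to image_size x image_size."""
    # (N, H, W, C) uint8 -> (N, C, H, W) float32
    x = torch.from_numpy(images).permute(0, 3, 1, 2).contiguous().float()
    # Part 1: map [0, 255] to [0, 1]
    x = x / 255.0
    # Part 2: resize (bilinear interpolation is an assumption)
    x = F.interpolate(x, size=(image_size, image_size),
                      mode="bilinear", align_corners=False)
    return x


# Random placeholder images shaped like CIFAR-10 samples
dummy = np.random.randint(0, 256, size=(2, 32, 32, 3), dtype=np.uint8)
out = preprocess(dummy)
print(out.shape)
```

Resizing per batch rather than for the whole array at once keeps memory use low, since a 224 x 224 float image is far larger than the 32 x 32 original.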
To avoid CUDA out-of-memory errors, I created a data loader that feeds the data to the model in batches.
- Input: image array, label array, data augmentation method.
- Output: DataLoader
from torch.utils import data


class CIFARLoader(data.Dataset):
    def __init__(self, image, label, transform=None):
        # image: array of images; label: array of class indices
        self.img_name, self.labels = image, label
        self.transform = transform
        print("> Found %d images..." % (len(self.img_name)))

    def __len__(self):
        """Return the size of the dataset."""
        return len(self.img_name)

    def __getitem__(self, index):
        """Return the (optionally transformed) image and its label."""
        # Use local variables so per-sample state is not stored on the dataset
        img = self.img_name[index]
        label = self.labels[index]
        if self.transform:
            img = self.transform(img)
        return img, label
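For illustration, a dataset like the one above could be wrapped in a `torch.utils.data.DataLoader` so that only one batch needs to be on the GPU at a time. This is a sketch under assumptions: `ArrayDataset` is a hypothetical stand-in that mirrors the class above, `to_tensor` is an illustrative transform, the arrays are random placeholders rather than the real `.npy` files, and `num_workers=0` is used only to keep the example portable (the settings list `number_worker = 4`).

```python
import numpy as np
import torch
from torch.utils import data


class ArrayDataset(data.Dataset):
    """Hypothetical stand-in for the dataset class: serves (image, label) pairs."""

    def __init__(self, image, label, transform=None):
        self.image, self.label = image, label
        self.transform = transform

    def __len__(self):
        return len(self.image)

    def __getitem__(self, index):
        img = self.image[index]
        if self.transform:
            img = self.transform(img)
        return img, self.label[index]


def to_tensor(img):
    """Illustrative transform: HWC uint8 array -> CHW float tensor in [0, 1]."""
    return torch.from_numpy(img).permute(2, 0, 1).float() / 255.0


# Random placeholder data shaped like CIFAR-10 images and labels
x = np.random.randint(0, 256, size=(100, 32, 32, 3), dtype=np.uint8)
y = np.random.randint(0, 10, size=(100,))

loader = data.DataLoader(ArrayDataset(x, y, transform=to_tensor),
                         batch_size=64, shuffle=True, num_workers=0)

for images, labels in loader:
    # Each iteration yields one batch; during training it would be moved to the GPU here
    print(images.shape, labels.shape)
```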
I used a pretrained Vision Transformer (ViT) [1] to classify the images.
I added a linear layer on top of the ViT, and all of the ViT weights are unfrozen, so the whole network is fine-tuned.
The architecture of the classification model is as follows.
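import torch.nn as nn
from torchvision import models

num_classes = 10  # CIFAR-10 has 10 classes


class VIT(nn.Module):
    def __init__(self, pretrained=True):
        super(VIT, self).__init__()
        # ViT-B/32 backbone; newer torchvision releases prefer the `weights` argument
        self.model = models.vit_b_32(pretrained=pretrained)
        # Map the backbone's 1000 ImageNet logits to the CIFAR-10 classes
        self.classify = nn.Linear(1000, num_classes)

    def forward(self, x):
        x = self.model(x)
        x = self.classify(x)
        return x


model = VIT()
# Unfreeze every parameter so the whole network is fine-tuned
for name, child in model.named_children():
    for param in child.parameters():
        param.requires_grad = True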
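To check the head wiring without downloading pretrained weights, the following sketch replaces the ViT backbone with a hypothetical stand-in that also emits 1000 logits. `TinyBackbone`, `Classifier`, and `count_trainable` are illustrative names and are not part of the project.

```python
import torch
import torch.nn as nn

num_classes = 10  # CIFAR-10


class TinyBackbone(nn.Module):
    """Hypothetical stand-in for the ViT: maps an image to 1000 logits."""

    def __init__(self):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(3, 1000)

    def forward(self, x):
        return self.fc(self.pool(x).flatten(1))


class Classifier(nn.Module):
    """Same wiring as the model above: backbone logits -> linear head."""

    def __init__(self, backbone):
        super().__init__()
        self.model = backbone
        self.classify = nn.Linear(1000, num_classes)

    def forward(self, x):
        return self.classify(self.model(x))


def count_trainable(module):
    """Number of parameters with requires_grad=True."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


net = Classifier(TinyBackbone())
logits = net(torch.randn(2, 3, 224, 224))
print(logits.shape)          # one score per class for each image
print(count_trainable(net))  # parameters are trainable by default
```

Because PyTorch parameters have `requires_grad=True` by default, the explicit unfreezing loop mainly documents the intent to fine-tune the whole network.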
Run the following command to switch to training mode and start training the classification model.
python --mode train
The best model weights found during training are stored in the checkpoint directory, and the training history is saved in the history_csv directory.
The training accuracy history is as follows.
The training loss history is as follows.
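One common way to get this behavior is to save the weights whenever the accuracy improves and to write the per-epoch metrics to a CSV file. The sketch below is an assumption about how that could look, not the project's actual code: the file names `best_model.pt` and `history.csv`, the helper `save_if_best`, and the accuracy values are all hypothetical, and a temporary directory stands in for the checkpoint and history_csv directories.

```python
import csv
import os
import tempfile

import torch
import torch.nn as nn


def save_if_best(model, accuracy, best_accuracy, path):
    """Save the model weights when accuracy improves; return the new best accuracy."""
    if accuracy > best_accuracy:
        torch.save(model.state_dict(), path)
        return accuracy
    return best_accuracy


model = nn.Linear(4, 2)  # stand-in model (assumption)
out_dir = tempfile.mkdtemp()
ckpt_path = os.path.join(out_dir, "best_model.pt")
csv_path = os.path.join(out_dir, "history.csv")

# Made-up accuracies, used only to exercise the logic
fake_accuracies = [0.50, 0.72, 0.68]
best = 0.0
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["epoch", "accuracy"])
    for epoch, acc in enumerate(fake_accuracies, start=1):
        best = save_if_best(model, acc, best, ckpt_path)
        writer.writerow([epoch, acc])

print(best)
```

The third epoch does not overwrite the checkpoint because its accuracy is lower than the best value seen so far.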
Run the following command to switch to testing mode and evaluate the classification result.
Best model weight file: BEST_VIT_CIFAR.rar (in the checkpoint directory)
python --mode test
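A `--mode` switch like the one in the commands above is commonly implemented with `argparse`. The following is a minimal sketch under assumptions: the functions `train` and `test` are hypothetical placeholders, and the project's actual entry script may parse its arguments differently.

```python
import argparse


def train():
    """Placeholder for the training routine (hypothetical)."""
    return "training"


def test():
    """Placeholder for the evaluation routine (hypothetical)."""
    return "testing"


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="CIFAR-10 classification")
    parser.add_argument("--mode", choices=["train", "test"], default="train",
                        help="whether to train the model or evaluate it")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Dispatch to the routine selected on the command line
    return train() if args.mode == "train" else test()


if __name__ == "__main__":
    # An explicit argument list is used for the demo; pass None to read sys.argv
    print(main(["--mode", "test"]))
```

Restricting `--mode` with `choices` makes a mistyped mode fail immediately with a usage message instead of silently falling through to a default.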
[1] A. Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” arXiv, arXiv:2010.11929, Jun. 2021. doi: 10.48550/arXiv.2010.11929.