# SERE: Exploring Feature Self-relation for Self-supervised Transformer (TPAMI 2023)
The official codebase for SERE: Exploring Feature Self-relation for Self-supervised Transformer.
Learning representations with self-supervision for convolutional networks (CNNs) has been validated to be effective for vision tasks. As an alternative to CNNs, vision transformers (ViTs) have strong representation ability with spatial self-attention and channel-level feedforward networks. Recent works reveal that self-supervised learning helps unleash the great potential of ViT. Still, most works follow self-supervised strategies designed for CNNs, e.g., instance-level discrimination of samples, and ignore the properties of ViT. We observe that relational modeling on the spatial and channel dimensions distinguishes ViT from other networks. To enforce this property, we explore the feature SElf-RElation (SERE) for training self-supervised ViT. Specifically, instead of conducting self-supervised learning solely on feature embeddings from multiple views, we utilize the feature self-relations, i.e., spatial/channel self-relations, for self-supervised learning. Self-relation based learning further enhances the relation modeling ability of ViT, resulting in stronger representations that stably improve performance on multiple downstream tasks.
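The two self-relations can be sketched as normalized similarity matrices over the token and channel dimensions of ViT features. The minimal example below is our own illustration of the idea, not the paper's exact formulation; the function names and the softmax temperature `temp` are assumptions:

```python
import torch
import torch.nn.functional as F

def spatial_self_relation(feat, temp=0.1):
    """Relation between spatial tokens: (B, N, N) from features (B, N, C)."""
    q = F.normalize(feat, dim=-1)        # L2-normalize each token over channels
    rel = q @ q.transpose(1, 2) / temp   # cosine similarity between tokens
    return rel.softmax(dim=-1)           # each row: a distribution over tokens

def channel_self_relation(feat, temp=0.1):
    """Relation between channels: (B, C, C) from features (B, N, C)."""
    q = F.normalize(feat, dim=1)         # L2-normalize each channel over tokens
    rel = q.transpose(1, 2) @ q / temp   # cosine similarity between channels
    return rel.softmax(dim=-1)

# Dummy ViT features: batch 2, 196 patch tokens, 384 channels (ViT-S/16 width).
feat = torch.randn(2, 196, 384)
print(spatial_self_relation(feat).shape)  # torch.Size([2, 196, 196])
print(channel_self_relation(feat).shape)  # torch.Size([2, 384, 384])
```

Relations like these, computed from two augmented views, can then serve as self-supervised learning targets in place of raw feature embeddings.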
Please install PyTorch and download the ImageNet dataset. This codebase has been developed with Python 3.8, PyTorch 1.10.1, torchvision 0.11.2, and CUDA 11.3.
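For a matching environment, the historical CUDA 11.3 wheels can be installed as below; this is one possible setup, and the wheel index URL follows PyTorch's legacy hosting, so adjust it for your CUDA version:

```shell
# Install the exact versions this codebase was developed with (CUDA 11.3 builds).
pip install torch==1.10.1+cu113 torchvision==0.11.2+cu113 \
    -f https://download.pytorch.org/whl/cu113/torch_stable.html
```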
Architecture | Method | Parameters | Pre-training Epochs | Fine-tuning Epochs | Top-1 | Download
---|---|---|---|---|---|---
ViT-S/16 | iBOT+SERE | 21M | 100 | 100 | 81.5% | backbone
ViT-B/16 | iBOT+SERE | 85M | 100 | 100 | 83.7% | backbone
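The download links provide backbone-only weights. A typical PyTorch loading pattern is sketched below with a stand-in module and an in-memory buffer in place of the downloaded file; that the checkpoint stores a plain `state_dict` is our assumption, and the real ViT definition (e.g., from the iBOT/DINO model files) should be substituted:

```python
import io

import torch
import torch.nn as nn

# Stand-in for the actual ViT backbone; released checkpoints are loaded the
# same way once the real architecture is instantiated.
backbone = nn.Sequential(nn.Linear(384, 384), nn.GELU(), nn.Linear(384, 1000))

# Simulate a downloaded checkpoint file with an in-memory buffer.
buf = io.BytesIO()
torch.save(backbone.state_dict(), buf)
buf.seek(0)

state = torch.load(buf, map_location="cpu")
# strict=False tolerates projection-head keys that exist only at pre-training time.
missing, unexpected = backbone.load_state_dict(state, strict=False)
print(missing, unexpected)  # [] [] when every key matches
```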
iBOT+SERE with ViT-S/16:
```shell
python -m torch.distributed.launch --nproc_per_node=8 \
  --master_port=$PORT \
  main_sere.py \
  --arch vit_small \
  --output_dir $OUTPUT_DIR \
  --data_path $IMAGENET \
  --teacher_temp 0.07 \
  --warmup_teacher_temp_epochs 30 \
  --norm_last_layer false \
  --epochs 100 \
  --shared_head true \
  --out_dim 8192 \
  --local_crops_number 10 \
  --global_crops_scale 0.40 1 \
  --local_crops_scale 0.05 0.40 \
  --pred_ratio 0 0.3 \
  --pred_ratio_var 0 0.2 \
  --batch_size_per_gpu 128 \
  --num_workers 6 \
  --saveckp_freq 10 \
  --alpha 0.2 \
  --beta 0.5 \
  --clip_grad 0.3
```
iBOT+SERE with ViT-B/16:
```shell
python -m torch.distributed.launch --nproc_per_node=8 \
  --master_port=$PORT \
  main_sere.py \
  --arch vit_base \
  --output_dir $OUTPUT_DIR \
  --data_path $IMAGENET \
  --teacher_temp 0.07 \
  --teacher_patch_temp 0.07 \
  --warmup_teacher_temp 0.04 \
  --warmup_teacher_patch_temp 0.04 \
  --warmup_teacher_temp_epochs 50 \
  --norm_last_layer true \
  --warmup_epochs 10 \
  --epochs 100 \
  --lr 0.00075 \
  --min_lr 2e-6 \
  --weight_decay 0.04 \
  --weight_decay_end 0.4 \
  --shared_head true \
  --shared_head_teacher true \
  --out_dim 8192 \
  --patch_out_dim 8192 \
  --local_crops_number 10 \
  --global_crops_scale 0.32 1 \
  --local_crops_scale 0.05 0.32 \
  --pred_ratio 0 0.3 \
  --pred_ratio_var 0 0.2 \
  --pred_shape block \
  --batch_size_per_gpu 128 \
  --num_workers 6 \
  --saveckp_freq 10 \
  --freeze_last_layer 3 \
  --clip_grad 0.3 \
  --alpha 0.2 \
  --beta 0.5 \
  --use_fp16 true
```
We fully fine-tune the pre-trained models on ImageNet-1K using the codebase of MAE.
For downstream tasks, e.g., semantic segmentation, please refer to iBOT.
Additionally, we use ImageNetSegModel to implement semi-supervised semantic segmentation on the ImageNet-S dataset.
If you find this repository useful, please consider giving a star and a citation:
@article{li2023sere,
title={SERE: Exploring Feature Self-relation for Self-supervised Transformer},
author={Zhong-Yu Li and Shanghua Gao and Ming-Ming Cheng},
journal={TPAMI},
year={2023}
}
The code is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License for non-commercial use only. For any commercial use, please obtain formal permission first.
This repository is built using the DINO repository, the iBOT repository, and the MAE repository.