Implementation of HAT https://arxiv.org/pdf/2204.00993
@inproceedings{bai2022improving,
title={Improving Vision Transformers by Revisiting High-frequency Components},
author={Bai, Jiawang and Yuan, Li and Xia, Shu-Tao and Yan, Shuicheng and Li, Zhifeng and Liu, Wei},
booktitle={European Conference on Computer Vision},
year={2022}
}
torch>=1.7.0
torchvision>=0.8.0
timm==0.4.5
tlt==0.1.0
pyyaml
apex-amp
We use the ImageNet-1K training and validation datasets by default. Please save them in [your_imagenet_path].
To train ViT models with HAT on 8 GPUs using the default settings from our paper:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 \
--data_dir [your_imagenet_path] \
--model [your_vit_model_name] \
--adv-epochs 200 \
--adv-iters 3 \
--adv-eps 0.00784314 \
--adv-kl-weight 0.01 \
--adv-ce-weight 3.0 \
--output [your_output_path] \
and_other_parameters_specified_for_your_vit_models...
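
The value passed to `--adv-eps`, 0.00784314, matches 2/255 to the printed precision. This suggests the perturbation budget is expressed for inputs scaled to [0, 1], but that reading is an assumption and is not stated in this README. The short check below illustrates the arithmetic; the helper name `eps_from_pixels` is hypothetical and not part of this repository.

```python
def eps_from_pixels(levels: float, max_value: float = 255.0) -> float:
    """Convert a budget given in 8-bit pixel levels to the [0, 1] scale.

    Assumption: inputs are scaled to [0, 1]; this README does not confirm it.
    """
    return levels / max_value


ADV_EPS = 0.00784314  # value passed to --adv-eps in the training command

eps = eps_from_pixels(2)
print(f"{eps:.8f}")                # 2/255 printed to eight decimal places
print(abs(eps - ADV_EPS) < 1e-8)   # True if the flag value is 2/255 after rounding
```

If you want a different budget, the same conversion gives the value to pass on the command line, for example `eps_from_pixels(1)` for one pixel level.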
For instance, we train Swin-T with the following command:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 \
--data_dir [your_imagenet_path] \
--model swin_tiny_patch4_window7_224 \
--adv-epochs 200 \
--adv-iters 3 \
--adv-eps 0.00784314 \
--adv-kl-weight 0.01 \
--adv-ce-weight 3.0 \
--output [your_output_path] \
--batch-size 256 \
--drop-path 0.2 \
--lr 1e-3 \
--weight-decay 0.05 \
--clip-grad 1.0
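
The two commands above share the same HAT flags and differ only in the model name and the model-specific hyper-parameters. As a convenience, the sketch below assembles such a command as an argument list. It uses only the flags that appear in this README; the function name `build_train_command` and the example paths are hypothetical, and setting `CUDA_VISIBLE_DEVICES` is left to the caller.

```python
import shlex

# HAT flags and values copied from the training commands in this README.
HAT_FLAGS = {
    "--adv-epochs": "200",
    "--adv-iters": "3",
    "--adv-eps": "0.00784314",
    "--adv-kl-weight": "0.01",
    "--adv-ce-weight": "3.0",
}


def build_train_command(model, data_dir, output, num_gpus=8, extra=None):
    """Return a distributed_train.sh invocation as a list of arguments.

    `extra` holds the model-specific flags (batch size, learning rate, ...).
    """
    cmd = ["./distributed_train.sh", str(num_gpus),
           "--data_dir", data_dir, "--model", model]
    for flag, value in HAT_FLAGS.items():
        cmd += [flag, value]
    cmd += ["--output", output]
    for flag, value in (extra or {}).items():
        cmd += [flag, str(value)]
    return cmd


# Model-specific flags from the Swin-T example.
swin_t_extra = {
    "--batch-size": 256,
    "--drop-path": 0.2,
    "--lr": "1e-3",
    "--weight-decay": 0.05,
    "--clip-grad": 1.0,
}

cmd = build_train_command(
    "swin_tiny_patch4_window7_224",
    "/path/to/imagenet",   # placeholder for [your_imagenet_path]
    "/path/to/output",     # placeholder for [your_output_path]
    extra=swin_t_extra,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root; keeping it as a list avoids shell quoting issues with paths that contain spaces.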
For training variants of ViT, Swin Transformer, and VOLO, we use the hyper-parameters from [3], [4], and [2], respectively.
We also combine HAT with knowledge distillation in [5], using train_kd.py.
After training, we can use validate.py to evaluate the ViT model trained with HAT.
For instance, we evaluate Swin-T with the following command:
python3 -u validate.py \
--data_dir [your_imagenet_path] \
--model swin_tiny_patch4_window7_224 \
--checkpoint [your_checkpoint_path] \
--batch-size 128 \
--num-gpu 8 \
--apex-amp \
--results-file [your_results_file_path]
Model | Params | FLOPs | Test Size | Top-1 | +HAT Top-1 | Download |
---|---|---|---|---|---|---|
ViT-T | 5.7M | 1.6G | 224 | 72.2 | 73.3 | link |
ViT-S | 22.1M | 4.7G | 224 | 80.1 | 80.9 | link |
ViT-B | 86.6M | 17.6G | 224 | 82.0 | 83.2 | link |
Swin-T | 28.3M | 4.5G | 224 | 81.2 | 82.0 | link |
Swin-S | 49.6M | 8.7G | 224 | 83.0 | 83.3 | link |
Swin-B | 87.8M | 15.4G | 224 | 83.5 | 84.0 | link |
VOLO-D1 | 26.6M | 6.8G | 224 | 84.2 | 84.5 | link |
VOLO-D1 | 26.6M | 22.8G | 384 | 85.2 | 85.5 | link |
VOLO-D5 | 295.5M | 69.0G | 224 | 86.1 | 86.3 | link |
VOLO-D5 | 295.5M | 304G | 448 | 87.0 | 87.2 | link |
VOLO-D5 | 295.5M | 412G | 512 | 87.1 | 87.3 | link |
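
To make the comparison in the table easier to read, the following sketch computes the Top-1 gain of each HAT-trained model over its baseline. All numbers are copied from the table above; the variable names are illustrative only.

```python
# (model, test size) -> (baseline Top-1, +HAT Top-1), copied from the table.
RESULTS = {
    ("ViT-T", 224): (72.2, 73.3),
    ("ViT-S", 224): (80.1, 80.9),
    ("ViT-B", 224): (82.0, 83.2),
    ("Swin-T", 224): (81.2, 82.0),
    ("Swin-S", 224): (83.0, 83.3),
    ("Swin-B", 224): (83.5, 84.0),
    ("VOLO-D1", 224): (84.2, 84.5),
    ("VOLO-D1", 384): (85.2, 85.5),
    ("VOLO-D5", 224): (86.1, 86.3),
    ("VOLO-D5", 448): (87.0, 87.2),
    ("VOLO-D5", 512): (87.1, 87.3),
}

# Round to one decimal to match the precision of the reported accuracies.
gains = {key: round(hat - base, 1) for key, (base, hat) in RESULTS.items()}

for (model, size), gain in gains.items():
    print(f"{model} @ {size}: +{gain:.1f}")

best = max(gains, key=gains.get)
mean_gain = sum(gains.values()) / len(gains)
print(f"largest gain: {best[0]} @ {best[1]}")
print(f"mean gain: +{mean_gain:.2f}")
```

In these numbers the gains are largest for the plain ViT models and smallest for the VOLO-D5 variants.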
The result of combining HAT with knowledge distillation in [5] is 84.3% for ViT-B, and it can be downloaded here.
We first pretrain Swin-T/S/B on the ImageNet-1K dataset with our proposed HAT, and then transfer the models to downstream tasks, including object detection, instance segmentation, and semantic segmentation.
We use the code from Swin Transformer for Object Detection and Swin Transformer for Semantic Segmentation, and follow their configurations.
Backbone | Params | FLOPs | Config | AP_box | +HAT AP_box | AP_mask | +HAT AP_mask |
---|---|---|---|---|---|---|---|
Swin-T | 86M | 745G | config | 50.5 | 50.9 | 43.7 | 43.9 |
Swin-S | 107M | 838G | config | 51.8 | 52.5 | 44.7 | 45.4 |
Swin-B | 145M | 982G | config | 51.9 | 52.8 | 45.0 | 45.6 |
Backbone | Params | FLOPs | Config | mIoU(MS) | +HAT mIoU(MS) |
---|---|---|---|---|---|
Swin-T | 60M | 945G | config | 46.1 | 46.7 |
Swin-S | 81M | 1038G | config | 49.5 | 49.7 |
Swin-B | 121M | 1088G | config | 49.7 | 50.3 |
[1] Wightman, R. PyTorch image models. https://github.com/rwightman/pytorch-image-models, 2019.
[2] Yuan, L. et al. Volo: Vision outlooker for visual recognition. arXiv, 2021.
[3] Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
[4] Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows. ICCV, 2021.
[5] Touvron, H. et al. Training data-efficient image transformers & distillation through attention. ICML, 2021.