MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer
Official implementation of MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer.
- (Sep 6, 2024) We released the implementation and scripts of MADTP. (Note that checkpoints and logs will come soon.) [Code] 🚩
- (Feb 27, 2024) MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer was accepted by CVPR 2024. [Paper] [ArXiv] 🎉
The code is tested on `Pytorch==1.11.0`, `cuda==11.3.1`, and `python==3.8.13`. The dependencies can be installed by:

```bash
conda env create -f environment.yml
```
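After activating the environment (its name is defined in `environment.yml`), a quick check of the installed versions can catch CUDA mismatches early. The snippet below is a minimal sanity check and is not part of the repository:

```python
# Minimal environment sanity check (illustrative; run inside the activated conda env).
import sys
import torch

print(f"python:  {sys.version.split()[0]}")        # expected: 3.8.13
print(f"pytorch: {torch.__version__}")             # expected: 1.11.0
print(f"cuda:    {torch.version.cuda}")            # expected: 11.3
print(f"cuda available: {torch.cuda.is_available()}")
print(f"visible gpus:   {torch.cuda.device_count()}")
```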
| Type | Supported Tasks | Supported Models | Supported Datasets |
| --- | --- | --- | --- |
| Multi-modal | Visual Reasoning | BLIP (instructions) | NLVR2 |
| Multi-modal | Image Caption | BLIP (instructions) | COCO Caption |
| Multi-modal | Visual Question Answer | BLIP (instructions) | VQAv2 |
| Multi-modal | Image-Text Retrieval | CLIP (instructions), BLIP (instructions) | COCO, Flickr30k |
| Multi-modal | Text-Image Retrieval | CLIP (instructions), BLIP (instructions) | COCO, Flickr30k |
- Dataset & Annotation

  Download the NLVR2 dataset, unzip it under the `datasets` folder, and accordingly modify the `image_root` in config. Download all-in-one annotations (including annotations for Visual Reasoning, Image Caption, VQA, Image-Text Retrieval, and Text-Image Retrieval tasks) from this link, unzip it under the `annotation` folder, and accordingly modify the `annotation` in config. See here for expected folder structures. A small pre-flight check for these paths is sketched after the Resources table below.
- Evaluation

  Download compressed checkpoints from the table below, put them under the `output` folder, and accordingly modify the `--pretrained` of the scripts. For example, to evaluate a compressed model with 0.5 reduce ratio:

  ```bash
  python -m torch.distributed.run --nproc_per_node=8 compress_nlvr_dtp.py --evaluate \
  --pretrained output/nlvr_nlvr2_compression_p0.5/model_base_nlvr_nlvr2_p0.5_compressed.pth \
  --config ./configs/nlvr.yaml \
  --output_dir output/nlvr_nlvr2_compression_p0.5
  ```
- Compression

  Download the uncompressed model from the table below, put it under the `pretrained` folder, and accordingly modify the `pretrained` in config. For example, to conduct a compression at 0.5 reduce ratio on 8 A100 GPUs (80G):

  ```bash
  python -m torch.distributed.run --nproc_per_node=8 compress_nlvr_dtp.py --p 0.5 --epoch 15 \
  --pretrained pretrained/model_base_nlvr.pth \
  --config ./configs/nlvr.yaml \
  --output_dir output/nlvr_nlvr2_compression_p0.5
  ```
- Resources

  | Reduction | Uncompressed Model | Compression Script | Training Log | Compressed Checkpoint | Evaluation Script |
  | --- | --- | --- | --- | --- | --- |
  | 0.3 | Download | Link | Download | Download | Link |
  | 0.5 | Download | Link | Download | Download | Link |
  | 0.6 | Download | Link | Download | Download | Link |
  | 0.7 | Download | Link | Download | Download | Link |
  | 0.8 | Download | Link | Download | Download | Link |
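Before launching a multi-GPU run, it can save time to confirm that the paths referenced by the config actually exist. The sketch below is illustrative rather than part of the repository: it assumes PyYAML is available and that the config exposes an `image_root` key as described above; adjust the key names and paths to your own setup.

```python
# Illustrative pre-flight check for the NLVR2 setup (not a script shipped with this repo).
# Assumes PyYAML is installed; the "image_root" key follows the wording above and may
# need adjusting if the actual YAML file names it differently.
from pathlib import Path
import yaml

config = yaml.safe_load(Path("./configs/nlvr.yaml").read_text())

checks = {
    "image_root (from config)": Path(str(config.get("image_root", "datasets/vision/NLVR2"))),
    "annotation folder": Path("annotation"),
    "pretrained folder": Path("pretrained"),
}
for name, path in checks.items():
    print(f"{name}: {path} [{'OK' if path.exists() else 'MISSING'}]")
```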
- Dataset & Annotation

  Download the COCO Caption dataset, unzip it under the `datasets` folder, and accordingly modify the `image_root` in config. Download all-in-one annotations from this link, unzip it under the `annotation` folder, and accordingly modify the `annotation` in config. See here for expected folder structures.
- Evaluation

  Download compressed checkpoints from the table below, put them under the `output` folder, and accordingly modify the `--pretrained` of the scripts (a quick way to sanity-check a downloaded checkpoint is sketched after this list). For example, to evaluate a compressed model with 0.5 reduce ratio:

  ```bash
  python -m torch.distributed.run --nproc_per_node=8 compress_caption_dtp.py --evaluate \
  --pretrained output/caption_coco_compression_p0.5/model_base_caption_capfilt_large_coco_p0.5_compressed.pth \
  --config ./configs/caption_coco.yaml \
  --output_dir output/caption_coco_compression_p0.5
  ```
- Compression

  Download the uncompressed model from the table below, put it under the `pretrained` folder, and accordingly modify the `pretrained` in config. For example, to conduct a compression at 0.5 reduce ratio on 8 A100 GPUs (80G):

  ```bash
  python -m torch.distributed.run --nproc_per_node=8 compress_caption_dtp.py --p 0.5 --epoch 5 \
  --pretrained pretrained/model_base_caption_capfilt_large.pth \
  --config ./configs/caption_coco.yaml \
  --output_dir output/caption_coco_compression_p0.5
  ```
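If a downloaded checkpoint fails to load or evaluates poorly, a quick offline inspection helps rule out a corrupted or mismatched file. The snippet below is a generic PyTorch sketch, not part of the repository; the assumption that the checkpoint wraps its weights in a `model` entry may not hold for every file, so it falls back to treating the whole file as a state dict.

```python
# Generic checkpoint inspection (illustrative only).
import torch

# Example path from the evaluation command above; change it to the checkpoint you downloaded.
path = "output/caption_coco_compression_p0.5/model_base_caption_capfilt_large_coco_p0.5_compressed.pth"
ckpt = torch.load(path, map_location="cpu")

# Some checkpoints store the weights under a "model" key; otherwise use the file as-is.
state_dict = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt

tensors = {k: v for k, v in state_dict.items() if torch.is_tensor(v)}
total_params = sum(v.numel() for v in tensors.values())
print(f"{len(tensors)} tensors, {total_params / 1e6:.1f}M parameters")
```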
- Dataset & Annotation

  Download the VQAv2 dataset and Visual Genome dataset, unzip them under the `datasets` folder, and accordingly modify the `image_root` in config. Download all-in-one annotations from this link, unzip it under the `annotation` folder, and accordingly modify the `annotation` in config. See here for expected folder structures.
- Evaluation

  Download compressed checkpoints from the table below, put them under the `output` folder, and accordingly modify the `--pretrained` of the scripts. For example, to evaluate a compressed model with 0.5 reduce ratio (note that the script will generate the answer file `vqa_result.json`, which should be submitted to the official server to obtain evaluation results; a small snippet for inspecting this file is sketched after this list):

  ```bash
  python -m torch.distributed.run --nproc_per_node=8 compress_vqa_dtp.py --evaluate \
  --pretrained output/vqa_vqa2_compression_p0.5/model_base_vqa_capfilt_large_vqa2_p0.5_compressed.pth \
  --config ./configs/vqa.yaml \
  --output_dir output/vqa_vqa2_compression_p0.5
  ```
- Compression

  Download the uncompressed model from the table below, put it under the `pretrained` folder, and accordingly modify the `pretrained` in config. For example, to conduct a compression at 0.5 reduce ratio on 8 A100 GPUs (80G):

  ```bash
  python -m torch.distributed.run --nproc_per_node=8 compress_vqa_dtp.py --p 0.5 --epoch 3 \
  --pretrained pretrained/model_base_vqa_capfilt_large.pth \
  --config ./configs/vqa.yaml \
  --output_dir output/vqa_vqa2_compression_p0.5
  ```
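Before submitting to the official server, it is worth a quick look at the generated answer file. The snippet below is illustrative only; since the exact location of `vqa_result.json` inside the output directory depends on the script, it simply searches for the file.

```python
# Illustrative check of the generated VQA answers before submission (not part of the repo).
import json
from pathlib import Path

# Search the output directory used above, since the exact subfolder may vary.
candidates = sorted(Path("output/vqa_vqa2_compression_p0.5").rglob("vqa_result.json"))
assert candidates, "vqa_result.json not found; check the output directory"

results = json.loads(candidates[0].read_text())
print(f"{len(results)} answers; first entries: {results[:2]}")
```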
- Dataset & Annotation

  Download the COCO dataset, unzip it under the `datasets` folder, and accordingly modify the `image_root` in config. Download all-in-one annotations from this link, unzip it under the `annotation` folder, and accordingly modify the `annotation` in config. See here for expected folder structures.
- Evaluation

  Download compressed checkpoints from the table below, put them under the `output` folder, and accordingly modify the `--pretrained` of the scripts. For example, to evaluate a compressed model with 0.5 reduce ratio:

  ```bash
  python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_dtp.py --evaluate \
  --pretrained output/retrieval_coco_compression_p0.5/model_base_retrieval_coco_p0.5_compressed.pth \
  --config ./configs/retrieval_coco.yaml \
  --output_dir output/retrieval_coco_compression_p0.5
  ```
- Compression

  Download the uncompressed model from the table below, put it under the `pretrained` folder, and accordingly modify the `pretrained` in config. For example, to conduct a compression at 0.5 reduce ratio on 8 A100 GPUs (80G):

  ```bash
  python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_dtp.py --p 0.5 --epoch 5 \
  --pretrained pretrained/model_base_retrieval_coco.pth \
  --config ./configs/retrieval_coco.yaml \
  --output_dir output/retrieval_coco_compression_p0.5
  ```
- Dataset & Annotation

  Download the Flickr30k dataset, unzip it under the `datasets` folder, and accordingly modify the `image_root` in config. Download all-in-one annotations from this link, unzip it under the `annotation` folder, and accordingly modify the `annotation` in config. See here for expected folder structures.
- Evaluation

  Download compressed checkpoints from the table below, put them under the `output` folder, and accordingly modify the `--pretrained` of the scripts. For example, to evaluate a compressed model with 0.5 reduce ratio:

  ```bash
  python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_flickr_dtp.py --evaluate \
  --pretrained output/retrieval_flickr_compression_2x/model_base_retrieval_flickr_2x_compressed.pth \
  --config ./configs/retrieval_flickr.yaml \
  --output_dir output/retrieval_flickr_compression_2x
  ```
- Compression

  Download the uncompressed model from the table below, put it under the `pretrained` folder, and accordingly modify the `pretrained` in config. For example, to conduct a compression at 0.5 reduce ratio on 8 A100 GPUs (80G):

  ```bash
  python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_flickr_dtp.py --p 0.5 --epoch 10 \
  --pretrained pretrained/model_base_retrieval_flickr.pth \
  --config ./configs/retrieval_flickr.yaml \
  --output_dir output/retrieval_flickr_compression_p0.5
  ```
- Dataset & Annotation

  Download the COCO dataset, unzip it under the `datasets` folder, and accordingly modify the `image_root` in config. Download all-in-one annotations from this link, unzip it under the `annotation` folder, and accordingly modify the `annotation` in config. See here for expected folder structures.
- Evaluation

  Download compressed checkpoints from the table below, put them under the `output` folder, and accordingly modify the `--pretrained` of the scripts. For example, to evaluate a compressed model with 0.5 reduce ratio:

  ```bash
  python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_clip_dtp.py --evaluate \
  --pretrained output/retrieval_coco_clip_compression_p0.5/clip_large_retrieval_coco_p0.5_compressed.pth \
  --config ./configs/retrieval_coco_clip.yaml \
  --output_dir output/retrieval_coco_clip_compression_p0.5
  ```
- Compression

  Download the uncompressed model from the table below, put it under the `pretrained` folder, and accordingly modify the `pretrained` in config. For example, to conduct a compression at 0.5 reduce ratio on 8 A100 GPUs (80G):

  ```bash
  python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_clip_dtp.py --p 0.5 --epoch 5 \
  --pretrained pretrained/clip_large_retrieval_coco.pth \
  --config ./configs/retrieval_coco_clip.yaml \
  --output_dir output/retrieval_coco_clip_compression_p0.5
  ```
- Dataset & Annotation

  Download the Flickr30k dataset, unzip it under the `datasets` folder, and accordingly modify the `image_root` in config. Download all-in-one annotations from this link, unzip it under the `annotation` folder, and accordingly modify the `annotation` in config. See here for expected folder structures.
- Evaluation

  Download compressed checkpoints from the table below, put them under the `output` folder, and accordingly modify the `--pretrained` of the scripts. For example, to evaluate a compressed model with 0.5 reduce ratio:

  ```bash
  python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_clip_dtp.py --evaluate \
  --pretrained output/retrieval_flickr_clip_compression_p0.5/checkpoint_best.pth \
  --config ./configs/retrieval_flickr_clip.yaml \
  --output_dir output/retrieval_flickr_clip_compression_p0.5
  ```
- Compression

  Download the uncompressed model from the table below, put it under the `pretrained` folder, and accordingly modify the `pretrained` in config. For example, to conduct a compression at 0.5 reduce ratio on 8 A100 GPUs (80G):

  ```bash
  python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_clip_dtp.py --p 0.5 --epoch 10 \
  --pretrained pretrained/clip_large_retrieval_flickr.pth \
  --config ./configs/retrieval_flickr_clip.yaml \
  --output_dir output/retrieval_flickr_clip_compression_p0.5
  ```
- For BLIP and CLIP models, to evaluate the 2x-compressed BLIP model on the NLVR2 dataset as an example:

  ```bash
  python compress_nlvr_dtp.py --evaluate \
  --pretrained output/nlvr_nlvr2_compression_p0.5/checkpoint_best.pth \
  --config ./configs/nlvr.yaml \
  --output_dir output/nlvr_nlvr2_compression_p0.5
  ```
- For BLIP and CLIP models, to compress the BLIP model to half on the NLVR2 dataset as an example:

  ```bash
  python compress_nlvr_dtp.py --p 0.5 --epoch 15 \
  --pretrained pretrained/model_base_nlvr.pth \
  --config ./configs/nlvr.yaml \
  --output_dir output/nlvr_nlvr2_compression_p0.5
  ```
If you have any other questions, you can post them on the Issues page.
The expected folder structure is:

```
├── annotation
│   ├── answer_list.json
│   ├── coco_gt
│   │   ├── coco_karpathy_test_gt.json
│   │   └── coco_karpathy_val_gt.json
│   ├── ...
├── clip
├── compress_caption_dtp.py
├── compress_nlvr_dtp.py
├── compress ...
├── configs
├── data
├── datasets
│   └── vision
│       ├── coco
│       ├── flickr
│       ├── NLVR2
│       ├── ...
├── log
├── models
├── output
├── pretrained
│   ├── bert-base-uncased
│   ├── clip_large_retrieval_coco.pth
│   ├── clip_large_retrieval_flickr.pth
│   ├── ...
├──
├── transform
└── utils.py
```
This code is built upon BLIP, CLIP, UPop, and timm. We thank the original authors for their open-source work.
If you find this work useful, please consider citing the corresponding paper:
```bibtex
@inproceedings{cao2024madtp,
  title={MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer},
  author={Cao, Jianjian and Ye, Peng and Li, Shengze and Yu, Chong and Tang, Yansong and Lu, Jiwen and Chen, Tao},
  booktitle={IEEE Conference on Computer Vision and Pattern Recognition},
  year={2024}
}
```