
MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer

[Paper] [ArXiv] [Code]

Official implementation of MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer.

What's New 🥳

  • (Sep 6, 2024) We released the implementation and scripts of MADTP (checkpoints and logs will come soon). [Code] 🚩

  • (Feb 27, 2024) MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer was accepted by CVPR 2024. [Paper] [ArXiv] 🎉

Installation

The code is tested with PyTorch==1.11.0, CUDA==11.3.1, and Python==3.8.13. The dependencies can be installed with:

conda env create -f environment.yml
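
To verify the environment, you can activate it and check the installed PyTorch/CUDA versions. This is a minimal sketch; the environment name madtp is an assumption, so use the name defined in environment.yml:

    conda activate madtp   # hypothetical environment name; check environment.yml
    python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"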

Supported Tasks, Models, and Datasets

| Type | Supported Tasks | Supported Models | Supported Datasets |
|------|-----------------|------------------|---------------------|
| Multi-modal | Visual Reasoning | BLIP (instructions) | NLVR2 |
| Multi-modal | Image Caption | BLIP (instructions) | COCO Caption |
| Multi-modal | Visual Question Answering | BLIP (instructions) | VQAv2 |
| Multi-modal | Image-Text Retrieval | CLIP (instructions), BLIP (instructions) | COCO, Flickr30k |
| Multi-modal | Text-Image Retrieval | CLIP (instructions), BLIP (instructions) | COCO, Flickr30k |

Visual Reasoning on the NLVR2 Dataset

  • Dataset & Annotation

    Download the NLVR2 dataset, unzip it under the datasets folder, and modify the image_root in the config accordingly. Download the all-in-one annotations (covering the Visual Reasoning, Image Caption, VQA, Image-Text Retrieval, and Text-Image Retrieval tasks) from this link, unzip them under the annotation folder, and modify the annotation in the config accordingly. See here for the expected folder structure, and the setup sketch after this list for an example of the commands involved.

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument in the scripts accordingly. For example, to evaluate a compressed model with a 0.5 reduce ratio:

    python -m torch.distributed.run --nproc_per_node=8 compress_nlvr.py --evaluate \
    --pretrained output/nlvr_nlvr2_compression_p0.5/model_base_nlvr_nlvr2_p0.5_compressed.pth \
    --config ./configs/nlvr.yaml \
    --output_dir output/nlvr_nlvr2_compression_p0.5
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify the pretrained in the config accordingly. For example, to compress the model at a 0.5 reduce ratio on 8 A100 GPUs (80G):

    python -m torch.distributed.run --nproc_per_node=8 compress_nlvr_dtp.py --p 0.5 --epoch 15 \
    --pretrained pretrained/model_base_nlvr.pth \
    --config ./configs/nlvr.yaml \
    --output_dir output/nlvr_nlvr2_compression_p0.5
  • Resources

    | Reduction | Uncompressed Model | Compression Script | Training Log | Compressed Checkpoint | Evaluation Script |
    |-----------|--------------------|--------------------|--------------|-----------------------|-------------------|
    | 0.3 | Download | Link | Download | Download | Link |
    | 0.5 | Download | Link | Download | Download | Link |
    | 0.6 | Download | Link | Download | Download | Link |
    | 0.7 | Download | Link | Download | Download | Link |
    | 0.8 | Download | Link | Download | Download | Link |
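
The dataset and annotation setup referenced above follows the same pattern for every task in this README. Below is a minimal sketch for NLVR2; the archive names and exact paths are assumptions and should be adjusted to whatever you actually download and to the expected folder structure:

    # hypothetical archive names; place the NLVR2 images under datasets/vision/NLVR2
    mkdir -p datasets/vision annotation
    unzip NLVR2.zip -d datasets/vision/
    # place the all-in-one annotations under annotation/
    unzip annotation.zip -d ./
    # finally, point the config (e.g. configs/nlvr.yaml) at these locations:
    #   image_root -> datasets/vision/NLVR2
    #   annotation -> annotation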

Image Caption on the COCO Caption Dataset

  • Dataset & Annotation

    Download the COCO Caption dataset, unzip it under the datasets folder, and modify the image_root in the config accordingly. Download the all-in-one annotations from this link, unzip them under the annotation folder, and modify the annotation in the config accordingly. See here for the expected folder structure.

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument in the scripts accordingly. For example, to evaluate a compressed model with a 0.5 reduce ratio:

    python -m torch.distributed.run --nproc_per_node=8 compress_caption_dtp.py --evaluate \
    --pretrained output/caption_coco_compression_p0.5/model_base_caption_capfilt_large_coco_p0.5_compressed.pth \
    --config ./configs/caption_coco.yaml \
    --output_dir output/caption_coco_compression_p0.5
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify the pretrained in the config accordingly. For example, to compress the model at a 0.5 reduce ratio on 8 A100 GPUs (80G):

    python -m torch.distributed.run --nproc_per_node=8 compress_caption_dtp.py --p 0.5 --epoch 5 \
    --pretrained pretrained/model_base_caption_capfilt_large.pth \
    --config ./configs/caption_coco.yaml \
    --output_dir output/caption_coco_compression_p0.5

Visual Question Answering on the VQAv2 Dataset

  • Dataset & Annotation

    Download the VQAv2 dataset and the Visual Genome dataset, unzip them under the datasets folder, and modify the image_root in the config accordingly. Download the all-in-one annotations from this link, unzip them under the annotation folder, and modify the annotation in the config accordingly. See here for the expected folder structure.

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument in the scripts accordingly. For example, to evaluate a compressed model with a 0.5 reduce ratio (note that the script generates an answer file vqa_result.json, which must be submitted to the official server to obtain evaluation results; a sanity-check sketch follows this list):

    python -m torch.distributed.run --nproc_per_node=8 compress_vqa_dtp.py --evaluate \
    --pretrained output/vqa_vqa2_compression_p0.5/model_base_vqa_capfilt_large_vqa2_p0.5_compressed.pth \
    --config ./configs/vqa.yaml \
    --output_dir output/vqa_vqa2_compression_p0.5
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify the pretrained in the config accordingly. For example, to compress the model at a 0.5 reduce ratio on 8 A100 GPUs (80G):

    python -m torch.distributed.run --nproc_per_node=8 compress_vqa_dtp.py --p 0.5 --epoch 3 \
    --pretrained pretrained/model_base_vqa_capfilt_large.pth \
    --config ./configs/vqa.yaml \
    --output_dir output/vqa_vqa2_compression_p0.5
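
Before submitting, it can help to sanity-check the generated answer file. The following is a minimal sketch, assuming the standard VQAv2 result format (a JSON list of {"question_id", "answer"} records) and a hypothetical output path that may differ from where the script actually writes vqa_result.json:

    # hypothetical result path; check the output_dir used by the script
    python -c "import json; r = json.load(open('output/vqa_vqa2_compression_p0.5/result/vqa_result.json')); print(len(r), r[0])"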

Image-Text and Text-Image Retrieval on the COCO Dataset

  • Dataset & Annotation

    Download the COCO dataset, unzip it under the datasets folder, and modify the image_root in the config accordingly. Download the all-in-one annotations from this link, unzip them under the annotation folder, and modify the annotation in the config accordingly. See here for the expected folder structure.

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument in the scripts accordingly. For example, to evaluate a compressed model with a 0.5 reduce ratio:

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_dtp.py --evaluate \
    --pretrained output/retrieval_coco_compression_p0.5/model_base_retrieval_coco_p0.5_compressed.pth \
    --config ./configs/retrieval_coco.yaml \
    --output_dir output/retrieval_coco_compression_p0.5
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify the pretrained in the config accordingly. For example, to compress the model at a 0.5 reduce ratio on 8 A100 GPUs (80G):

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_dtp.py --p 0.5 --epoch 5 \
    --pretrained pretrained/model_base_retrieval_coco.pth \
    --config ./configs/retrieval_coco.yaml \
    --output_dir output/retrieval_coco_compression_p0.5

Image-Text and Text-Image Retrieval on the Flickr30K Dataset

  • Dataset & Annotation

    Download the Flickr30k dataset, unzip it under the datasets folder, and modify the image_root in the config accordingly. Download the all-in-one annotations from this link, unzip them under the annotation folder, and modify the annotation in the config accordingly. See here for the expected folder structure.

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument in the scripts accordingly. For example, to evaluate a model compressed at a 0.5 reduce ratio (i.e., 2x compression):

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_flickr.py --evaluate \
    --pretrained output/retrieval_flickr_compression_2x/model_base_retrieval_flickr_2x_compressed.pth \
    --config ./configs/retrieval_flickr.yaml \
    --output_dir output/retrieval_flickr_compression_2x
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify the pretrained in the config accordingly. For example, to compress the model at a 0.5 reduce ratio on 8 A100 GPUs (80G):

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_flickr_dtp.py --p 0.5 --epoch 10 \
    --pretrained pretrained/model_base_retrieval_flickr.pth \
    --config ./configs/retrieval_flickr.yaml \
    --output_dir output/retrieval_flickr_compression_p0.5

Image-Text and Text-Image Retrieval on the COCO Dataset with CLIP

  • Dataset & Annotation

    Download the COCO dataset, unzip it under the datasets folder, and modify the image_root in the config accordingly. Download the all-in-one annotations from this link, unzip them under the annotation folder, and modify the annotation in the config accordingly. See here for the expected folder structure.

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument in the scripts accordingly. For example, to evaluate a compressed model with a 0.5 reduce ratio:

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_clip_dtp.py --evaluate \
    --pretrained output/retrieval_coco_clip_compression_p0.5/clip_large_retrieval_coco_p0.5_compressed.pth \
    --config ./configs/retrieval_coco_clip.yaml \
    --output_dir output/retrieval_coco_clip_compression_p0.5
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify the pretrained in the config accordingly. For example, to compress the model at a 0.5 reduce ratio on 8 A100 GPUs (80G):

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_clip_dtp.py --p 0.5 --epoch 5 \
    --pretrained pretrained/clip_large_retrieval_coco.pth \
    --config ./configs/retrieval_coco_clip.yaml \
    --output_dir output/retrieval_coco_clip_compression_p0.5

Image-Text and Text-Image Retrieval on the Flickr30K Dataset with CLIP

  • Dataset & Annotation

    Download the Flickr30k dataset, unzip it under the datasets folder, and modify the image_root in the config accordingly. Download the all-in-one annotations from this link, unzip them under the annotation folder, and modify the annotation in the config accordingly. See here for the expected folder structure.

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument in the scripts accordingly. For example, to evaluate a compressed model with a 0.5 reduce ratio:

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_clip_dtp.py --evaluate \
    --pretrained output/retrieval_flickr_clip_compression_p0.5/checkpoint_best.pth \
    --config ./configs/retrieval_flickr_clip.yaml \
    --output_dir output/retrieval_flickr_clip_compression_p0.5
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify the pretrained in the config accordingly. For example, to compress the model at a 0.5 reduce ratio on 8 A100 GPUs (80G):

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_clip_dtp.py --p 0.5 --epoch 10 \
    --pretrained pretrained/clip_large_retrieval_flickr.pth \
    --config ./configs/retrieval_flickr_clip.yaml \
    --output_dir output/retrieval_flickr_clip_compression_p0.5

Common Issues

1. Evaluation with single GPU

  • For both BLIP and CLIP models, evaluation can be run on a single GPU. For example, to evaluate the BLIP model compressed at a 0.5 reduce ratio (2x) on the NLVR2 dataset:

    python compress_nlvr_dtp.py --evaluate \
    --pretrained output/nlvr_nlvr2_compression_p0.5/checkpoint_best.pth \
    --config ./configs/nlvr.yaml \
    --output_dir output/nlvr_nlvr2_compression_p0.5

2. Compress with single GPU

  • For both BLIP and CLIP models, compression can also be run on a single GPU. For example, to compress the BLIP model at a 0.5 reduce ratio on the NLVR2 dataset:

    python compress_nlvr_dtp.py --p 0.5 --epoch 15 \
    --pretrained pretrained/model_base_nlvr.pth \
    --config ./configs/nlvr.yaml \
    --output_dir output/nlvr_nlvr2_compression_p0.5

3. Other issues

You can post them on the Issues page.

Expected Folder Structures

├── annotation
│   ├── answer_list.json
│   ├── coco_gt
│   │   ├── coco_karpathy_test_gt.json
│   │   └── coco_karpathy_val_gt.json
│   ├── ...
├── clip                                               
├── compress_caption_dtp.py             
├── compress_nlvr_dtp.py                  
├── compress ...    
├── configs                                             
├── data                                        
├── datasets
│   └── vision
│       ├── coco
│       ├── flickr
│       ├── NLVR2     
│       ├── ...                                                                               
├── log                                     
├── models            
├── output                                    
├── pretrained
│   ├── bert-base-uncased
│   ├── clip_large_retrieval_coco.pth
│   ├── clip_large_retrieval_flickr.pth
│   ├── ...       
├── transform                                                                           
└── utils.py                                

Acknowledgments

This code is built upon BLIP, CLIP, UPop, and timm. We thank the original authors for their open-source work.

Citation

If you find this work useful, please consider citing the corresponding paper:

@inproceedings{cao2024madtp,
  title={MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer},
  author={Cao, Jianjian and Ye, Peng and Li, Shengze and Yu, Chong and Tang, Yansong and Lu, Jiwen and Chen, Tao},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}
