
MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer

[Paper] [ArXiv] [Code]

Official implementation of MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer.

What's New 🥳

  • (Sep 6, 2024) We released the implementation and scripts of MADTP (checkpoints and logs will come soon). [Code] 🚩

  • (Feb 27, 2024) MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer was accepted by CVPR 2024. [Paper] [ArXiv] 🎉

Installation

The code is tested with PyTorch==1.11.0, CUDA==11.3.1, and Python==3.8.13. The dependencies can be installed with:

conda env create -f environment.yml
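
To verify the environment, you can activate it and check the installed PyTorch/CUDA versions. This is a minimal sketch; the environment name madtp is an assumption, so use the name defined in environment.yml:

    conda activate madtp   # hypothetical environment name; check environment.yml
    python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"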

Supported Tasks, Models, and Datasets

| Type | Supported Tasks | Supported Models | Supported Datasets |
|------|-----------------|------------------|---------------------|
| Multi-modal | Visual Reasoning | BLIP (instructions) | NLVR2 |
| Multi-modal | Image Caption | BLIP (instructions) | COCO Caption |
| Multi-modal | Visual Question Answering | BLIP (instructions) | VQAv2 |
| Multi-modal | Image-Text Retrieval | CLIP (instructions), BLIP (instructions) | COCO, Flickr30k |
| Multi-modal | Text-Image Retrieval | CLIP (instructions), BLIP (instructions) | COCO, Flickr30k |

Visual Reasoning on the NLVR2 Dataset

  • Dataset & Annotation

    Download the NLVR2 dataset, unzip it under the datasets folder, and modify the image_root in the config accordingly. Download the all-in-one annotations (covering the Visual Reasoning, Image Caption, VQA, Image-Text Retrieval, and Text-Image Retrieval tasks) from this link, unzip them under the annotation folder, and modify the annotation in the config accordingly. See here for the expected folder structure, and the setup sketch after this list for an example of the commands involved.

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument in the scripts accordingly. For example, to evaluate a compressed model with a 0.5 reduce ratio:

    python -m torch.distributed.run --nproc_per_node=8 compress_nlvr.py --evaluate \
    --pretrained output/nlvr_nlvr2_compression_p0.5/model_base_nlvr_nlvr2_p0.5_compressed.pth \
    --config ./configs/nlvr.yaml \
    --output_dir output/nlvr_nlvr2_compression_p0.5
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify the pretrained in the config accordingly. For example, to compress the model at a 0.5 reduce ratio on 8 A100 GPUs (80G):

    python -m torch.distributed.run --nproc_per_node=8 compress_nlvr_dtp.py --p 0.5 --epoch 15 \
    --pretrained pretrained/model_base_nlvr.pth \
    --config ./configs/nlvr.yaml \
    --output_dir output/nlvr_nlvr2_compression_p0.5
  • Resources

    | Reduction | Uncompressed Model | Compression Script | Training Log | Compressed Checkpoint | Evaluation Script |
    |-----------|--------------------|--------------------|--------------|-----------------------|-------------------|
    | 0.3 | Download | Link | Download | Download | Link |
    | 0.5 | Download | Link | Download | Download | Link |
    | 0.6 | Download | Link | Download | Download | Link |
    | 0.7 | Download | Link | Download | Download | Link |
    | 0.8 | Download | Link | Download | Download | Link |
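
The dataset and annotation setup referenced above follows the same pattern for every task in this README. Below is a minimal sketch for NLVR2; the archive names and exact paths are assumptions and should be adjusted to whatever you actually download and to the expected folder structure:

    # hypothetical archive names; place the NLVR2 images under datasets/vision/NLVR2
    mkdir -p datasets/vision annotation
    unzip NLVR2.zip -d datasets/vision/
    # place the all-in-one annotations under annotation/
    unzip annotation.zip -d ./
    # finally, point the config (e.g. configs/nlvr.yaml) at these locations:
    #   image_root -> datasets/vision/NLVR2
    #   annotation -> annotation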

Image Caption on the COCO Caption Dataset

  • Dataset & Annotation

    Download the COCO Caption dataset, unzip it under the datasets folder, and modify the image_root in the config accordingly. Download the all-in-one annotations from this link, unzip them under the annotation folder, and modify the annotation in the config accordingly. See here for the expected folder structure.

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument in the scripts accordingly. For example, to evaluate a compressed model with a 0.5 reduce ratio:

    python -m torch.distributed.run --nproc_per_node=8 compress_caption_dtp.py --evaluate \
    --pretrained output/caption_coco_compression_p0.5/model_base_caption_capfilt_large_coco_p0.5_compressed.pth \
    --config ./configs/caption_coco.yaml \
    --output_dir output/caption_coco_compression_p0.5
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify the pretrained in the config accordingly. For example, to compress the model at a 0.5 reduce ratio on 8 A100 GPUs (80G):

    python -m torch.distributed.run --nproc_per_node=8 compress_caption_dtp.py --p 0.5 --epoch 5 \
    --pretrained pretrained/model_base_caption_capfilt_large.pth \
    --config ./configs/caption_coco.yaml \
    --output_dir output/caption_coco_compression_p0.5

Visual Question Answering on the VQAv2 Dataset

  • Dataset & Annotation

    Download the VQAv2 dataset and the Visual Genome dataset, unzip them under the datasets folder, and modify the image_root in the config accordingly. Download the all-in-one annotations from this link, unzip them under the annotation folder, and modify the annotation in the config accordingly. See here for the expected folder structure.

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument in the scripts accordingly. For example, to evaluate a compressed model with a 0.5 reduce ratio (note that the script generates an answer file vqa_result.json, which must be submitted to the official server to obtain evaluation results; a sanity-check sketch follows this list):

    python -m torch.distributed.run --nproc_per_node=8 compress_vqa_dtp.py --evaluate \
    --pretrained output/vqa_vqa2_compression_p0.5/model_base_vqa_capfilt_large_vqa2_p0.5_compressed.pth \
    --config ./configs/vqa.yaml \
    --output_dir output/vqa_vqa2_compression_p0.5
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify the pretrained in the config accordingly. For example, to compress the model at a 0.5 reduce ratio on 8 A100 GPUs (80G):

    python -m torch.distributed.run --nproc_per_node=8 compress_vqa_dtp.py --p 0.5 --epoch 3 \
    --pretrained pretrained/model_base_vqa_capfilt_large.pth \
    --config ./configs/vqa.yaml \
    --output_dir output/vqa_vqa2_compression_p0.5
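
Before submitting, it can help to sanity-check the generated answer file. The following is a minimal sketch, assuming the standard VQAv2 result format (a JSON list of {"question_id", "answer"} records) and a hypothetical output path that may differ from where the script actually writes vqa_result.json:

    # hypothetical result path; check the output_dir used by the script
    python -c "import json; r = json.load(open('output/vqa_vqa2_compression_p0.5/result/vqa_result.json')); print(len(r), r[0])"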

Image-Text and Text-Image Retrieval on the COCO Dataset

  • Dataset & Annotation

    Download the COCO dataset, unzip it under the datasets folder, and modify the image_root in the config accordingly. Download the all-in-one annotations from this link, unzip them under the annotation folder, and modify the annotation in the config accordingly. See here for the expected folder structure.

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument in the scripts accordingly. For example, to evaluate a compressed model with a 0.5 reduce ratio:

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_dtp.py --evaluate \
    --pretrained output/retrieval_coco_compression_p0.5/model_base_retrieval_coco_p0.5_compressed.pth \
    --config ./configs/retrieval_coco.yaml \
    --output_dir output/retrieval_coco_compression_p0.5
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify the pretrained in the config accordingly. For example, to compress the model at a 0.5 reduce ratio on 8 A100 GPUs (80G):

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_dtp.py --p 0.5 --epoch 5 \
    --pretrained pretrained/model_base_retrieval_coco.pth \
    --config ./configs/retrieval_coco.yaml \
    --output_dir output/retrieval_coco_compression_p0.5

Image-Text and Text-Image Retrieval on the Flickr30K Dataset

  • Dataset & Annotation

    Download the Flickr30k dataset, unzip it under the datasets folder, and modify the image_root in the config accordingly. Download the all-in-one annotations from this link, unzip them under the annotation folder, and modify the annotation in the config accordingly. See here for the expected folder structure.

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument in the scripts accordingly. For example, to evaluate a model compressed at a 0.5 reduce ratio (i.e., 2x compression):

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_flickr.py --evaluate \
    --pretrained output/retrieval_flickr_compression_2x/model_base_retrieval_flickr_2x_compressed.pth \
    --config ./configs/retrieval_flickr.yaml \
    --output_dir output/retrieval_flickr_compression_2x
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify the pretrained in the config accordingly. For example, to compress the model at a 0.5 reduce ratio on 8 A100 GPUs (80G):

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_flickr_dtp.py --p 0.5 --epoch 10 \
    --pretrained pretrained/model_base_retrieval_flickr.pth \
    --config ./configs/retrieval_flickr.yaml \
    --output_dir output/retrieval_flickr_compression_p0.5

Image-Text and Text-Image Retrieval on the COCO Dataset with CLIP

  • Dataset & Annotation

    Download the COCO dataset, unzip it under the datasets folder, and modify the image_root in the config accordingly. Download the all-in-one annotations from this link, unzip them under the annotation folder, and modify the annotation in the config accordingly. See here for the expected folder structure.

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument in the scripts accordingly. For example, to evaluate a compressed model with a 0.5 reduce ratio:

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_clip_dtp.py --evaluate \
    --pretrained output/retrieval_coco_clip_compression_p0.5/clip_large_retrieval_coco_p0.5_compressed.pth \
    --config ./configs/retrieval_coco_clip.yaml \
    --output_dir output/retrieval_coco_clip_compression_p0.5
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify the pretrained in the config accordingly. For example, to compress the model at a 0.5 reduce ratio on 8 A100 GPUs (80G):

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_clip_dtp.py --p 0.5 --epoch 5 \
    --pretrained pretrained/clip_large_retrieval_coco.pth \
    --config ./configs/retrieval_coco_clip.yaml \
    --output_dir output/retrieval_coco_clip_compression_p0.5

Image-Text and Text-Image Retrieval on the Flickr30K Dataset with CLIP

  • Dataset & Annotation

    Download the Flickr30k dataset, unzip it under the datasets folder, and modify the image_root in the config accordingly. Download the all-in-one annotations from this link, unzip them under the annotation folder, and modify the annotation in the config accordingly. See here for the expected folder structure.

  • Evaluation

    Download the compressed checkpoints from the table below, put them under the output folder, and modify the --pretrained argument in the scripts accordingly. For example, to evaluate a compressed model with a 0.5 reduce ratio:

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_clip_dtp.py --evaluate \
    --pretrained output/retrieval_flickr_clip_compression_p0.5/checkpoint_best.pth \
    --config ./configs/retrieval_flickr_clip.yaml \
    --output_dir output/retrieval_flickr_clip_compression_p0.5
  • Compression

    Download the uncompressed model from the table below, put it under the pretrained folder, and modify the pretrained in the config accordingly. For example, to compress the model at a 0.5 reduce ratio on 8 A100 GPUs (80G):

    python -m torch.distributed.run --nproc_per_node=8 compress_retrieval_clip_dtp.py --p 0.5 --epoch 10 \
    --pretrained pretrained/clip_large_retrieval_flickr.pth \
    --config ./configs/retrieval_flickr_clip.yaml \
    --output_dir output/retrieval_flickr_clip_compression_p0.5

Common Issues

1. Evaluation with single GPU

  • For both BLIP and CLIP models, evaluation can be run on a single GPU. For example, to evaluate the BLIP model compressed at a 0.5 reduce ratio (2x) on the NLVR2 dataset:

    python compress_nlvr_dtp.py --evaluate \
    --pretrained output/nlvr_nlvr2_compression_p0.5/checkpoint_best.pth \
    --config ./configs/nlvr.yaml \
    --output_dir output/nlvr_nlvr2_compression_p0.5

2. Compress with single GPU

  • For both BLIP and CLIP models, compression can also be run on a single GPU. For example, to compress the BLIP model at a 0.5 reduce ratio on the NLVR2 dataset:

    python compress_nlvr_dtp.py --p 0.5 --epoch 15 \
    --pretrained pretrained/model_base_nlvr.pth \
    --config ./configs/nlvr.yaml \
    --output_dir output/nlvr_nlvr2_compression_p0.5

3. Other issues

You can post them on the Issues page.

Expected Folder Structures

├── annotation
│   ├── answer_list.json
│   ├── coco_gt
│   │   ├── coco_karpathy_test_gt.json
│   │   └── coco_karpathy_val_gt.json
│   ├── ...
├── clip                                               
├── compress_caption_dtp.py             
├── compress_nlvr_dtp.py                  
├── compress ...    
├── configs                                             
├── data                                        
├── datasets
│   └── vision
│       ├── coco
│       ├── flickr
│       ├── NLVR2     
│       ├── ...                                                                               
├── log                                     
├── models            
├── output                                    
├── pretrained
│   ├── bert-base-uncased
│   ├── clip_large_retrieval_coco.pth
│   ├── clip_large_retrieval_flickr.pth
│   ├── ...       
├── transform                                                                           
└── utils.py                                

Acknowledgments

This code is built upon BLIP, CLIP, UPop, and timm. We thank the original authors for their open-source work.

Citation

If you find this work useful, please consider citing the corresponding paper:

@inproceedings{cao2024madtp,
  title={MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer},
  author={Cao, Jianjian and Ye, Peng and Li, Shengze and Yu, Chong and Tang, Yansong and Lu, Jiwen and Chen, Tao},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}
