This repository contains the code for our research paper *Privacy and Accuracy-Aware Model Deduplication for Emerging Model Infrastructures*.
Training the ViT-large model with a large batch size of 50 requires a GPU with 40 GB of memory. Inference with a batch size of 16 requires about 11 GB of GPU memory for ViT-large, 6 GB for Roberta-base, and 5 GB for ResNet152.
The online serving experiments use an AWS c5a.xlarge instance (7.63 GB main memory, 4 vCPUs) with a batch size of 1.
We used the fastDP Python package (available at https://github.com/awslabs/fast-differential-privacy) to train our DP models. The deduplication and inference code use the same dependencies.
Here are the recommended steps to set up the environment:
- Use conda to create a virtual environment with Python 3.10 to be compatible with the PyTorch release below. Here is an example command to create an environment named `fastdp`:
conda create -n fastdp python=3.10
- Install PyTorch 1.11, because versions above PyTorch 1.12 can slow down training, as stated on the home page of its official repository. Here is an example command to install PyTorch 1.11:
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
- Clone the official fastDP repository and install it in editable mode:
python -m pip install -e .
- Change the opacus version in `requirements.txt` to `opacus==1.4.0` (otherwise you may encounter a `torch.func` not found error).
- Install the requirements without their dependencies:
pip install -r requirements.txt --no-deps
- Install timm:
pip install -U timm
- Install transformers:
pip install -U transformers
- Install sentence-transformers:
pip install -U sentence-transformers
- Install accelerate:
pip install -U accelerate
- Install pandas:
pip install -U pandas
- Install ml-swissknife:
pip install -U ml-swissknife
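For convenience, the steps above can be combined into a single script. The sketch below is only a convenience wrapper for those steps: it assumes a CUDA 11.3 machine and that the fastDP `requirements.txt` contains an opacus line that can be pinned with `sed`; adapt it to your setup.

```bash
# Sketch of the full environment setup described above (adjust to your system).
conda create -n fastdp python=3.10 -y
conda activate fastdp
conda install -y pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch

# Clone fastDP and install it in editable mode.
git clone https://github.com/awslabs/fast-differential-privacy.git
cd fast-differential-privacy
python -m pip install -e .

# Pin opacus to 1.4.0 (assumes requirements.txt already lists an opacus entry),
# then install the remaining requirements without pulling in dependencies.
sed -i 's/^opacus.*/opacus==1.4.0/' requirements.txt
pip install -r requirements.txt --no-deps

# Packages used by training, deduplication, and inference.
pip install -U timm transformers sentence-transformers accelerate pandas ml-swissknife
```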
Clone this repository and activate the `fastdp` environment created above.
The steps to download the datasets are shown below. These data are used in both training and model deduplication.
For training DP models, please refer to the instructions in fastDP to set up the datasets. For model deduplication, create a folder named `data` in the root directory of this repo to hold the datasets.
Download the text-classification data with
wget https://nlp.cs.princeton.edu/projects/lm-bff/datasets.tar
and extract the files into the `data` folder with
tar xvf datasets.tar
We will only use the QNLI, MNLI, and SST-2 datasets.
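Equivalently, the download and extraction can be done in one go (a small sketch of the two commands above; the `-P` and `-C` flags simply place the archive and its contents under `data/`):

```bash
mkdir -p data
wget https://nlp.cs.princeton.edu/projects/lm-bff/datasets.tar -P data
tar xvf data/datasets.tar -C data
```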
The CIFAR-100 dataset can be downloaded from https://www.cs.toronto.edu/~kriz/cifar.html. Extract the downloaded file into the `data` folder.
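For example, assuming the Python version of CIFAR-100 linked from that page (`cifar-100-python.tar.gz`) is the file you need:

```bash
wget https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz -P data
tar xzf data/cifar-100-python.tar.gz -C data
```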
Here is the home page of the CelebA dataset: https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
After clicking the first link under "Downloads", you will be redirected to a Google Drive folder where you can download the data. Download the following:
1. The file `img_align_celeba.zip` in the `Img` folder.
2. All text files in `Eval` and `Anno`.
Then create a folder named `celeba` under `data`, unzip `img_align_celeba.zip` into `celeba`, and move all the text files into `celeba`.
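The folder creation, unzipping, and moves can be scripted as below; this sketch assumes `img_align_celeba.zip` and the text files from `Eval` and `Anno` were downloaded into the current directory.

```bash
mkdir -p data/celeba
unzip img_align_celeba.zip -d data/celeba   # image archive from the Img folder
mv *.txt data/celeba/                       # text files from Eval and Anno
```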
To run the experiments, first train the DP models. Next, run the model deduplication algorithms: GreedyN, the dynamic algorithms (DRD and DRED), and MCTS. Lastly, run model online serving in a resource-constrained environment, such as an AWS c5a.xlarge instance.
We used the example code in fastDP to train the DP models: Roberta-base models on QNLI, MNLI, and SST-2; ViT models on CIFAR100 and CelebA; and ResNet152 models on CelebA.
Here is an example bash script for training on the QNLI dataset:
#!/bin/bash
python -m text_classification.run_wrapper \
  --model_name_or_path roberta-base \
  --physical_batch_size 16 \
  --output_dir text_classification/models/base-qnli-eps0.2 \
  --task_name qnli \
  --target_epsilon 0.2
To train the five partitions of MNLI separately, we modified `dataset.py` in `examples/text_classification/src/` and `run_classification.py` in `examples/text_classification/` of fastDP; our modified versions are provided in `fastdp_examples/text_classification/` of this repository and can be used to replace the originals. You also need to change the `part_id` argument in `run_classification.py` to specify a partition; the choices for `part_id` are {0, 1, 2, 3, 4}.
To train the vision models, run the fastDP example code, specifying the model and dataset. We set `mini_bs` to 50 and `epochs` to 5, and used the default values for the other arguments.
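As a rough illustration only: the entry point and the `--model`/`--dataset_name` flag names below are assumptions (check the fastDP image_classification example for its actual interface); only the `mini_bs` and `epochs` settings come from our setup above.

```bash
# Hypothetical invocation; only --mini_bs 50 and --epochs 5 reflect the settings above.
python -m image_classification.main \
  --model vit_large_patch16_224 \
  --dataset_name CIFAR100 \
  --mini_bs 50 \
  --epochs 5
```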
After training the DP models, create a folder named `models` in the same parent directory as this repository `DP_Dedup`. Then move the trained models into the `models` folder; the deduplication algorithms load the models from there.
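The expected layout is therefore something like the following (the checkpoint source path below is a placeholder; file names depend on your training runs):

```bash
# parent_dir/
# ├── DP_Dedup/   <- this repository
# └── models/     <- trained DP model checkpoints loaded by the deduplication scripts
mkdir -p ../models
mv <path-to-trained-checkpoints>/* ../models/   # placeholder source path
```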
To run our dynamic algorithms, simply run `python run_drd.py` or `python run_dred.py` with a `task_type`. For example,
python run_drd.py --task_type text_qnli
To enable SVT, set the argument `extra_val_eps` to a value greater than or equal to 0. You can also change the cut-off value c by setting the argument `max_fails` to a positive integer. For example,
python run_drd.py --task_type text_qnli --extra_val_eps 0.0 --max_fails 3
You can save the final deduplicated models by enabling `--save_combined_storage True`.
To run the GreedyN algorithm, run `python run_greedyn.py` with the argument `every_n`, which means evaluation is performed after deduplicating every n blocks. The other arguments are the same as for the dynamic algorithms. For example,
python run_greedyn.py --task_type vision_vit --every_n 30
To run the MCTS algorithm, run `python run_mcts.py` with the argument `every_n`. For example,
python run_mcts.py --task_type vision_vit --every_n 20
To run online serving, run `python run_online_serving.py` with three arguments. For `--load_from`, you can choose from `memory` and `disk`. For `--workload`, you can choose from `random` and `roundrobin`. For `--dataset_name`, you can choose from `CIFAR100`, `CelebA`, and `qnli`.
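For example, to serve CIFAR100 models from memory under the random workload:
python run_online_serving.py --dataset_name CIFAR100 --workload random --load_from memory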
You can enable pruning or quantization by setting `--prune True` or `--quantize True`. Currently, we have only experimented with them on ResNet152.
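For example, building on the serving command above (assuming `--prune` combines with the same serving arguments):
python run_online_serving.py --dataset_name CIFAR100 --workload random --load_from memory --prune True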
To evaluate the original models, you can comment or uncomment the task in `evaluate_models.py` and run `python evaluate_models.py` with the same arguments as described in the dynamic algorithms section.
Here are the commands for reproducing the major results in our paper.
- Overall effectiveness (Fig. 2)
To obtain the accuracy of the original models and "retrained" models, you need to first train the models with the listed epsilon values. Next, choose a function call in `evaluate_models.py` and run the model evaluation script:
python evaluate_models.py --task_type text_qnli
python evaluate_models.py --task_type vision_vit
python evaluate_models.py --task_type vision_resnet
To run deduplication using DRD:
python run_drd.py --task_type text_qnli
python run_drd.py --task_type text_mnli_sst2
python run_drd.py --task_type text_mnli
python run_drd.py --task_type vision_vit
python run_drd.py --task_type vision_resnet
To use DRED, just change the file name from `run_drd.py` to `run_dred.py`.
- Overall effectiveness (Fig. 3)
To run any combination of dataset (CIFAR100, CelebA, qnli), workload type (random, round robin), and with/without deduplication, here is an example that runs ResNet on CIFAR100 with deduplication on the random workload:
python run_online_serving.py --dataset_name CIFAR100 --workload random --load_from memory
- Comparison of deduplication algorithms (Tab. 2)
To run MCTS:
python run_mcts.py --task_type text_qnli --every_n 20
python run_mcts.py --task_type text_qnli --every_n 30
python run_mcts.py --task_type vision_vit --every_n 20
python run_mcts.py --task_type vision_vit --every_n 30
python run_mcts.py --task_type vision_resnet --every_n 20
python run_mcts.py --task_type vision_resnet --every_n 30
To run GreedyN:
python run_greedyn.py --task_type text_qnli --every_n 10
python run_greedyn.py --task_type text_qnli --every_n 20
python run_greedyn.py --task_type text_qnli --every_n 30
python run_greedyn.py --task_type text_qnli --every_n 40
python run_greedyn.py --task_type vision_vit --every_n 10
python run_greedyn.py --task_type vision_vit --every_n 20
python run_greedyn.py --task_type vision_vit --every_n 30
python run_greedyn.py --task_type vision_vit --every_n 40
python run_greedyn.py --task_type vision_resnet --every_n 10
python run_greedyn.py --task_type vision_resnet --every_n 20
python run_greedyn.py --task_type vision_resnet --every_n 30
python run_greedyn.py --task_type vision_resnet --every_n 40
To run dynamic algorithms:
python run_drd.py --task_type text_qnli
python run_dred.py --task_type text_qnli
python run_drd.py --task_type vision_vit
python run_dred.py --task_type vision_vit
python run_drd.py --task_type vision_resnet
python run_dred.py --task_type vision_resnet
- Comparison of base model selection in different scenarios (Fig. 4)
Single model (C1):
python run_drd.py --task_type text_qnli --dummy_base_model 0
python run_drd.py --task_type text_qnli --dummy_base_model 1
python run_drd.py --task_type text_qnli --dummy_base_model 2
python run_drd.py --task_type text_qnli --dummy_base_model 3
For DRED, just change `run_drd.py` to `run_dred.py`.
Single model (C2):
python run_drd.py --task_type vision_vit --dummy_base_model 0
python run_drd.py --task_type vision_vit --dummy_base_model 1
python run_drd.py --task_type vision_vit --dummy_base_model 2
python run_drd.py --task_type vision_vit --dummy_base_model 3
For DRED, just change `run_drd.py` to `run_dred.py`.
Single model (C3):
python run_drd.py --task_type vision_resnet --dummy_base_model 0
python run_drd.py --task_type vision_resnet --dummy_base_model 1
python run_drd.py --task_type vision_resnet --dummy_base_model 2
python run_drd.py --task_type vision_resnet --dummy_base_model 3
For DRED, just change `run_drd.py` to `run_dred.py`.
Multiple base models (C4):
python run_drd.py --task_type text_qnli --in_group_n_base 1
python run_drd.py --task_type text_qnli --in_group_n_base 2
python run_drd.py --task_type text_qnli --in_group_n_base 3
python run_drd.py --task_type text_qnli --in_group_n_base 4
For DRED, just change `run_drd.py` to `run_dred.py`.
Intra-data vs. inter-data:
python run_drd.py --task_type text_qnli_mnli
python run_drd.py --task_type text_sst2_mnli
python run_drd.py --task_type text_mnli
For DRED, just change `run_drd.py` to `run_dred.py`.
- Deduplication using private data with SVT
python run_drd.py --task_type text_qnli --big_batch True --extra_val_eps 0
python run_drd.py --task_type vision_resnet --big_batch True --extra_val_eps 0
For DRED, just change `run_drd.py` to `run_dred.py`.
Note that the overall compression ratio computation can vary across experiments; the value shown at the end of the output may not match the value reported in the paper.
The code for the baseline Dedup can be found at https://github.com/asu-cactus/Model_Deduplication_Train_Text_Classification_Model/tree/dp_dedup