DeltaBoost Documentation

News: DeltaBoost won the Honorable Mention for the Best Artifact Award at SIGMOD 2023!

DeltaBoost is a machine learning model based on gradient boosting decision trees (GBDT) that supports efficient machine unlearning, published at SIGMOD 2023. We provide two ways to reproduce the results in the paper: a master script and a step-by-step guide. The master script automatically downloads the datasets, builds DeltaBoost, runs the experiments, and summarizes the results; its estimated execution time is about a week. The step-by-step guide shows how to run each experiment in the paper.


Getting Started

Environment (Docker)

The recommended way to configure the environment is through a Docker image. Download the image by

docker pull jerrylife/deltaboost

Create a container named deltaboost based on the image.

docker run -d -t --name deltaboost jerrylife/deltaboost

Find the container ID in the first column by

docker ps

Execute the master script in the container in the background by

docker exec -t <container-ID> bash run.sh

You may also enter the container to observe the results by

docker exec -it <container-ID> bash

Important: download_datasets.sh is only tested for fresh execution. If a download is interrupted and needs to be restarted, please remove the data folder with rm -rf data/ before the next execution.

For convenience of manual configuration, we also provide the Dockerfile for building the image.

Environment (Step by Step)

The required packages for DeltaBoost include:

  • g++-10 or above
  • OpenSSL
  • OpenCL
  • CMake 3.15 or above
  • GMP
  • NTL
  • Boost
  • Python 3.9+

Install GCC, G++, OpenSSL, OpenCL, CMake, and GMP:

sudo apt install gcc-10 g++-10 libssl-dev opencl-headers cmake libgmp3-dev

Install NTL

NTL can be installed from source by

wget https://libntl.org/ntl-11.5.1.tar.gz
tar -xvf ntl-11.5.1.tar.gz
cd ntl-11.5.1/src
./configure SHARED=on
make -j
sudo make install

If NTL is not installed in the default location, you need to specify the path to NTL during compilation by

cmake .. -DNTL_PATH="PATH_TO_NTL"

Install Boost

DeltaBoost requires Boost >= 1.75.0. Since this version may not be available in the official apt repositories, you may need to install it manually.

Download and extract Boost 1.75.0.

wget https://boostorg.jfrog.io/artifactory/main/release/1.75.0/source/boost_1_75_0.tar.bz2
tar -xvf boost_1_75_0.tar.bz2

Install dependencies for building boost.

sudo apt-get install build-essential autotools-dev libicu-dev libbz2-dev libboost-all-dev

Start building.

./bootstrap.sh --prefix=/usr/
./b2
sudo ./b2 install

Reproduce Main Results (Master Script)

We provide a master script to reproduce the main results in the paper. The script will automatically download the datasets, build DeltaBoost, run the experiments, and summarize the results. The results will be saved in the fig/ and out/ directories. Simply run

bash run.sh

Prepare Data

Install Python Environment

DeltaBoost requires Python >= 3.9. The required packages are listed in python-utils/requirements.txt. Install the necessary modules by

pip install -r requirements.txt

Download and Preprocess Datasets

Download the datasets and remove instances from the training samples.

bash download_datasets.sh

This script will download 5 datasets from the LIBSVM website. After downloading and unzipping, some instances are removed from these datasets. The removal ratios are 0.1% and 1% by default, and the removal may take several minutes. If more ratios are needed, you can change the -r option of remove_sample.py. After the preparation, there should be a data/ directory with the following structure.

Important: download_datasets.sh is only tested for fresh execution. If a download is interrupted and needs to be restarted, please remove the data folder with rm -rf data/ before the next execution.

data
├── cadata
├── cadata.test
├── cadata.train
├── cadata.train.delete_1e-02
├── cadata.train.delete_1e-03
├── cadata.train.remain_1e-02
├── cadata.train.remain_1e-03
├── codrna.test
├── codrna.train
├── codrna.train.delete_1e-02
├── codrna.train.delete_1e-03
├── codrna.train.remain_1e-02
├── codrna.train.remain_1e-03
├── covtype
├── covtype.test
├── covtype.train
├── covtype.train.delete_1e-02
├── covtype.train.delete_1e-03
├── covtype.train.remain_1e-02
├── covtype.train.remain_1e-03
├── gisette.test
├── gisette.train
├── gisette.train.delete_1e-02
├── gisette.train.delete_1e-03
├── gisette.train.remain_1e-02
├── gisette.train.remain_1e-03
├── msd.test
├── msd.train
├── msd.train.delete_1e-02
├── msd.train.delete_1e-03
├── msd.train.remain_1e-02
└── msd.train.remain_1e-03
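
For illustration, the delete_*/remain_* files above can be produced by randomly splitting a training file. The following is a minimal sketch under assumed file paths and format, not the repository's remove_sample.py:

import numpy as np

# Minimal sketch of a delete/remain split (hypothetical, not remove_sample.py).
ratio = 1e-2                      # removal ratio, e.g. 1e-02 or 1e-03
with open("data/cadata.train") as f:
    lines = f.readlines()

rng = np.random.default_rng(0)
n_delete = int(len(lines) * ratio)
delete_idx = set(rng.choice(len(lines), size=n_delete, replace=False).tolist())

# Write the removed and remaining instances using the naming scheme shown above.
with open(f"data/cadata.train.delete_{ratio:.0e}", "w") as f_del, \
     open(f"data/cadata.train.remain_{ratio:.0e}", "w") as f_rem:
    for i, line in enumerate(lines):
        (f_del if i in delete_idx else f_rem).write(line)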

Build DeltaBoost

Build DeltaBoost by

mkdir build && cd build
cmake ..
make -j

An executable named build/bin/FedTree-train should be created. For convenience, you may create a symlink for this binary.

cd ..   # under root dir of DeltaBoost
ln -s build/bin/FedTree-train main

Usage of DeltaBoost

For simplicity, the usage guide assumes that the binary main has been created.

Basic Usage

DeltaBoost can be configured by a .conf file and/or command-line parameters. For example,

./main conf=conf/cadata.conf    # By .conf file
./main enable_delta=true nbr_size=10       # By parameters
./main conf=conf/cadata.conf enable_delta=true nbr_size=10  # By both methods

When both methods are used, the command-line parameters overwrite the values in the .conf file.


Parameter Guide

  • dataset_name (std::string)

    • Usage: The name of the dataset.
    • Default value: ""
  • save_model_name (std::string)

    • Usage: The name to save the model as.
    • Default value: ""
  • data (std::string)

    • Usage: Path to the training data.
    • Default value: "../dataset/test_dataset.txt"
  • test_data (std::string)

    • Usage: Path to the test data.
    • Default value: ""
  • remain_data (std::string)

    • Usage: Path to the remaining training data after deletion.
    • Default value: ""
  • delete_data (std::string)

    • Usage: Path to the deleted training data.
    • Default value: ""
  • n_parties (int)

    • Usage: The number of parties in the federated learning setting.
    • Default value: 2
  • mode (std::string)

    • Usage: The mode of federated learning (e.g., "horizontal" or "centralized").
    • Default value: "horizontal"
  • privacy_tech (std::string)

    • Usage: The privacy technique to use (e.g., "he" or "none").
    • Default value: "he"
  • learning_rate (float)

    • Usage: The learning rate for the gradient boosting decision tree.
    • Default value: 1
  • max_depth (int)

    • Usage: The maximum depth of the trees in the gradient boosting decision tree.
    • Default value: 6
  • n_trees (int)

    • Usage: The number of trees in the gradient boosting decision tree.
    • Default value: 40
  • objective (std::string)

    • Usage: The objective function for the gradient boosting decision tree (e.g., "reg:linear").
    • Default value: "reg:linear"
  • num_class (int)

    • Usage: The number of classes in the data.
    • Default value: 1
  • tree_method (std::string)

    • Usage: The method to use for tree construction (e.g., "hist").
    • Default value: "hist"
  • lambda (float)

    • Usage: The lambda parameter for the gradient boosting decision tree.
    • Default value: 1
  • verbose (int)

    • Usage: Controls the verbosity of the output.
    • Default value: 1
  • enable_delta (std::string)

    • Usage: Enable or disable DeltaBoost ("true" or "false").
    • Default value: "false"
  • remove_ratio (float)

    • Usage: The ratio of data to be removed in delta boosting.
    • Default value: 0.0
  • min_diff_gain (int)

    • Usage: Undocumented.
    • Default value: ""
  • max_range_gain (int)

    • Usage: Undocumented.
    • Default value: ""
  • n_used_trees (int)

    • Usage: The number of trees to be used in delta boosting.
    • Default value: 0
  • max_bin_size (int)

    • Usage: The maximum bin size in delta boosting.
    • Default value: 100
  • nbr_size (int)

    • Usage: The neighbor size in delta boosting.
    • Default value: 1
  • gain_alpha (float)

    • Usage: The alpha parameter for the gain calculation in delta boosting.
    • Default value: 0.0
  • delta_gain_eps_feature (float)

    • Usage: The epsilon parameter for the gain calculation with respect to features in delta boosting.
    • Default value: 0.0
  • delta_gain_eps_sn (float)

    • Usage: The epsilon parameter for the gain calculation with respect to sample numbers in delta boosting.
    • Default value: 0.0
  • hash_sampling_round (int)

    • Usage: The number of rounds for hash sampling in delta boosting.
    • Default value: 1
  • n_quantized_bins (int)

    • Usage: The number of quantized bins in delta boosting.
    • Default value: ""
  • seed (int)

    • Usage: The seed for random number generation.
    • Default value: ""
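
Putting these parameters together, a configuration file can be sketched as follows. This is an illustrative example that assumes the .conf file uses the same key=value syntax as the command line; the values and paths are placeholders rather than the shipped conf/cadata.conf:

data=./data/codrna.train
test_data=./data/codrna.test
remain_data=./data/codrna.train.remain_1e-03
delete_data=./data/codrna.train.delete_1e-03
enable_delta=true
remove_ratio=1e-03
n_trees=10
max_depth=6
learning_rate=1
nbr_size=10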

Reproduce Main Results (Step by Step)

Before reproducing the main results, please make sure that the binary main has been created. All reported times were measured on two AMD EPYC 7543 32-core processors using 96 threads. If your machine does not have that many threads, you may

  • reduce the number of seeds, for example, to 5. However, this increases the variance of the calculated Hellinger distance.
  • reduce the number of threads, for example, with taskset -c 0-11. However, this increases the running time. If you want to use all the threads, simply remove taskset -c 0-x before the command. An example of a reduced run is shown below.
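
For example, the single-tree removal experiment in the next subsection can be launched with fewer seeds and fewer threads as follows (the seed count and CPU range below are illustrative):

taskset -c 0-11 bash test_remove_deltaboost_tree_1.sh 5  # 5 seeds on 12 threads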

First, create necessary folders to store results.

mkdir -p cache out fig

Removing in one tree (Table 4,5)

To test removing in a single tree with DeltaBoost, simply run

bash test_remove_deltaboost_tree_1.sh 100  # try 100 seeds

This script finishes in about 6 hours. After the execution, two folders will appear under the project root:

  • out/remove_test/tree1 contains accuracy of each model on five datasets.
  • cache/ contains two kinds of information:
    • original model, deleted model, and retrained model in json format.
    • detailed per-instance predictions in CSV format. This information is used to calculate the Hellinger distance (a sketch of this computation is shown after the list).
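
For reference, the Hellinger distance between two discrete distributions p and q is H(p, q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2). The following minimal sketch shows how such a distance can be computed from binned predictions; it is an illustration with placeholder data, not the repository's implementation:

import numpy as np

def hellinger(p, q):
    # Hellinger distance between two discrete distributions p and q.
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Bin two sets of per-instance predictions (e.g. from the deleted model M_d and
# the retrained model M_r) into a shared histogram before comparing them.
preds_deleted = np.random.rand(1000)    # placeholder predictions of M_d
preds_retrained = np.random.rand(1000)  # placeholder predictions of M_r
bins = np.linspace(0.0, 1.0, 65)
p, _ = np.histogram(preds_deleted, bins=bins)
q, _ = np.histogram(preds_retrained, bins=bins)
print(hellinger(p, q))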

To extract the information into a LaTeX table, run

# in project root
cd python-utils
python plot_results.py -t 1

The script extracts the accuracy and Hellinger distance of DeltaBoost into a LaTeX table. Cells for the baselines, which must be filled in manually, are left empty in this table.

Two files of summarized outputs are generated in out/:

  • out/accuracy_table_tree1.csv: Results of accuracy in Table 4. An example is shown below.
,,0.0874\textpm 0.0002,,,0.0873\textpm 0.0005
,,0.0874\textpm 0.0002,,,0.0873\textpm 0.0005
,,0.0873\textpm 0.0002,,,0.0872\textpm 0.0007
,,0.2611\textpm 0.0001,,,0.2610\textpm 0.0001
,,0.2611\textpm 0.0001,,,0.2611\textpm 0.0001
,,0.2611\textpm 0.0001,,,0.2610\textpm 0.0000
,,0.0731\textpm 0.0020,,,0.0787\textpm 0.0042
,,0.0731\textpm 0.0020,,,0.0786\textpm 0.0043
,,0.0731\textpm 0.0020,,,0.0790\textpm 0.0043
-,-,0.1557\textpm 0.0034,-,-,0.1643\textpm 0.0066
-,-,0.1557\textpm 0.0034,-,-,0.1643\textpm 0.0065
-,-,0.1558\textpm 0.0034,-,-,0.1644\textpm 0.0066
-,-,0.1009\textpm 0.0003,-,-,0.1009\textpm 0.0003
-,-,0.1009\textpm 0.0003,-,-,0.1009\textpm 0.0003
-,-,0.1009\textpm 0.0003,-,-,0.1009\textpm 0.0003
  • out/forget_table_tree1.csv: Results of Hellinger distance in Table 5. An example is shown below.
,,0.0002\textpm 0.0051,,,0.1046\textpm 0.2984
,,0.0000\textpm 0.0014,,,0.0070\textpm 0.0515
,,0.0162\textpm 0.1260,,,0.0300\textpm 0.1521
,,0.0000\textpm 0.0005,,,0.0069\textpm 0.0467
,,0.0007\textpm 0.0022,,,0.0070\textpm 0.0081
,,0.0000\textpm 0.0004,,,0.0051\textpm 0.0065
-,-,0.0058\textpm 0.0157,-,-,0.0087\textpm 0.0113
-,-,0.0034\textpm 0.0121,-,-,0.0033\textpm 0.0048
-,-,0.0041\textpm 0.0044,-,-,0.0126\textpm 0.0101
-,-,0.0028\textpm 0.0036,-,-,0.0093\textpm 0.0079

These two results might differ slightly from the results in the paper due to the randomness of the training process. However, the distance between $M_d$ and $M_r$ is very small, which is consistent with the results in the paper.

Removing in Multiple trees (Table 7)

To test removing in 10 trees with DeltaBoost, simply run

bash test_remove_deltaboost_tree_10.sh 100  # try 100 seeds

The script finishes in 2-3 days. After the execution, two folders will appear under the project root:

  • out/remove_test/tree10 contains accuracy of each model on five datasets.
  • cache/ contains two kinds of information:
    • original model, deleted model, and retrained model in json format.
    • detailed per-instance predictions in CSV format. This information is used to calculate the Hellinger distance.

To extract the information into a LaTeX table, run

# in project root
cd python-utils
python plot_results.py -t 10

The script extracts the accuracy and Hellinger distance of DeltaBoost into a LaTeX table. Cells for the baselines, which must be filled in manually, are left empty in this table.

Two files of summarized outputs are generated in out/:

  • out/accuracy_table_tree10.csv: Results of accuracy in Table 7(a). An example is shown below.
,,0.0616\textpm 0.0011,,,0.0617\textpm 0.0010
,,0.0617\textpm 0.0011,,,0.0618\textpm 0.0010
,,0.0617\textpm 0.0011,,,0.0617\textpm 0.0010
,,0.2265\textpm 0.0069,,,0.2265\textpm 0.0069
,,0.2264\textpm 0.0069,,,0.2265\textpm 0.0068
,,0.2264\textpm 0.0067,,,0.2255\textpm 0.0066
,,0.0509\textpm 0.0043,,,0.0490\textpm 0.0038
,,0.0509\textpm 0.0043,,,0.0490\textpm 0.0038
,,0.0508\textpm 0.0041,,,0.0497\textpm 0.0046
-,-,0.1272\textpm 0.0055,-,-,0.1396\textpm 0.0068
-,-,0.1274\textpm 0.0055,-,-,0.1400\textpm 0.0068
-,-,0.1273\textpm 0.0055,-,-,0.1399\textpm 0.0072
-,-,0.1040\textpm 0.0006,-,-,0.1040\textpm 0.0006
-,-,0.1040\textpm 0.0006,-,-,0.1040\textpm 0.0006
-,-,0.1041\textpm 0.0006,-,-,0.1040\textpm 0.0005
  • out/forget_table_tree10.csv: Results of Hellinger distance in Table 7(b). An example is shown below.
,,0.0130\textpm 0.0100,,,0.0088\textpm 0.0079
,,0.0129\textpm 0.0100,,,0.0089\textpm 0.0078
,,0.0112\textpm 0.0089,,,0.0118\textpm 0.0096
,,0.0112\textpm 0.0090,,,0.0118\textpm 0.0096
,,0.0106\textpm 0.0073,,,0.0312\textpm 0.0169
,,0.0106\textpm 0.0073,,,0.0312\textpm 0.0167
-,-,0.0240\textpm 0.0169,-,-,0.0247\textpm 0.0159
-,-,0.0239\textpm 0.0160,-,-,0.0249\textpm 0.0149
-,-,0.0194\textpm 0.0106,-,-,0.0249\textpm 0.0127
-,-,0.0194\textpm 0.0106,-,-,0.0248\textpm 0.0126

These two results might differ slightly from the results in the paper due to the randomness of the training process. However, the distance between $M_d$ and $M_r$ is very small, which is consistent with the results in the paper.

Efficiency (Table 6)

To test the efficiency, we need to perform a clean retrain of GBDT. To train a 10-tree GBDT, run

bash test_remove_gbdt_efficiency.sh 10

The script retrains GBDT on five datasets with two removal ratios, running each configuration once since GBDT training is deterministic. The script finishes in about 10 minutes. After the execution, the efficiency and speedup can be summarized by

python plot_time.py -t 10

The expected output should look like

Thunder	& DB-Train	& DB-Remove	& Speedup (Thunder) \\
 12.410	&  8.053 \textpm 3.976	 &  0.156 \textpm 0.047	 & 79.34x \\
 12.143	&  7.717 \textpm 4.134	 &  0.160 \textpm 0.035	 & 75.82x \\
 15.668	&  52.253 \textpm 4.796	 &  1.482 \textpm 2.260	 & 10.57x \\
 16.015	&  52.333 \textpm 4.107	 &  1.874 \textpm 3.364	 & 8.55x \\
 50.213	&  66.658 \textpm 7.747	 &  0.956 \textpm 0.265	 & 52.51x \\
 47.089	&  65.322 \textpm 7.235	 &  1.123 \textpm 0.259	 & 41.95x \\
 12.434	&  6.038 \textpm 5.198	 &  0.068 \textpm 0.042	 & 183.03x \\
 12.524	&  4.704 \textpm 3.282	 &  0.053 \textpm 0.037	 & 237.99x \\
 22.209	&  53.451 \textpm 3.659	 &  3.523 \textpm 0.812	 & 6.30x \\
 24.067	&  54.221 \textpm 2.952	 &  3.422 \textpm 0.700	 & 7.03x \\

The time may vary with the environment and hardware, but the speedup remains as significant as that in Table 6 of the paper.

We also provide a script to run the baselines (sklearn and xgboost) for efficiency comparison. Note that the performance of xgboost varies significantly by version; for example, some versions favor high-dimensional datasets but run more slowly on large low-dimensional datasets. We adopt the default conda version, xgboost==1.5.0, in our experiments. To run the baselines, run

taskset -c 0-95 python baseline.py  # Also limit the number of threads to 96

This script is expected to finish in about 10 minutes. The output contains the accuracy and training time (excluding data loading) of the baselines. The expected output should look like

Got X with shape (58940, 8), y with shape (58940,)
Scaling y to [0,1]
Got X with shape (271617, 8), y with shape (271617,)
Scaling y to [0,1]
sklearn GBDT training time: 1.209s
sklearn GBDT error: 0.0577
=====================================
Got X with shape (460161, 54), y with shape (460161,)
Scaling y to [0,1]
Got X with shape (116203, 54), y with shape (116203,)
Scaling y to [0,1]
sklearn GBDT training time: 21.309s
sklearn GBDT error: 0.1974
=====================================
Got X with shape (5940, 5000), y with shape (5940,)
Scaling y to [0,1]
Got X with shape (1000, 5000), y with shape (1000,)
Scaling y to [0,1]
sklearn GBDT training time: 21.941s
sklearn GBDT error: 0.0600
=====================================
Got X with shape (16347, 8), y with shape (16347,)
Scaling y to [0,1]
Got X with shape (4128, 8), y with shape (4128,)
Scaling y to [0,1]
sklearn GBDT training time: 0.601s
sklearn GBDT error: 0.8558
=====================================
Got X with shape (459078, 90), y with shape (459078,)
Scaling y to [0,1]
Got X with shape (51630, 90), y with shape (51630,)
Scaling y to [0,1]
sklearn GBDT training time: 372.924s
sklearn GBDT error: 0.8819
=====================================
Got X with shape (59476, 8), y with shape (59476,)
Scaling y to [0,1]
Got X with shape (271617, 8), y with shape (271617,)
Scaling y to [0,1]
[10:06:19] WARNING: ../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
XGBoost training time: 9.131s
XGBoost error: 0.0405
=====================================
Got X with shape (464345, 54), y with shape (464345,)
Scaling y to [0,1]
Got X with shape (116203, 54), y with shape (116203,)
Scaling y to [0,1]
[10:06:29] WARNING: ../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
XGBoost training time: 13.075s
XGBoost error: 0.1558
=====================================
Got X with shape (5994, 5000), y with shape (5994,)
Scaling y to [0,1]
Got X with shape (1000, 5000), y with shape (1000,)
Scaling y to [0,1]
[10:06:47] WARNING: ../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
XGBoost training time: 13.260s
XGBoost error: 0.0320
=====================================
Got X with shape (16496, 8), y with shape (16496,)
Scaling y to [0,1]
Got X with shape (4128, 8), y with shape (4128,)
Scaling y to [0,1]
XGBoost training time: 8.966s
XGBoost RMSE: 0.1182
=====================================
Got X with shape (463252, 90), y with shape (463252,)
Scaling y to [0,1]
Got X with shape (51630, 90), y with shape (51630,)
Scaling y to [0,1]
XGBoost training time: 20.309s
XGBoost RMSE: 0.1145
=====================================
Got X with shape (59476, 8), y with shape (59476,)
Scaling y to [0,1]
Got X with shape (271617, 8), y with shape (271617,)
Scaling y to [0,1]
Random Forest training time: 0.278s
Random Forest error: 0.1073
=====================================
Got X with shape (464345, 54), y with shape (464345,)
Scaling y to [0,1]
Got X with shape (116203, 54), y with shape (116203,)
Scaling y to [0,1]
Random Forest training time: 2.656s
Random Forest error: 0.2360
=====================================
Got X with shape (5994, 5000), y with shape (5994,)
Scaling y to [0,1]
Got X with shape (1000, 5000), y with shape (1000,)
Scaling y to [0,1]
Random Forest training time: 0.280s
Random Forest error: 0.0650
=====================================
Got X with shape (16496, 8), y with shape (16496,)
Scaling y to [0,1]
Got X with shape (4128, 8), y with shape (4128,)
Scaling y to [0,1]
Random Forest training time: 0.387s
Random Forest accuracy: 0.1312
=====================================
Got X with shape (463252, 90), y with shape (463252,)
Scaling y to [0,1]
Got X with shape (51630, 90), y with shape (51630,)
Scaling y to [0,1]
Random Forest training time: 229.927s
Random Forest accuracy: 0.1170
Got X with shape (59476, 8), y with shape (59476,)
Scaling y to [0,1]
Got X with shape (271617, 8), y with shape (271617,)
Scaling y to [0,1]
Decision Tree training time: 0.122s
Decision Tree error: 0.0669
=====================================
Got X with shape (464345, 54), y with shape (464345,)
Scaling y to [0,1]
Got X with shape (116203, 54), y with shape (116203,)
Scaling y to [0,1]
Decision Tree training time: 2.289s
Decision Tree error: 0.2225
=====================================
Got X with shape (5994, 5000), y with shape (5994,)
Scaling y to [0,1]
Got X with shape (1000, 5000), y with shape (1000,)
Scaling y to [0,1]
Decision Tree training time: 2.464s
Decision Tree error: 0.0680
=====================================
Got X with shape (16496, 8), y with shape (16496,)
Scaling y to [0,1]
Got X with shape (4128, 8), y with shape (4128,)
Scaling y to [0,1]
Decision Tree training time: 0.058s
Decision Tree accuracy: 0.1382
=====================================
Got X with shape (463252, 90), y with shape (463252,)
Scaling y to [0,1]
Got X with shape (51630, 90), y with shape (51630,)
Scaling y to [0,1]
Decision Tree training time: 35.572s
Decision Tree accuracy: 0.1185

Note that the training time of the baselines in this example is longer than that in Table 6 due to a different CPU. Nonetheless, the speedup of DeltaBoost is still similarly significant, so the conclusion is not affected.
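
For reference, the following minimal sketch shows how a baseline training time excluding data loading can be measured with scikit-learn. It uses synthetic data and placeholder parameters, and is not the repository's baseline.py:

import time
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a LIBSVM dataset; baseline.py loads the real data.
X, y = make_classification(n_samples=20000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(n_estimators=10, max_depth=6)
start = time.time()  # timing starts after data loading, as noted above
model.fit(X_train, y_train)
print(f"sklearn GBDT training time: {time.time() - start:.3f}s")
print(f"sklearn GBDT error: {1 - model.score(X_test, y_test):.4f}")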

Memory Usage (Table 8)

The peak memory usage can be easily observed during training but is hard to record with a script. Since the memory consumption is roughly constant during training, the recommended approach is to manually monitor the peak memory usage of the process in a system monitor, e.g., htop.
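
Alternatively, if GNU time is installed, its verbose mode reports the peak resident memory ("Maximum resident set size") of a single run, for example:

/usr/bin/time -v ./main conf=conf/cadata.conf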

Accuracy (Figure 9)

The accuracy of the baselines is output by the same command used for testing efficiency.

python baseline.py

The accuracy of DeltaBoost has also been recorded in the previous logs.

The default maximum number of trees is 10, which is sufficient to obtain promising accuracy. To test the accuracy of the baselines with 100 trees, run

python baseline.py -t 100

Since each baseline algorithm is run only once, this script is expected to finish in about 10 minutes.

Next, we also need to obtain the results of DeltaBoost with 100 trees. To do so, run

bash test_accuracy.sh 10  # run 10 times

This procedure takes around 1-2 days. For more efficient testing, you can reduce the number of repeats by changing the parameter from 10 to a smaller number. This will result in larger variance in the results.

After obtaining all the results, run

python plot_results.py -acc -t 10   # (10 trees)
python plot_results.py -acc -t 100  # (100 trees)

Two images will be generated in fig/, named

acc-tree10.png
acc-tree100.png

Both images are similar to Fig. 9 in the paper.

Ablation Study (Figure 10, 11)

The ablation study includes six bash scripts.

ablation_bagging.sh
ablation_iteration.sh
ablation_nbins.sh
ablation_quantization.sh
ablation_ratio.sh
ablation_regularization.sh

These scripts can all be run via a single script, test_all_ablation.sh, by

bash test_all_ablation.sh 50  # run 50 times

This combined script takes around 1-2 days. If you want to run the ablation study in a shorter time, you can reduce the number of repeats by changing the parameter from 50 to a smaller number. This will result in larger variance in the results.

To plot all the figures of ablation study into fig/ablation, run

python plot_ablation.py

This plotting process takes around 10 minutes. The major time cost is calculating the Hellinger distance.

Citation

If you find this repository useful in your research, please cite our paper:

@article{wu2023deltaboost,
  author = {Wu, Zhaomin and Zhu, Junhui and Li, Qinbin and He, Bingsheng},
  title = {DeltaBoost: Gradient Boosting Decision Trees with Efficient Machine Unlearning},
  year = {2023},
  issue_date = {June 2023},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  volume = {1},
  number = {2},
  url = {https://doi-org.libproxy1.nus.edu.sg/10.1145/3589313},
  doi = {10.1145/3589313},
  journal = {Proc. ACM Manag. Data},
  month = {jun},
  articleno = {168},
  numpages = {26},
  keywords = {data deletion, gradient boosting decision trees, machine unlearning}
}
