PrivacyProtectedArtificialGenomes

Generate artificial human genomes using a GAN with privacy-preserving techniques (gradient clipping).

Installation and Preparation

  1. Clone this repository

    git clone https://github.com/cBioLab/PrivacyProtectedArtificialGenomes
    cd PrivacyProtectedArtificialGenomes
    
  2. Create conda environment

    conda create -n ppag python=3.9
    conda activate ppag
    pip install -r requirements.txt
    
  3. Unzip data.zip

    unzip GAN_2000/data.zip
    unzip GAN_805_random/data.zip
    unzip GAN_805_EAS/data.zip
    
  4. Integrate separately stored model information in sample directories (2000 SNP only)

    Before these commands are run, the following components of each sample model are stored separately:

    • Generator
    • Discriminator
    • Optimizer of the generator
    • Optimizer of the discriminator

    Combine them with:

    python model_concat.py baseline ./GAN_2000
    python model_concat.py clipping ./GAN_2000
    python model_concat.py dp ./GAN_2000
    
  5. Run the experiments

    The following experiments are available:

    • Membership inference attacks
    • Genotype imputation
    • Model training
    • Generation of artificial genomes from trained models

    The following sections describe each experiment.

Note

The commands below are also provided in the scripts directory. You can either run the commands directly or run the corresponding sh file.

Membership Inference Attacks

Run membership inference attacks against a trained model.

Specify the following arguments:

  • model_dir: Path to the directory of the target model, relative to work_dir.
  • model_name: File name of the model in model_dir.
  • model_type: Type of the target model. Choose from [Baseline, Clipping, DP].
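
As background, the idea behind a membership inference attack can be illustrated with a simple score-threshold test. The sketch below is a generic illustration and not this repository's attack code: it assumes each candidate genome has already been given a numeric score (for example, by a discriminator), and the function name, the toy scores, and the threshold are all hypothetical.

```python
def threshold_attack(member_scores, nonmember_scores, threshold):
    """Guess 'member' when score > threshold and return balanced accuracy.

    Hypothetical helper for illustration only; not part of main.py.
    """
    # True positives: training-set members correctly flagged as members.
    true_pos = sum(score > threshold for score in member_scores)
    # True negatives: held-out samples correctly flagged as non-members.
    true_neg = sum(score <= threshold for score in nonmember_scores)
    tpr = true_pos / len(member_scores)
    tnr = true_neg / len(nonmember_scores)
    return (tpr + tnr) / 2


# Toy scores (made up): members tend to score higher than non-members.
members = [0.9, 0.8, 0.7, 0.4]
nonmembers = [0.6, 0.3, 0.2, 0.1]
print(threshold_attack(members, nonmembers, threshold=0.5))
```

A balanced accuracy near 0.5 means the attacker cannot separate members from non-members; values well above 0.5 indicate leakage about the training set.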

2000 SNP model

The models trained on the 2000 SNP dataset use a dropout layer, so you must also specify the following argument.

  • dropout: Dropout rate. We used 0.1.

Gradient Clipping Model / Differential Privacy Model

When targeting a gradient clipping model or a differential privacy model, you must also specify the following arguments.

  • apply_dp: Flag indicating that Opacus is used.
  • sigma: The amount of noise added during training. Use 0 for the clipping model and 0.04 for the differential privacy model.
  • c: The clipping value used during training. Use 0.5 for both models.
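
To make the roles of sigma and c concrete, here is a minimal sketch of per-sample gradient clipping followed by Gaussian noise, in the style of DP-SGD. It is plain Python for illustration, not the training code of this repository (which relies on Opacus). The function names are hypothetical, and scaling the noise standard deviation as sigma * c is an assumption about how the two values interact.

```python
import math
import random


def clip(grad, c):
    """Rescale grad so that its L2 norm is at most c."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, c / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_sample_grads, c, sigma, rng):
    """Clip each per-sample gradient, sum, add Gaussian noise, then average."""
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for grad in per_sample_grads:
        for i, g in enumerate(clip(grad, c)):
            total[i] += g
    # Assumed noise scale: standard deviation sigma * c.
    # With sigma = 0 no noise is added, leaving clipping only.
    noisy = [t + rng.gauss(0.0, sigma * c) for t in total]
    return [n / len(per_sample_grads) for n in noisy]


grads = [[3.0, 4.0], [0.3, 0.4]]  # toy per-sample gradients (made up)
rng = random.Random(0)
# Clipping only (sigma = 0): the large gradient is scaled down to norm 0.5.
print(private_mean_gradient(grads, c=0.5, sigma=0.0, rng=rng))
# Clipping plus noise (sigma = 0.04).
print(private_mean_gradient(grads, c=0.5, sigma=0.04, rng=rng))
```

In this sketch the clipping model and the differential privacy model differ only in sigma, which matches the argument values listed above.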

Examples

Target the baseline model (2000 SNP dataset).

python main.py --work_dir ./GAN_2000 --wb_attack --model_dir models/samples/baseline --model_name baseline.pt --model_type Baseline --dropout 0.1

Target the clipping model (2000 SNP dataset).

python main.py --work_dir ./GAN_2000 --wb_attack --model_dir models/samples/clipping --model_name clipping.pt --model_type Clipping --dropout 0.1 --apply_dp --sigma 0 -c 0.5

Target the differential privacy model (2000 SNP dataset).

python main.py --work_dir ./GAN_2000 --wb_attack --model_dir models/samples/dp --model_name dp.pt --model_type DP --dropout 0.1 --apply_dp --sigma 0.04 -c 0.5

Target the differential privacy model (805 SNP dataset, random split).

python main.py --work_dir ./GAN_805_random --wb_attack --model_dir models/samples/dp --model_name dp.pt --model_type DP --apply_dp --sigma 0.04 -c 0.5
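
The argument rules above can be summarized in a small helper that assembles the attack command for each sample model. This is a convenience sketch, not part of the repository: the function name and the SIGMA table are hypothetical, while every flag, path, and value is taken from the argument descriptions in this section. It only prints the commands and does not run them.

```python
import shlex

# sigma per model type, as listed in the argument descriptions above.
SIGMA = {"Clipping": "0", "DP": "0.04"}


def attack_command(work_dir, model_type, dropout=None):
    """Build the membership inference command for a sample model (hypothetical helper)."""
    name = model_type.lower()  # Baseline -> baseline, Clipping -> clipping, DP -> dp
    cmd = [
        "python", "main.py",
        "--work_dir", work_dir,
        "--wb_attack",
        "--model_dir", f"models/samples/{name}",
        "--model_name", f"{name}.pt",
        "--model_type", model_type,
    ]
    # Only the 2000 SNP models were trained with dropout.
    if dropout is not None:
        cmd += ["--dropout", str(dropout)]
    # Clipping and DP models need the Opacus-related arguments.
    if model_type in SIGMA:
        cmd += ["--apply_dp", "--sigma", SIGMA[model_type], "-c", "0.5"]
    return cmd


for model_type in ("Baseline", "Clipping", "DP"):
    print(shlex.join(attack_command("./GAN_2000", model_type, dropout=0.1)))
```

To actually run a command, the returned list could be passed to subprocess.run from the repository root.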

Genotype Imputation

You can test genotype imputation using IMPUTE2.

Note

First, please download IMPUTE2 from the official website and place the executable file in the modules/imputation directory.

Specify the following arguments:

  • ref_type: The type of dataset used for reference. Choose from [1KG, GAN, Clipping, DP].
  • ref_haps_size: The number of haplotypes used for the reference.

When using artificial genomes as the reference, also specify:

  • model_dir: Path to the directory of the target model, relative to work_dir.
  • ag_file_name: The file name of the artificial genome in the model_dir. Zip files are also supported.

Examples

Use real data with 4000 haplotypes as a reference. (1KG_4000)

python main.py --work_dir ./GAN_2000/ --imputation --ref_type 1KG --ref_haps_size 4000

Use artificial data with 4000 haplotypes generated by baseline model as a reference. (Baseline_4000)

python main.py --work_dir ./GAN_2000/ --imputation --model_dir models/samples/baseline --ag_file_name 16000_output.hapt --ref_type GAN --ref_haps_size 4000

Use artificial data with 20000 haplotypes generated by clipping model as a reference. (Clipping_20000)

python main.py --work_dir ./GAN_2000/ --imputation --model_dir models/samples/clipping --ag_file_name 16000_output_regen.hapt.zip --ref_type Clipping --ref_haps_size 20000

Use artificial data with 40000 haplotypes generated by differential privacy model as a reference. (DP_40000)

python main.py --work_dir ./GAN_2000/ --imputation --model_dir models/samples/dp --ag_file_name 16000_output_regen.hapt.zip --ref_type DP --ref_haps_size 40000

Training

Sample models are provided, so you can run the experiments without training a model yourself. You can also train a model on the dataset.

Specify the training parameters as arguments. See main.py for a description of each parameter.

Example

Create a baseline model using the 805 SNP random-split dataset.

python main.py --train --work_dir ./GAN_805_random --out_dir models/baseline  --g_learn 0.0001 --d_learn 0.0008 --epochs 16000 --save_that 1000 --norm None --ag_size 4000

Create a gradient clipping model using the 805 SNP dataset that excludes East Asians.

python main.py --train --work_dir ./GAN_805_EAS --out_dir models/clip  --g_learn 0.0001 --d_learn 0.0008 --epochs 16000 --save_that 1000 --norm None --ag_size 4000 --apply_dp --sigma 0 -c 0.5

Create a differential privacy model using the 2000 SNP dataset.

python main.py --train --work_dir ./GAN_2000 --out_dir models/dp  --g_learn 0.00008 --d_learn 0.00064 --epochs 16000 --save_that 1000 --dropout 0.1 --norm None --ag_size 4000 --apply_dp --sigma 0.04 -c 0.5

Regenerate

Generate new artificial genomes from a model that has already been trained.

Specify the following arguments:

  • model_dir: Path to the directory of the target model, relative to work_dir.
  • model_name: File name of the model in model_dir.
  • ag_size: Number of artificial genomes to be generated.

Example

Generate 10000 haplotypes from the baseline model trained on the 805 SNP random-split dataset.

python main.py --regenerate --work_dir ./GAN_805_random --model_dir models/samples/baseline --model_name baseline.pt --ag_size 10000

Generate 10000 haplotypes from the gradient clipping model trained on the 2000 SNP dataset.

python main.py --regenerate --work_dir ./GAN_2000 --model_dir models/samples/clipping --model_name clipping.pt --ag_size 10000
