🙋 Please let us know if you find a mistake or have any suggestions!
🌟 If you find this resource helpful, please consider starring this repository and citing our research:
Sicheng Feng, Siyu Li, Luonan Chen, Shengquan Chen. Unveiling potential threats: backdoor attacks in single-cell pretrained models. 2024.
We use Python 3.9 from Anaconda. We provide two conda environments for the experiments: base.yml and geneformer.yml. base.yml is for the scGPT and scBERT experiments, while geneformer.yml is for the GeneFormer experiments.
To install all dependencies:
conda env create -f base.yml
# or
conda env create -f geneformer.yml
- Example datasets from [scGPT]
- Example datasets from [GeneFormer]
- Datasets from [Tabula Sapiens Single-Cell Dataset]
Place the downloaded contents under Yourpath4Dataset to reproduce the experiments.
You can download the pretrained models from [scGPT] (whole-human), [scBERT] and [GeneFormer], then place the downloaded contents under Yourpath4PretrainedModels to reproduce the experiments.
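Before launching a long run, it can help to confirm that both folders are in place. The snippet below is a minimal sketch, not part of this repository: the helper name missing_or_empty is hypothetical, and the two paths are the placeholders used above, which you should replace with your real locations.

```python
from pathlib import Path

# Placeholders taken from the text above; replace them with your real locations.
DATASET_DIR = Path("Yourpath4Dataset")
PRETRAINED_DIR = Path("Yourpath4PretrainedModels")


def missing_or_empty(paths):
    """Return the paths that are not directories or that contain no entries."""
    problems = []
    for path in paths:
        path = Path(path)
        # is_dir() is False for missing paths, so iterdir() only runs on real folders.
        if not path.is_dir() or not any(path.iterdir()):
            problems.append(path)
    return problems


if __name__ == "__main__":
    problems = missing_or_empty([DATASET_DIR, PRETRAINED_DIR])
    for path in problems:
        print(f"Missing or empty: {path}")
    if not problems:
        print("Dataset and pretrained-model folders look ready.")
```

The check only looks at whether each folder exists and has content; it does not validate file names or formats expected by the individual experiment scripts.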
- Download the datasets and pretrained models, place them under the right paths, and adjust the path parameters in the scripts.
- Then you can reproduce the experiments with the provided scripts. For example, you can evaluate on the Human Pancreas dataset with:
nohup ./run.sh & # for scGPT_Exp
The commands to run the experiments are as follows (run each one from the folder that contains the corresponding script):
nohup ./run.sh & # for scGPT_Exp
nohup ./run.sh & # for scBERT_Exp
nohup ./run.sh & # for GeneFormer_Exp
...
# or you can run the experiments in tmux or screen
./run_diff_batch.sh # for scGPT_Exp
./run_diff_feature.sh # for scGPT_Exp
...
The poison-related code is in poison_utils.py or poison_trigger.py. You can find them in each experiment's folder.
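For readers new to backdoor attacks, the general idea of data poisoning can be sketched as follows. This is an illustrative toy only and is not the code in poison_utils.py or poison_trigger.py: the function name poison_cells, the fixed-value trigger, and every parameter are assumptions made for this example, and the trigger design used in the experiments may differ. It uses plain Python lists so that it runs without extra dependencies.

```python
import random


def poison_cells(expression, labels, trigger_genes, trigger_value,
                 target_label, poison_rate, seed=0):
    """Return poisoned copies of (expression, labels) and the poisoned row indices.

    expression: list of per-cell lists of gene expression values.
    trigger_genes: column indices overwritten with trigger_value
                   (a hypothetical trigger pattern chosen for illustration).
    poison_rate: fraction of cells to poison, between 0 and 1.
    """
    rng = random.Random(seed)
    n_cells = len(expression)
    n_poison = int(round(poison_rate * n_cells))
    poisoned_idx = sorted(rng.sample(range(n_cells), n_poison))

    # Work on copies so the clean data stays untouched.
    new_expression = [list(row) for row in expression]
    new_labels = list(labels)
    for i in poisoned_idx:
        for g in trigger_genes:
            new_expression[i][g] = trigger_value  # stamp the trigger pattern
        new_labels[i] = target_label  # relabel to the attacker's target class
    return new_expression, new_labels, poisoned_idx
```

A model fine-tuned on such data can learn to associate the trigger pattern with the target label while behaving normally on clean cells, which is why the clean and poisoned test sets are usually evaluated separately.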
The folder tree is as follows:
├── LICENSE
├── README.md -- introduction about the project
├── figures -- figures for display
│   └── fig1.png
├── requirements.txt -- requirements for installation
├── scGPT_Exp
│   ├── test -- the attack pipeline
│   │   ├── run.sh
│   │   ├── run_diff_batch.sh -- explore the impact of batch effects
│   │   ├── run_diff_feature.sh -- explore the impact of feature selection
│   │   ├── run_3datasets.sh
│   │   └── scBackdoor.py
│   └── utils -- the scGPT items
│       ├── detect_tools.py
│       ├── poison_trigger.py
│       ├── preprocess.py
│       ├── print_tools.py
│       └── tools.py
├── GeneFormer_Exp
│   ├── geneformer -- the GeneFormer items
│   │   ├── __init__.py
│   │   ├── classifier.py
│   │   ├── classifier_utils.py
│   │   ├── collator_for_classification.py
│   │   ├── emb_extractor.py
│   │   ├── evaluation_utils.py
│   │   ├── gene_median_dictionary.pkl
│   │   ├── gene_name_id_dict.pkl
│   │   ├── in_silico_perturber.py
│   │   ├── in_silico_perturber_stats.py
│   │   ├── perturber_utils.py
│   │   ├── poison_utils.py
│   │   ├── pretrainer.py
│   │   ├── token_dictionary.pkl
│   │   └── tokenizer.py
│   ├── run.sh -- the attack pipeline
│   └── geneformer_scBackdoor.py
└── scBERT_Exp
    ├── attn_sum_save.py
    ├── finetune.py
    ├── lr_baseline_crossorgan.py
    ├── performer_pytorch -- the scBERT items
    │   ├── __init__.py
    │   ├── performer_pytorch.py
    │   └── reversible.py
    ├── poison_utils.py
    ├── predict.py
    ├── preprocess.py
    ├── pretrain.py
    ├── run.sh -- the attack pipeline
    ├── run_3datasets.sh
    └── utils.py
We sincerely thank the authors of the following open-source projects:
- scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI, Nature Methods 2024. [GitHub Repo]
- Transfer learning enables predictions in network biology, Nature 2023. [Huggingface Repo]
- scBERT as a Large-scale Pretrained Deep Language Model for Cell Type Annotation of Single-cell RNA-seq Data, Nature Machine Intelligence 2022. [GitHub Repo]