Skip to content

PTransIPs: Identification of SARS-CoV-2 phosphorylation sites based on protein pretrained model embedding and transformer

Notifications You must be signed in to change notification settings

SkywalkerLuke/PTransIPs

 
 

Repository files navigation

PTransIPs

PTransIPs: Identification of SARS-CoV-2 phosphorylation sites based on protein pretrained model embedding and transformer [Paper]

Approach

flowchart

Usage

Step1: Generate two pretrained embedding

(Optional, there are complete embeddings for Y sites in the data folder.)

Input1: Sequence

fasta/csv sequence file

Input2: Sequence pretrained embedding

To generate the sequence pretrained embedding, use pretrained_embedding_generate.py to do the following steps:

$ !pip install torch transformers sentencepiece h5py
$ python model_train_test/pretrained_embedding_generate.py

For detailed guide in this part, please refer to ProtTrans.

Input3: Structure pretrained embedding

First git clone the EMBER2 project, and then move the file pretrained_embedding_generate.py into the EMBER2 folder to use the model for generating the predicted structures for the current sequence.

To generate the sequence pretrained embedding, use pretrained_embedding_generate.py to do the following steps:

$ git clone https://github.com/kWeissenow/EMBER2.git
$ cp model_train_test/structure_embedding_generate.py EMBER2/
$ python EMBER2/structure_embedding_generate.py -i "data/Y-train.fa" -o "EMBER2/output"
$ python EMBER2/structure_embedding_generate.py -i "data/Y-test.fa" -o "EMBER2/output"

For detailed guide in this part, please refer to EMBER2.

Step2: Training PTransIPs Model

You can proceed directly to this step, as the requisite pretrained embeddings of dataset (Y sites) have been uploaded to GitHub.

Run train.py to train the PTransIPs model in PTransIPs_model.py.

$ python model_train_test/train.py

Step3: Evaluate the model performance on independent testset

Download the PTransIPs model

You can proceed directly to this step, if you have downloaded the models and put it into the PTransIPs folder

Run model_performance _evaluate.py to evaluate the model performance on independent testset.

$ python model_train_test/model_performance_evaluate.py

This function will create files PTransIPs_test_prob.npy and PTransIPs_text_result.txt, represent the prediction probability and performance of PTransIPs, respectively.

Step4: Other Visual analysis

You can proceed directly to this step, if you have downloaded the models and put it into the PTransIPs folder

You can also see the results directly in the GitHub.

Run umap.py to generate umap visualization figures.

$ python model_train_test/umap_test_Y.py

Run Generate_tfseq.py files to generate sequence for Two Sample Logo analysis.

$ python model_train_test/Generate_tfseq_Y.py

About

PTransIPs: Identification of SARS-CoV-2 phosphorylation sites based on protein pretrained model embedding and transformer

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 95.4%
  • Jupyter Notebook 4.6%