Copyright © Wei MEI, MLMS™—all rights reserved. 🀤
A simple neural network training algorithm for accurate protein secondary structure prediction (PSSpred). See documentation for more details.
PSSpred (Protein Secondary Structure prediction) is a simple neural network training algorithm for accurate protein secondary structure prediction. It first collects multiple sequence alignments using PSI-BLAST. Amino-acid frequency and log-odds data with Henikoff weights are then used, separately, to train neural networks for secondary structure prediction with the Rumelhart error backpropagation method. The final secondary structure prediction is a combination of 7 neural network predictors trained on different profile data and parameters. The program is freely downloadable on this page.
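To make the "Henikoff weights" step more concrete, here is a minimal sketch of position-based sequence weighting as published by Henikoff and Henikoff (1994), applied to a toy alignment. It is not taken from the PSSpred source: the function names `henikoff_weights` and `weighted_frequencies` are hypothetical, gaps are simply skipped (an assumption), and PSSpred's own profile construction may differ in its details.

```python
from collections import Counter


def henikoff_weights(alignment):
    """Position-based sequence weights for equally long aligned sequences."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    weights = [0.0] * len(alignment)
    for col in range(length):
        column = [seq[col] for seq in alignment]
        # Count each residue type in this column; gaps are ignored (assumption).
        counts = Counter(res for res in column if res != "-")
        k = len(counts)  # number of distinct residue types in the column
        for i, res in enumerate(column):
            if res != "-":
                # Residues shared by many sequences contribute less weight.
                weights[i] += 1.0 / (k * counts[res])

    total = sum(weights)
    return [w / total for w in weights] if total else weights


def weighted_frequencies(alignment, weights, col):
    """Weighted residue frequencies for one alignment column."""
    freqs = {}
    for seq, w in zip(alignment, weights):
        res = seq[col]
        if res != "-":
            freqs[res] = freqs.get(res, 0.0) + w
    return freqs


msa = ["GYVGS", "GFDGF", "GYDGF", "GYQGG"]
w = henikoff_weights(msa)
print([round(x, 3) for x in w])
print(weighted_frequencies(msa, w, 1))
```

In this toy alignment the third sequence shares the most residues with the others, so it receives the smallest weight; the weighted column frequencies are the kind of profile data a predictor could be trained on.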
We have a community chat at Gitter. Feel free to ask us anything there. We have a very welcoming and helpful community.
No installation is needed!
Simply fork this project and edit the file `seq.fasta` (file path: src/PSSpred_v4/seq.fasta), written in `FASTA Format`, in your own repository. The GitHub workflow then produces the outputs in about 8 minutes, and you can download them via the artifacts link. The output contains two result files: `seq.dat` (the PSSpred prediction in I-TASSER format) and `seq.dat.ss` (the original confidence file). If you want to check more results, edit the GitHub workflow file PSSPred.yml:
name: PSSpred
on:
  push:
    branches:
      - master
jobs:
  build_docs_and_deploy:
    runs-on: ubuntu-latest
    name: running PSSpred
    steps:
      - name: Checkout
        uses: actions/checkout@master
      - name: running perl
        run: |
          echo "Initializing the program....................."
          echo "---------------------------------------------"
          cd ../
          mkdir output
          echo "output file already created!"
          echo "---------------------------------------------"
          cd PSSpred/
          cd src/
          mkdir nr
          cd nr/
          wget -O nr.tar.gz https://zhanggroup.org/PSSpred/nr.tar.gz
          tar -xvf nr.tar.gz
          echo "nr.tar.gz already unpacked!"
          echo "Show the path of this file: "
          pwd
          cd ../
          cd PSSpred_v4/
          ./PSSpred.pl seq.fasta
          cp seq.dat /home/runner/work/PSSpred/output/
          cp seq.dat.ss /home/runner/work/PSSpred/output/
          cp blast.out /home/runner/work/PSSpred/output/
          cd /home/runner/work/PSSpred/output/
          ls
          pwd
      - uses: actions/upload-artifact@v2
        with:
          name: output results
          path: /home/runner/work/PSSpred/output/
Not familiar with `FASTA format`? Don't panic, this project is very user-friendly. You can type the following protein sequence:
MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRVKHLKTEAEMKASEDLKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGNFGADAQLELGAMNKAFRKDIAAKYKELGYQG
into `seq_1.txt` and upload it to the directory src/PSSpred_v4/. Wait about 8 minutes (check the AppVeyor build status: pending, failing, or passing), then download the output files when the job is done. The AppVeyor configuration (appveyor.yml) is:
image: Ubuntu
install:
  - sh: cd src/
  - sh: mkdir nr
  - sh: cd nr/
  - sh: wget -O nr.tar.gz https://zhanggroup.org/PSSpred/nr.tar.gz
  - sh: tar -xvf nr.tar.gz
  - sh: cd ../PSSpred_v4/
  - sh: ./PSSpred.pl seq_1.txt
  - sh: pwd
# Skip project specific build phase.
build: off
test_script:
  - "ls"
  - "pwd"
artifacts:
  - path: src\PSSpred_v4\seq.dat
    name: seq.dat
  - path: src\PSSpred_v4\seq.dat.ss
    name: seq.dat.ss
  - path: src\PSSpred_v4\protein.fasta
    name: protein.fasta
If you prefer CircleCI to AppVeyor, that is fine. Just edit `seq_2.txt` (file path: src/PSSpred_v4/seq_2.txt) and commit; you can put in any protein sequence and generate the secondary structure prediction on your own. If you upload an input file with a different name, also change `./PSSpred.pl seq_2.txt` to `./PSSpred.pl XXX.txt` in the following `config.yml` file.
version: 2
jobs:
  build: # name of your job
    machine: # executor type
      image: ubuntu-1604:201903-01 # recommended linux image - includes Ubuntu 16.04, docker 18.09.3, docker-compose 1.23.1
    steps:
      - checkout
      - run: |
          cd src/
          mkdir nr
          cd nr/
          wget -O nr.tar.gz https://zhanggroup.org/PSSpred/nr.tar.gz
          tar -zxvf nr.tar.gz
          echo "nr.tar.gz already unpacked!"
          echo "Show the path of this file:"
          pwd
          cd ../
          cd PSSpred_v4/
          ./PSSpred.pl seq_2.txt
          ls
      - store_artifacts:
          path: src/PSSpred_v4/seq.dat
          destination: seq.dat
      - store_artifacts:
          path: src/PSSpred_v4/seq.dat.ss
          destination: seq.dat.ss
      - store_artifacts:
          path: src/PSSpred_v4/protein.fasta
          destination: protein.fasta
To get the git version do
$ git clone https://github.com/nickcafferry/PSSpred.git
Or simply clone the repository using the official GitHub CLI
$ gh repo clone nickcafferry/PSSpred
You can also click here to download PSSpred package version 4, and v3, v2, v1. Also, you can download the whole package by clicking source code.zip or source code.tar.gz.
Simply edit the file `seq.fasta`, `seq_1.txt`, or `seq_2.txt`, or upload your own sequence file and change the workflow file (PSSPred.yml, appveyor.yml, or config.yml) correspondingly.
Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions:
- lower-case letters are accepted and are mapped into upper-case;
- a single hyphen or dash can be used to represent a gap of indeterminate length;
- in amino acid sequences, U and * are acceptable letters (see below).
- any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue).
The nucleic acid codes are:
A --> adenosine           M --> A C (amino)
C --> cytidine            S --> G C (strong)
G --> guanine             W --> A T (weak)
T --> thymidine           B --> G T C
U --> uridine             D --> G A T
R --> G A (purine)        H --> A C T
Y --> T C (pyrimidine)    V --> G C A
K --> G T (keto)          N --> A G C T (any)
                          -  gap of indeterminate length
The accepted amino acid codes are:
A  ALA  alanine                    P  PRO  proline
B  ASX  aspartate or asparagine    Q  GLN  glutamine
C  CYS  cysteine                   R  ARG  arginine
D  ASP  aspartate                  S  SER  serine
E  GLU  glutamate                  T  THR  threonine
F  PHE  phenylalanine              U       selenocysteine
G  GLY  glycine                    V  VAL  valine
H  HIS  histidine                  W  TRP  tryptophan
I  ILE  isoleucine                 Y  TYR  tyrosine
K  LYS  lysine                     Z  GLX  glutamate or glutamine
L  LEU  leucine                    X       any
M  MET  methionine                 *       translation stop
N  ASN  asparagine                 -       gap of indeterminate length
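The rules above can be checked before a sequence is committed. The following is a minimal sketch, not part of PSSpred: `normalize_protein_sequence` and `AMINO_ACID_CODES` are hypothetical names, the accepted alphabet is copied from the amino acid table above, and whether PSSpred.pl itself handles every one of these symbols is not confirmed here.

```python
# Letters from the amino acid table above, plus '*' (stop) and '-' (gap).
AMINO_ACID_CODES = set("ABCDEFGHIKLMNPQRSTUVWXYZ*-")


def normalize_protein_sequence(raw):
    """Upper-case a protein sequence, drop whitespace, reject unknown symbols."""
    # Lower-case letters are accepted and mapped to upper-case.
    seq = "".join(raw.split()).upper()
    # Digits and any letter outside the table (for example J or O) are rejected,
    # so the caller can remove them or replace them with X.
    bad = sorted({ch for ch in seq if ch not in AMINO_ACID_CODES})
    if bad:
        raise ValueError("unsupported symbols: " + "".join(bad))
    return seq


print(normalize_protein_sequence("mvlsegewql\nvlhvwakvea"))
```

Raising an error instead of silently dropping digits is a design choice of this sketch: it forces an explicit decision between removing the character and replacing it with `X`.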
seq.txt is a FASTA file in the current directory (the only input file). If you know the FASTA format, you can always use it.
Output files: seq.dat, seq.dat.ss
PSSpred.pl consists of three steps:
a. prepare and run PSI-BLAST
b. prepare mtx, pssm.txt, profw, freqccw, freqccwG
c. run PSSpred and generate output files
Input file: seq_1.txt(src/PSSpred_v4/seq_1.txt)
MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLS
EARQHLKDGTCGLVEVEKGVLPQLEQPYVFIKRSDARTAP
HGHVMVELVAELEGIQYGRSGETLGVLVPHVGEIPVAYRK
VLLRKNGNKGAGGHSYGADLKSFDLGDELGTDPYEDFQEN
WNTKHSSGVTRELMRELNGG
Output file: seq.dat (first 20 residues)
1 MET 1 9 # the first column is the residue number
2 GLU 1 9 # the second column is the three-letter amino acid code (see `About Protein Sequence` for more details)
3 SER 1 8 # the third column is the secondary structure code: 1<->coil, 2<->helix, 4<->strand
4 LEU 1 8 # the fourth column is the confidence score: 1-9
5 VAL 1 8
6 PRO 1 8
7 GLY 1 8
8 PHE 1 7
9 ASN 1 6
10 GLU 1 3
11 LYS 1 1
12 THR 4 3
13 HIS 4 6
14 VAL 4 8
15 GLN 4 9
16 LEU 4 9
17 SER 4 8
18 LEU 4 6
19 PRO 4 5
20 VAL 4 5
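For downstream scripts, the `seq.dat` listing can be read with a few lines of Python. This is a minimal sketch under stated assumptions: `parse_seq_dat` and `SS_CODES` are hypothetical names, trailing `#` comments are stripped, and the numeric codes follow the I-TASSER convention 1 = coil, 2 = helix, 4 = strand, which agrees with the `C` and `E` labels given to the same residues in `seq.dat.ss`.

```python
# I-TASSER style numeric codes mapped to one-letter states (assumption, see text).
SS_CODES = {1: "C", 2: "H", 4: "E"}


def parse_seq_dat(text):
    """Return (index, residue, state, confidence) tuples from seq.dat text."""
    records = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop explanatory comments
        if not line:
            continue
        index, residue, code, confidence = line.split()[:4]
        records.append((int(index), residue, SS_CODES[int(code)], int(confidence)))
    return records


sample = """\
10 GLU 1 3
11 LYS 1 1 # low confidence near the coil/strand boundary
12 THR 4 3
13 HIS 4 6
"""
records = parse_seq_dat(sample)
print("".join(state for _, _, state, _ in records))
```

Joining the third field of every record gives a one-letter-per-residue string that is easy to compare with other predictors.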
Output file: seq.dat.ss (first 20 residues)
180 coil helix beta # 180: the total number of residues in the sequence
# Protein secondary structure: coil, helix, beta
1 M C 0.958 0.024 0.012 # the first column: residue number
2 E C 0.900 0.043 0.046 # the second column: input sequence (one-letter code)
3 S C 0.871 0.072 0.061 # the third column: the most probable secondary structure (C-coil, H-helix, E-strand)
4 L C 0.872 0.064 0.067 # columns 4-6: probabilities of coil, helix, and beta, in the header order
5 V C 0.891 0.053 0.062
6 P C 0.902 0.042 0.061
7 G C 0.886 0.046 0.070
8 F C 0.808 0.086 0.096
9 N C 0.715 0.124 0.154
10 E C 0.620 0.124 0.272
11 K C 0.546 0.053 0.416
12 T E 0.364 0.013 0.636
13 H E 0.220 0.007 0.782
14 V E 0.105 0.005 0.902
15 Q E 0.069 0.004 0.936
16 L E 0.076 0.005 0.928
17 S E 0.112 0.005 0.895
18 L E 0.204 0.005 0.800
19 P E 0.230 0.008 0.760
20 V E 0.229 0.012 0.760
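The `seq.dat.ss` listing carries the per-state probabilities, so it can be parsed into records and cross-checked against the label in the third column. The sketch below is illustrative only: `parse_seq_dat_ss`, `argmax_state`, and `LABEL_TO_CODE` are hypothetical names, and the layout (one header line, then six whitespace-separated columns) is assumed from the excerpt above.

```python
# Header labels mapped to the one-letter states used in column three.
LABEL_TO_CODE = {"coil": "C", "helix": "H", "beta": "E"}


def parse_seq_dat_ss(text):
    """Return (total_residues, records) parsed from seq.dat.ss text."""
    rows = [line.split("#", 1)[0].split() for line in text.splitlines()]
    rows = [row for row in rows if row]  # skip blank and comment-only lines
    header, body = rows[0], rows[1:]
    total = int(header[0])
    labels = header[1:4]  # expected: coil, helix, beta
    records = []
    for index, aa, ss, *probs in body:
        probabilities = dict(zip(labels, (float(p) for p in probs[:3])))
        records.append({"index": int(index), "aa": aa, "ss": ss,
                        "probs": probabilities})
    return total, records


def argmax_state(record):
    """One-letter state with the highest probability in a record."""
    best = max(record["probs"], key=record["probs"].get)
    return LABEL_TO_CODE[best]


sample = """\
180 coil helix beta
# Protein secondary structure: coil, helix, beta
11 K C 0.546 0.053 0.416
12 T E 0.364 0.013 0.636
"""
total, records = parse_seq_dat_ss(sample)
for rec in records:
    print(rec["index"], rec["aa"], rec["ss"], argmax_state(rec))
```

For the rows shown in the excerpt, the label in the third column agrees with the state that has the largest probability, which is a useful sanity check when post-processing the file.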
FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which base pairs or amino acids are represented using single-letter codes. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length.
An example sequence in FASTA format is:
>gi|186681228|ref|YP_001864424.1| phycoerythrobilin:ferredoxin oxidoreductase
MNSERSDVTLYQPFLDYAIAYMRSRLDLEPYPIPTGFESNSAVVGKGKNQEEVVTTSYAFQTAKLRQIRA
AHVQGGNSLQVLNFVIFPHLNYDLPFFGADLVTLPGGHLIALDMQPLFRDDSAYQAKYTEPILPIFHAHQ
QHLSWGGDFPEEAQPFFSPAFLWTRPQETAVVETQVFAAFKDYLKAYLDFVEQAEAVTDSQNLVAIKQAQ
LRYLRYRAEKDPARGMFKRFYGAEWTEEYIHGFLFDLERKLTVVK
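As a small illustration of the format just described, the sketch below reads FASTA text into (description, sequence) pairs and writes a sequence back with wrapped lines. The names `read_fasta` and `format_fasta` are hypothetical; accepting a bare sequence without a ">" line is an assumption made here so that plain files such as `seq_1.txt` can be read by the same code.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            if header is None:
                header = ""  # bare sequence without a description line
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(header, sequence, width=70):
    """Write one record, wrapping sequence lines to stay under 80 characters."""
    lines = [">" + header]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


text = format_fasta("example protein", "MNSERSDVTLYQPFLDYAIAYMRSRLDLEPYPIPTGFESN" * 3)
print(text)
print(read_fasta(text))
```

The default width of 70 follows the recommendation that every line be shorter than 80 characters.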
This project welcomes contributions and suggestions. Most contributions require you to agree to the MIT License (MIT LIC), declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit Code of Conduct.
Renxiang Yan, Dong Xu, Jianyi Yang, Sara Walker, Yang Zhang. A comparative assessment and analysis of 20 representative sequence alignment methods for protein structure prediction. Scientific Reports, 3: 2619 (2013).