A Python package that generates 21 numerically encoded feature representations for protein sequences based on their physicochemical properties.
Note: ifeatpro is based on iFeature, a Python-based toolkit available at link. Here, we have packaged 21 alignment-free feature encoding functions available in iFeature into a pip-installable module to make protein feature encoding easier to use and more accessible.
pip install ifeatpro
from ifeatpro.features import get_feature, get_all_features
import random
AA = "ACDEFGHIKLMNPQRSTVWY"
sequences = ["".join([random.choice(AA) for _ in range(150)]) for _ in range(5)]
!mkdir -p ifeatpro_data
fasta_file = "ifeatpro_data/seq.fa"
with open(fasta_file, 'w') as f:
    for i, seq in enumerate(sequences):
        f.write(f">enz_{i}")
        f.write("\n")
        f.write(seq)
        f.write("\n")
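Before running any encoder, it can be worth confirming that the FASTA file was written as expected. The following is a minimal sketch of a reader and is not part of ifeatpro. The helper name `read_fasta` is hypothetical, and the sketch assumes every record starts with a `>` header line.

```python
def read_fasta(path):
    """Parse a FASTA file into a {header: sequence} dict (hypothetical helper)."""
    records = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # a new record begins; drop the leading ">"
                header = line[1:]
                records[header] = ""
            elif header is not None:
                # a sequence may span several lines; concatenate them
                records[header] += line
    return records
```

For the file written above, `read_fasta(fasta_file)` should return five records, `enz_0` to `enz_4`, each 150 residues long.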
ifeatpro provides 21 feature encodings, each of which numerically encodes protein sequences based on their physicochemical properties. They are:
- aac
- apaac
- cksaagp
- cksaap
- ctdc
- ctdd
- ctdt
- ctriad
- dde
- dpc
- gaac
- gdpc
- geary
- gtpc
- ksctriad
- moran
- nmbroto
- paac
- qsorder
- socnumber
- tpc
Using the get_all_features function, a user can generate all 21 physicochemical feature encodings provided by ifeatpro. The first argument is the FASTA file containing the protein sequences; the second argument is the output directory where the encodings are stored as CSV files.
get_all_features(fasta_file, "./ifeatpro_data/")
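After the call above, a quick check of what landed in the output directory can catch an empty or partial run. The sketch below uses only the standard library. The helper name `summarize_csv_outputs` is hypothetical, and it assumes the outputs carry a `.csv` extension; this README does not specify the file names or column layout, so the sketch reports shapes only.

```python
import csv
from pathlib import Path


def summarize_csv_outputs(out_dir):
    """Return {file name: (n_rows, n_columns)} for each CSV in out_dir.

    Hypothetical helper: assumes outputs end in ".csv" and makes no
    assumption about headers or column meaning.
    """
    summary = {}
    for path in sorted(Path(out_dir).glob("*.csv")):
        with open(path, newline="") as handle:
            rows = list(csv.reader(handle))
        # column count is taken from the first row, if there is one
        n_cols = len(rows[0]) if rows else 0
        summary[path.name] = (len(rows), n_cols)
    return summary
```

Calling `summarize_csv_outputs("./ifeatpro_data/")` then gives one entry per CSV file found, which can be compared against the list of encodings above.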
A user can also generate any single one of the 21 feature encodings available in ifeatpro using the get_feature function. The function takes the FASTA file as the first argument, the feature encoding type as the second argument, and the output directory where the file will be stored as the third argument. For example, to create the aac feature encoding from the fasta_file created above and store it in the ifeatpro_data directory, run the following command:
get_feature(fasta_file, "aac", "ifeatpro_data/")
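To give a sense of what an encoding such as aac (amino acid composition) represents, here is a minimal pure-Python sketch that computes the fraction of each of the 20 standard amino acids in a sequence. It is an illustration only and not ifeatpro's implementation. The function name `amino_acid_composition` and the alphabetical residue ordering are assumptions, and the file written by `get_feature` may be laid out differently.

```python
from collections import Counter

# the 20 standard amino acids, in alphabetical one-letter order (assumed ordering)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return 20 fractions, one per standard amino acid (illustrative sketch)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    # normalize each residue count by the sequence length
    return [counts[aa] / len(sequence) for aa in AMINO_ACIDS]
```

Each sequence maps to a fixed-length vector of 20 values regardless of its length, which is what makes alignment-free encodings of this kind convenient as input to machine learning models.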
For a detailed description of the feature extraction techniques used in ifeatpro, please refer to the Supplementary Document of the paper (link to be added soon).
Other modules that can be used to generate numerical encoding of protein sequences are: