
Protein target prediction using random forests and reliability-density neighbourhood analysis

lhm30/PIDGINv3

Prediction IncluDinG INactivity (PIDGIN) Version 3

Author : Lewis Mervin, lewis.mervin@cantab.net

Supervisor : Dr. A. Bender

Protein target prediction using Random Forests (RFs) trained on bioactivity data from PubChem (extracted 07/06/18) and ChEMBL (version 24). The models are built with RDKit and scikit-learn and employ a modification of the reliability-density neighbourhood Applicability Domain (AD) analysis by Aniceto [1]. This project is the successor to PIDGIN version 1 [2] and PIDGIN version 2 [3]. Target prediction with extended NCBI pathway and DisGeNET disease enrichment calculation is available, as implemented in [4].

  • Molecular Descriptors: 2048-bit RDKit Extended Connectivity FingerPrints (ECFP) [5]
  • Algorithm: Random Forests with dynamic number of trees (see docs for details), class weight = 'balanced', sample weight = ratio Inactive:Active
  • Models generated at four different cut-offs: 100μM, 10μM, 1μM and 0.1μM
  • Models generated both with and without mapping to orthologues, as implemented in [3]
  • Pathway information from NCBI BioSystems
  • Disease information from DisGeNET
  • Target/pathway/disease enrichment calculated using Fisher's exact test and the Chi-squared test
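
The enrichment step in the list above can be illustrated with a one-sided Fisher's exact test on a 2×2 contingency table. The sketch below is a minimal pure-Python version and is not PIDGIN's own code: the function name `fisher_exact_greater` and the table layout (query versus background compounds, predicted versus not predicted for a target) are assumptions made for illustration, and the tool itself may use a different test variant or library.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Hypothetical layout (an assumption, not taken from PIDGIN):
      a = query compounds predicted for the target
      b = query compounds not predicted for the target
      c = background compounds predicted for the target
      d = background compounds not predicted for the target

    Returns P(X >= a) under the hypergeometric null, i.e. the probability
    of seeing at least `a` predicted compounds in the query set by chance.
    """
    row1 = a + b          # size of the query set
    col1 = a + c          # total compounds predicted for the target
    n = a + b + c + d     # all compounds
    denom = comb(n, col1)
    upper = min(row1, col1)
    # Sum the hypergeometric probabilities of every table at least as extreme.
    numer = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, upper + 1)
    )
    return numer / denom


if __name__ == "__main__":
    # Toy counts, chosen only to exercise the function.
    print(fisher_exact_greater(3, 1, 1, 3))
```

A small p-value suggests the target is predicted more often in the query set than expected from the background. When many targets, pathways or diseases are tested at once, a multiple-testing correction would normally be applied on top of this.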

Model and data set sizes across all activity cut-offs:

                                                  Without orthologues        With orthologues
Distinct models                                   10,446                     14,678
Distinct targets [exhaustive total]               7,075 [7,075]              16,623 [60,437]
Total bioactivities over all models               39,424,168                 398,340,769
Actives                                           3,204,038                  35,009,629
Inactives [of which are Sphere Exclusion (SE)]    36,220,130 [27,435,133]    363,331,140 [248,782,698]

Full details on all models are provided in the uniprot_information.txt files in the orthologue and no_orthologue directories.

INSTRUCTIONS

Development occurs on GitHub.

Install with Conda

Documentation, installation and usage instructions are on Read the Docs.

IMPORTANT

  • Use the Read the Docs documentation! You MUST download the models before running!
  • The program accepts input either as line-separated SMILES (.smi/.smiles) or as SD files (.sdf)
  • If the SMILES input contains data additional to the SMILES string, the first entries after the SMILES are automatically interpreted as identifiers (see the OpenSMILES specification §4.5), although there are options to change this behaviour
  • Molecules are automatically standardized when running models (can be turned off)
  • Do not modify the 'pkls', 'ad_data' etc. names or directories
  • Files in the examples directory are included for testing, as used in the Read the Docs tutorials.
  • For installation and usage instructions, see the documentation.
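
The SMILES input convention described in the list above can be sketched as follows. This is a minimal illustration, assuming whitespace-separated fields with the SMILES string first and the next field taken as the identifier. The helper names `parse_smiles_line` and `read_smiles` are hypothetical and do not come from PIDGIN, whose own parsing and options may differ.

```python
def parse_smiles_line(line):
    """Split one input line into (smiles, identifier).

    Assumption: fields are whitespace-separated, the first field is the
    SMILES string and the field directly after it is the identifier.
    Returns None for blank lines; identifier is None when absent.
    """
    fields = line.split()
    if not fields:
        return None
    smiles = fields[0]
    identifier = fields[1] if len(fields) > 1 else None
    return smiles, identifier


def read_smiles(lines):
    """Parse an iterable of lines, skipping blank ones."""
    records = []
    for line in lines:
        parsed = parse_smiles_line(line)
        if parsed is not None:
            records.append(parsed)
    return records


if __name__ == "__main__":
    example = ["CCO ethanol", "", "c1ccccc1"]
    for smiles, identifier in read_smiles(example):
        print(smiles, identifier)
```

Lines without an identifier still parse, so a downstream step could fall back to the line number or the SMILES string itself as a label.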

License

PIDGINv3 is available under the GNU General Public License v3.0 (GPLv3).

References

[1] Aniceto, N., et al. A novel applicability domain technique for mapping predictive reliability across the chemical space of a QSAR: Reliability-density neighbourhood. J. Cheminform. 8: 69 (2016).
[2] Mervin, L. H., et al. Target prediction utilising negative bioactivity data covering large chemical space. J. Cheminform. 7: 51 (2015).
[3] Mervin, L. H., et al. Orthologue chemical space and its influence on target prediction. Bioinformatics 34: 72–79 (2018).
[4] Mervin, L. H., et al. Understanding Cytotoxicity and Cytostaticity in a High-Throughput Screening Collection. ACS Chem. Biol. 11: 11 (2016).
[5] Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50: 742–754 (2010).