Skip to content

Protein Function Prediction Embedded in a Multi-species World (Bioinformatics Advances)

License

Notifications You must be signed in to change notification settings

v0rtex20k/MUNDO

Repository files navigation

MUNDO: Protein Function Prediction Embedded in a Multi-species World

Authors:

  • Victor Arsenescu
  • Kapil Devkota
  • Mert Erden
  • Polina Shpilker
  • Matthew Werenski
  • Professor Lenore J Cowen

(Published to Bioinformatics Advances 9/29/21)


 

MUNDO has four main steps:

  • Network Processing
  • Landmark Selection
  • Co-embeddings
  • Function Prediction

In order for each network to be processed successfully, the name of each organism and the name of the database where its interaction file was downloaded is needed. If the name of either organism or the database is not recognized, network processing will fail

A list of supported databases is given below:

  • BIOGRID
  • BIOPLEX
  • DIP
  • GENEMANIA
  • GIANT
  • HUMANNET
  • REACTOME
  • STRING
  • LEGACY_BIOGRID (internal Tufts BCB group format)

A mapping of supported organism UniProt IDs to their common names is given below:

  • ARATH : Arabidopsis thaliana (Mouse-Ear Cress)
  • CAEEL : Caenorhabditis elegans (Worm)
  • CANLF : Canis lupus familiaris (Dog)
  • CHICK : Gallus gallus domesticus (Chicken)
  • DICTY : Dictyostelium discoideum (Slime Mold)
  • DROME : Drosophila Melanogaster (Fly)
  • HUMAN : Homo Sapiens (Human)
  • MOUSE : Mus musculus (Mouse)
  • PIG : Sus scrofa (Pig)
  • RAT : Rattus norvegicus (Rat)
  • SCHPO : Schizosaccharomyces pombe (Fission Yeast)
  • YEAST : Saccharomyces cerevisiae (Baker's Yeast)

Once the networks have been processed using network_processing.py, several BLASTP files will be saved to the directory specified by the user. They will look like HUMAN_source1.txt, MOUSE_target3.txt, etc

After locating these files, please do the following:

  • Visit NCBI's BLASTP tool at https://blast.ncbi.nlm.nih.gov/Blast.cgi?
  • On the search page, in the "Enter Query Sequence" box, select "Choose File". Select <organism name>_source_<n>.txt (n is some multiple of 500)
  • Select the "Align two or more sequences" checkbox
  • In the "Enter Subject Sequence" box, select "Choose File". Select <organism name>_target_<m>.txt (m is some multiple of batch size). Note that m may or may not equal n
  • Hit the "BLAST" button at the bottom of the page
  • After the search results load, download them as a TEXT file.
  • Repeat steps 4-6 with <organism name>_source_<m>.txt as the "Subject Sequence" and <organism name>_target_<n>.txt as the "Subject Sequence" (reciprocal direction)
  • Repeat steps 4-7 with each source and target file pair as needed (reccomended is at least 2 pairs, meaning 6 total BLASTP queries). Note that files with lower numbers (source1.txt, source2.txt) on average have more landmarks than higher number files
  • Run landmark_selection.py with the filepath to each pair of BLASTP results files (in only one direction). Note that landmark_selection.py appends the landmarks found in each pair to the existing reciprocal_best_hits.txt file. This calls blast_tools.py internally.
  • Run coembedding.py to obtain the combined embedding and target pairwise distance matrices. This calls linalg.py internally.
  • Run predict.py to obtain final accuracies. This calls metrics.py internally.

 

Additional Table S10 (Refer to Supplement)

 

Additional Table S10 (Refer to Supplement)

About

Protein Function Prediction Embedded in a Multi-species World (Bioinformatics Advances)

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages