goldman-gp-ebi/protein-identification-manuscript

This repository contains the Jupyter notebooks and scripts needed to reproduce the results of the paper "A generalized protein identification method for novel and diverse sequencing technologies".

If you wish to use our method in your own protein identification experiments, the dist directory contains a cleaned-up version of the necessary files, a program implementation of our method, sample data, and instructions to get you started.

Environment

python

Python 3.9.7 (default, Sep 16 2021, 13:09:58) 
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.

python dependencies

pandas 1.3.4
seaborn 0.12.2
matplotlib 3.4.3
numpy 1.22.3
pyhmmer 0.6.3
pandarallel 1.6.1
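
The pinned versions above can be compared against the active environment before running anything. The following is a minimal sketch, not part of the repository: the `compare_versions` helper and its `lookup` parameter are hypothetical names introduced here, and the check relies only on the standard-library `importlib.metadata` module.

```python
from importlib import metadata

# Versions as listed in this README.
PINNED = {
    "pandas": "1.3.4",
    "seaborn": "0.12.2",
    "matplotlib": "3.4.3",
    "numpy": "1.22.3",
    "pyhmmer": "0.6.3",
    "pandarallel": "1.6.1",
}


def compare_versions(pinned, lookup=metadata.version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in compare_versions(PINNED).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

A different version does not necessarily break the notebooks; the check only reports where the environment departs from the one listed here.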

HMMER

# hmmsearch :: search profile(s) against a sequence database
# HMMER 3.3.2 (Nov 2020); http://hmmer.org/
# Copyright (C) 2020 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: hmmsearch [options] <hmmfile> <seqdb>

Basic options:
  -h : show brief help on version and usage
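
To confirm that a matching HMMER is on the PATH, the banner printed by `hmmsearch -h` (shown above) can be parsed for its version. This is a rough sketch under the assumption that the banner keeps the `# HMMER <version>` line format; the function names are hypothetical and not part of the repository.

```python
import re
import shutil
import subprocess


def parse_hmmer_version(help_text):
    """Extract the version from a banner line such as '# HMMER 3.3.2 (Nov 2020); ...'."""
    match = re.search(r"^# HMMER (\S+)", help_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def installed_hmmer_version():
    """Return the version reported by hmmsearch on PATH, or None if it is missing."""
    exe = shutil.which("hmmsearch")
    if exe is None:
        return None
    result = subprocess.run([exe, "-h"], capture_output=True, text=True)
    return parse_hmmer_version(result.stdout)


if __name__ == "__main__":
    print(installed_hmmer_version() or "hmmsearch not found on PATH")
```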

Running the jupyter notebooks (*.ipynb)

The code was written in Python v3.9.7. The notebooks depend on data generated by other notebooks and scripts, e.g. to generate figures. Hence, they carry a numeric prefix that gives their order of execution.

00_database_statistics.ipynb
Now run the other .py files. They share the same temp directory, so run one file at a time to avoid conflicts. The scripts generate the results used by the following notebooks.
01-data-analysis.ipynb
02_plots.ipynb
03_combined_result_from_10_fragments.ipynb
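
Because the .py scripts share a temp directory, one way to keep them from overlapping is a small driver that starts each script only after the previous one has finished. The sketch below is an illustration, not a file in the repository: the helper names are hypothetical, and it assumes that running the numerically prefixed scripts in file-name order from the repository root is acceptable.

```python
import subprocess
import sys
from pathlib import Path


def numbered_scripts(repo_dir):
    """Return the numerically prefixed .py scripts, sorted by file name."""
    # The pattern skips helper modules such as functions.py and all notebooks.
    return sorted(Path(repo_dir).glob("[0-9][0-9]_*.py"))


def run_sequentially(scripts, dry_run=False):
    """Run each script to completion before starting the next one."""
    for script in scripts:
        print(f"Running {script.name}")
        if not dry_run:
            # check=True stops the loop if a script exits with an error.
            subprocess.run([sys.executable, script.name], cwd=script.parent, check=True)


if __name__ == "__main__":
    # Dry run by default: only list the scripts in the order they would run.
    run_sequentially(numbered_scripts("."), dry_run=True)
```

Set `dry_run=False` to actually execute the scripts; given the run times described in this README, doing so inside a batch job is likely more practical than an interactive session.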

It is recommended to run these files in an HPC environment with sufficient disk space, memory (200 - 300 GiB) and cores (~50). While protein identification for a single sequence is fast, many of the scripts attempt identification of every sequence in the database (N=20,181) for different combinations of parameters. Thus, some of the resulting files will be quite big and the process will take a long time. The scripts also create several directories for temp files. These directories will hold many temp files, which are cleared once execution completes. This cleanup step will also take some time.
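
Before submitting a long run, the machine can be compared against the figures recommended above. This is a minimal sketch: `check_resources` is a hypothetical helper, the thresholds are simply the lower ends of the ranges quoted in this README, and the memory query through `os.sysconf` is assumed to be available (it is on Linux, the platform listed under Environment).

```python
import os
import shutil

GIB = 1024 ** 3


def check_resources(cores, memory_bytes, min_cores=50, min_memory_gib=200):
    """Return a list of warnings; an empty list means both checks passed."""
    warnings = []
    if cores is None or cores < min_cores:
        warnings.append(f"cores: {cores} available, about {min_cores} recommended")
    if memory_bytes < min_memory_gib * GIB:
        warnings.append(
            f"memory: {memory_bytes / GIB:.0f} GiB available, "
            f"{min_memory_gib} GiB or more recommended"
        )
    return warnings


if __name__ == "__main__":
    # Total physical memory = number of pages * page size.
    memory = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    for line in check_resources(os.cpu_count(), memory):
        print(line)
    # No disk threshold is given in this README, so free space is only reported.
    print(f"free disk here: {shutil.disk_usage('.').free / GIB:.0f} GiB")
```

On a shared cluster, the values seen by the process may reflect the whole node rather than the job's allocation, so the scheduler's own limits remain the authoritative check.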

Funding

EU Horizon 2020 grant agreement no. 964363

Citation

Bikash Kumar Bhandari, Nick Goldman, A generalized protein identification method for novel and diverse sequencing technologies, NAR Genomics and Bioinformatics, Volume 6, Issue 3, September 2024, lqae126, https://doi.org/10.1093/nargab/lqae126

Folders and files

data
dist
figs
.gitignore
00_database_statistics.ipynb
01-data-analysis.ipynb
01_full_length_different_probabilities.py
02_1000_fragments_length_5_probabilitiy_0.9.py
02_10_fragments_length_100_different_probabilities.py
02_10_fragments_length_10_different_probabilities.py
02_10_fragments_length_15_different_probabilities.py
02_10_fragments_length_25_different_probabilities.py
02_10_fragments_length_50_different_probabilities.py
02_10_fragments_length_5_different_probabilities.py
02_plots.ipynb
03_combined_result_from_10_fragments.ipynb
03_full_length_different_probabilities_reduced_AA.py
04_1_fragment_length_100_different_probabilities_reduced_AA.py
04_1_fragment_length_50_different_probabilities_reduced_AA.py
05_full_length_probabilities_0.8_indels_all_aa_and_reduced_aa.py
06_1_fragment_length_100_probabilities_0.8_indels_all_aa_and_reduced_aa.py
06_1_fragment_length_50_probabilities_0.8_indels_all_aa_and_reduced_aa.py
07_full_length_all_aa_different_probabilites_reduced_indel_prob.ipynb
08_substituting_new_aa.ipynb
09_ecoli_mouse_humans_hard_cases_and_repeat_proteins.ipynb
README.md
functions.py
functions1.py
functions2.py
test.hmm
