This repository contains all the Jupyter notebooks and scripts to reproduce the results of the paper A generalized protein identification method for novel and diverse sequencing technologies.
If you wish to use our method in your protein identification experiments, the dist directory contains a cleaned up version of the necessary files, a program implementation of our method, sample data and instructions to get you started.
python
Python 3.9.7 (default, Sep 16 2021, 13:09:58)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
python dependencies
pandas 1.3.4
seaborn 0.12.2
matplotlib 3.4.3
numpy 1.22.3
pyhmmer 0.6.3
pandarallel 1.6.1
# hmmsearch :: search profile(s) against a sequence database
# HMMER 3.3.2 (Nov 2020); http://hmmer.org/
# Copyright (C) 2020 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: hmmsearch [options] <hmmfile> <seqdb>
Basic options:
-h : show brief help on version and usage
The code was written in Python v3.9.7. The notebooks depend upon the data generated from other notebooks and scripts for eg. to generate figures. Hence, they are ordered using numeric prefix in their order of execution.
- 00_database_statistics.ipynb
- Please run other
.py
files now. They share the sametemp
directory so please run one file at a time to avoid conflicts. The scripts will generate results to be used by the following notebooks. - 01-data-analysis.ipynb
- 02_plots.ipynb
- 03_combined_result_from_10_fragments.ipynb
It is recommended to run these files in a HPC environment with sufficient access to disk space, memory (200 - 300 GiB) and cores (~50). While the protein identification for a single sequence is fast, many of the scripts will attempt identification of each sequences in the database (N=20,181) for different combinations of parameters. Thus, some of the resulting files will be quite big and the process will take a long time. The scripts will also create several directories for temp files. There will be many temp files in those directores, but are cleared once the execution completes. This step will also take some time.
EU Horizon 2020 grant agreement no. 964363
- Bikash Kumar Bhandari, Nick Goldman, A generalized protein identification method for novel and diverse sequencing technologies, NAR Genomics and Bioinformatics, Volume 6, Issue 3, September 2024, lqae126, https://doi.org/10.1093/nargab/lqae126