A reconfigurable machine learning pipeline for labeling ORFs/proteins in bacteriophage genomes and metagenomic data.

Dreycey/PhageScanner

License: GPL-3.0

PhageScanner

PhageScanner is a command line tool for identifying phage virion proteins (PVPs) using metagenomic sequencing data as input. For comprehensive information about installation and usage, please visit the PhageScanner Wiki.

Subscribe to email list: Subscribe

Unsubscribe from email list: Unsubscribe

Installation (Mac and Linux)

Installing direct dependencies

The Python dependencies can be installed using the requirements.txt file provided in the root of the repository.

python -m pip install -r requirements.txt

Installing command line tool dependencies

There are several command line tools that PhageScanner uses within the pipeline: (1) CD-HIT, (2) BLAST, (3) Megahit, and (4) Phanotate. Many of these tools are commonly-used bioinformatics tools that you may already have installed. However, please refer to the PhageScanner Wiki if you'd like more guidance installing these dependencies.
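
Before running the pipelines, it can help to confirm that these tools are reachable from your shell. The following is a minimal sketch, not part of PhageScanner: the function name is hypothetical, and the executable names for CD-HIT and BLAST (`cd-hit`, `blastp`) are assumptions, since only `megahit` and `phanotate.py` appear as defaults in the usage text.

```python
import shutil

# Executable names are assumptions; only 'megahit' and 'phanotate.py' appear
# as defaults in the PhageScanner usage text. Adjust to your installation.
ASSUMED_EXECUTABLES = {
    "CD-HIT": "cd-hit",
    "BLAST": "blastp",
    "Megahit": "megahit",
    "Phanotate": "phanotate.py",
}


def find_missing_tools(executables=ASSUMED_EXECUTABLES, which=shutil.which):
    """Return the names of tools whose executable is not found on PATH."""
    return [tool for tool, exe in executables.items() if which(exe) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All command line dependencies were found.")
```

If a tool is installed under a different name or outside PATH, the `--cdhit_path`, `--megahit_path`, and `--phanotate_path` options shown in the usage below can point PhageScanner at it directly.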

Installing using Docker (Windows, Mac and Linux)

The easiest way to use PhageScanner is through Docker. Docker makes PhageScanner usable on Windows and removes the need to install the command line tool dependencies. To install Docker, follow the directions here: https://docs.docker.com/desktop/install/. On Windows, we installed Docker using WSL2.

Using the Docker image hosted on DockerHub

PhageScanner is hosted on DockerHub at https://hub.docker.com/r/dreyceyalbin/phagescanner. Once Docker is installed, the image can be downloaded and the tool run directly.

  • Pull down the docker image from DockerHub
docker pull dreyceyalbin/phagescanner
  • Test that the help message prints
docker run --rm dreyceyalbin/phagescanner --help

Building Docker image locally

The Docker image can be built locally to allow for more flexibility. There are two steps involved in this process:

  • Navigate to the Docker/ directory and run:
docker build -t dreyceyalbin/phagescanner .
  • Test that the help message prints
docker run --rm dreyceyalbin/phagescanner --help

Pipeline Usage

PhageScanner consists of three pipelines, each feeding into the next: (1) downloading the training dataset, (2) training the machine learning models, and (3) using the models to annotate genomes and metagenomic datasets. Each pipeline is configurable, which lets end users create new models that predict new protein classes (e.g., "Toxic Protein", "Phage Virion Protein", "Lysogenic"). To use the examples listed below as-is, run them from the repository root directory.

  1. Build the database
    • Basic usage
    python phagescanner.py database [-h] -c CONFIG -o OUT [--cdhit_path CDHIT_PATH (Default: 'cdihit')] [-v VERBOSITY]
    
    • Example (multiclass pvps)
    python phagescanner.py database -c configs/multiclass_config.yaml -o ./multiclass_database/ -v info
    
    • Example using Docker (multiclass pvps)
    docker run --rm \
        -v "$(pwd)/configs:/app/configs" \
        -v "$(pwd)/multiclass_database:/app/multiclass_database" \
        dreyceyalbin/phagescanner database -c /app/configs/multiclass_config.yaml -o /app/multiclass_database/ -v info
    
  2. Train and test ML models
    • Basic usage
    python phagescanner.py train [-h] -c CONFIG -o OUT -db DATABASE_CSV_PATH [-v VERBOSITY]
    
    • Example (multiclass pvps)
    python phagescanner.py train -c configs/multiclass_config.yaml -o training_output --database_csv_path ./multiclass_database/ -v debug
    
    • Example using Docker (multiclass pvps)
    docker run --rm \
        -v "$(pwd)/configs:/app/configs" \
        -v "$(pwd)/multiclass_database:/app/multiclass_database" \
        -v "$(pwd)/training_output:/app/training_output" \
        dreyceyalbin/phagescanner train -c /app/configs/multiclass_config.yaml -o /app/training_output --database_csv_path /app/multiclass_database/ -v debug
    
  3. Run on metagenomic data, genomes or proteins
    • Basic usage
    python phagescanner.py predict [-h] -i INPUT -t TYPE ("reads", "genome", or "protein") -c CONFIG -o OUT -n NAME -tdir TRAINING_OUTPUT
                                [--megahit_path MEGAHIT_PATH (Default: 'megahit')] [--phanotate_path PHANOTATE_PATH (Default: 'phanotate.py')]
                                [--probability_threshold PROBABILITY_THRESHOLD] [-v VERBOSITY]
    
    • Example (genomes; though sequencing reads and proteins can be used as input)
    python phagescanner.py predict -t "genome" -c configs/multiclass_config.yaml -n "OUTPREFIX" -tdir ./training_output/ -o prediction_output -i examples/GCF_000912975.1_ViralProj227117_genomic.fna -v debug
    
    • Example using Docker (genomes)
    docker run --rm \
        -v "$(pwd)/configs:/app/configs" \
        -v "$(pwd)/examples:/app/examples" \
        -v "$(pwd)/prediction_output:/app/prediction_output" \
        -v "$(pwd)/training_output:/app/training_output" \
        dreyceyalbin/phagescanner predict -t "genome" -c /app/configs/multiclass_config.yaml -o /app/prediction_output -n "OUTPREFIX" -tdir /app/training_output/ -i /app/examples/GCF_000912975.1_ViralProj227117_genomic.fna -v debug
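
Because each stage consumes the output directory of the previous one, the three commands can be chained from a small script. The following is a rough sketch under stated assumptions: `build_pipeline_commands` and `run_pipeline` are hypothetical helper names, not part of PhageScanner, and the flags are copied from the examples above rather than checked against the argument parser.

```python
import subprocess
import sys


def build_pipeline_commands(config, database_dir, training_dir, prediction_dir,
                            input_path, input_type="genome", name="OUTPREFIX"):
    """Return argv lists for the database, train, and predict stages.

    Flags mirror the README examples; the helper itself is hypothetical.
    """
    base = [sys.executable, "phagescanner.py"]
    return [
        # Stage 1: download and build the training database.
        base + ["database", "-c", config, "-o", database_dir, "-v", "info"],
        # Stage 2: train models on the database written by stage 1.
        base + ["train", "-c", config, "-o", training_dir,
                "--database_csv_path", database_dir, "-v", "info"],
        # Stage 3: predict using the models written by stage 2.
        base + ["predict", "-t", input_type, "-c", config, "-n", name,
                "-tdir", training_dir, "-o", prediction_dir,
                "-i", input_path, "-v", "info"],
    ]


def run_pipeline(commands):
    """Run each stage in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Calling `run_pipeline(build_pipeline_commands(...))` from the repository root with the paths used in the examples above would run the stages in sequence. Passing argument lists instead of a single shell string avoids quoting problems with paths.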
    

PhageScanner GUI

PhageScanner has a GUI for viewing the results of the prediction pipeline. It allows for visually mining proteins that may be interesting for further analysis, and for observing where proteins appear within a genome or contig (along with synteny).

Usage

  1. Create images from the output of running the predict pipeline.
    • run the predict pipeline on genomes or reads
    python phagescanner.py predict -c configs/prediction.yaml -t "genome" -o prediction_output -n "genomes" -i examples/GCF_000912975.1_ViralProj227117_genomic.fna -v debug
    
    • use output from the predict pipeline to create images
    python phagescanner_gui.py create_images -p prediction_output/genomes_predictions.csv -o output_images/
    
  2. Open the GUI using the path to the prediction output and the images path.
    • Open the GUI
    python phagescanner_gui.py gui -p prediction_output/genomes_predictions.csv -o output_images/
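
Before opening the GUI, it can be useful to check that the predictions CSV exists and is not empty. The sketch below is a generic CSV summary using only the standard library. It makes no assumption about which columns PhageScanner writes, and `summarize_predictions` is a hypothetical helper name.

```python
import csv


def summarize_predictions(csv_path):
    """Return (column_names, row_count) for a predictions CSV file."""
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        # Iterating consumes the data rows; the header is read on first use.
        row_count = sum(1 for _ in reader)
        return list(reader.fieldnames or []), row_count
```

For example, `summarize_predictions("prediction_output/genomes_predictions.csv")` returns the header names and the number of data rows, which gives a quick indication of whether the predict pipeline produced any output.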
    
