StanfordBioinformatics/Hummingbird

Hummingbird: Efficient Performance Prediction for Executing Genomic Applications in the Cloud

Overview

Hummingbird is a Python framework that recommends a variety of optimal instance configurations for running your favorite genomics pipeline on cloud platforms.

Hummingbird takes as input the information required to run a cloud job and generates a set of instance configurations that the user can use to run the pipeline in the cloud. The user can choose among these configurations, such as the fastest, the cheapest, and the most efficient. A detailed explanation of these configurations can be found in a later section of this README.
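To make the three kinds of recommendation concrete, here is a minimal sketch of how one might pick them from a list of candidates. It is not Hummingbird's code: the `Candidate` class, the `pick_configurations` function, the configuration names, and the numbers are all hypothetical, and scoring "most efficient" as the lowest runtime-times-cost product is an assumption made only for this illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Candidate:
    """A hypothetical instance configuration with predicted runtime and cost."""
    name: str
    runtime_hours: float
    cost_usd: float


def pick_configurations(candidates: List[Candidate]) -> Dict[str, Candidate]:
    """Return the fastest, cheapest, and most efficient candidates.

    ASSUMPTION: "most efficient" is scored here as the smallest
    runtime * cost product; Hummingbird's own definition may differ.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    return {
        "fastest": min(candidates, key=lambda c: c.runtime_hours),
        "cheapest": min(candidates, key=lambda c: c.cost_usd),
        "most_efficient": min(
            candidates, key=lambda c: c.runtime_hours * c.cost_usd
        ),
    }


# Made-up numbers for illustration only; these are not measurements.
candidates = [
    Candidate("config-a", runtime_hours=2.0, cost_usd=8.0),
    Candidate("config-b", runtime_hours=6.0, cost_usd=3.0),
    Candidate("config-c", runtime_hours=3.0, cost_usd=4.0),
]

picks = pick_configurations(candidates)
for label, choice in picks.items():
    print(f"{label}: {choice.name}")
```

The point of the sketch is that the three recommendations can name three different configurations, which is why Hummingbird reports several options rather than a single answer.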

What sets Hummingbird apart is that it downsamples the input files, runs the whole computational pipeline on the downsampled files, and then provides the user with different optimal instance configurations. Users therefore obtain the resulting configurations in far less time than it would take to run the entire pipeline on the full input file(s) for each instance configuration.
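As a rough illustration of what downsampling means for a FASTQ input, the sketch below keeps each four-line read record with a fixed probability. It is an assumption-laden example rather than Hummingbird's implementation: the function name `downsample_fastq` and its parameters are hypothetical, and it handles only uncompressed, well-formed four-line records.

```python
import random
from typing import Iterable, Iterator, List


def downsample_fastq(
    lines: Iterable[str], fraction: float, seed: int = 0
) -> Iterator[List[str]]:
    """Yield a random subset of FASTQ records (4 lines per record).

    Each record is kept independently with probability `fraction`.
    A fixed `seed` makes the selection reproducible.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in the interval (0, 1]")
    rng = random.Random(seed)
    record: List[str] = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:  # header, sequence, separator, qualities
            if rng.random() < fraction:
                yield record
            record = []


# Build a tiny in-memory FASTQ with ten reads, for demonstration only.
demo_lines: List[str] = []
for i in range(10):
    demo_lines += [f"@read{i}", "ACGT", "+", "IIII"]

sampled = list(downsample_fastq(demo_lines, fraction=0.5, seed=1))
print(f"kept {len(sampled)} of 10 reads")
```

Running the pipeline on a subset like this is what makes the profiling runs short; how results from the subset are scaled to the full input is Hummingbird's job and is not modelled here.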

Currently, Hummingbird supports Google Cloud (GCP), Amazon Web Services (AWS), and Microsoft Azure, and we hope to add other cloud providers in the future.

Installation Instructions

Hummingbird can be installed using pip:

pip install CloudHummingbird

It is recommended to pass --install-option="--prefix=$PREFIX_PATH" to pip when installing Hummingbird. This gives easy access to the sample configuration files located in conf/examples, which you may want to consult when writing configuration files for your own computational pipeline. Alternatively, the sample configuration files can be found at: <virtualenv_name>/lib/<python_ver>/site-packages/Hummingbird/conf/examples
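If you are unsure where that site-packages directory is in your environment, a small script can compute the path for you. This is a sketch that assumes the package was installed into the active interpreter's default site-packages and that the directory layout matches the path quoted above; the helper name `example_conf_dir` is hypothetical.

```python
import sysconfig
from pathlib import Path


def example_conf_dir() -> Path:
    """Return the expected location of Hummingbird's sample configurations.

    ASSUMPTION: the package lives under the active interpreter's
    site-packages as Hummingbird/conf/examples, as described above.
    """
    site_packages = Path(sysconfig.get_paths()["purelib"])
    return site_packages / "Hummingbird" / "conf" / "examples"


conf_dir = example_conf_dir()
print(conf_dir)
if conf_dir.is_dir():
    # List the sample configuration files that ship with the package.
    for path in sorted(conf_dir.iterdir()):
        print("  ", path.name)
else:
    print("Directory not found; Hummingbird may be installed elsewhere.")
```

Run it with the same Python interpreter (or activated virtual environment) that you used for the installation, since the result depends on which interpreter executes it.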

Hummingbird requires pip and Python 3 as prerequisites for installation.

It is highly recommended to use a virtual environment to isolate the execution environment. Create a virtual environment, and then activate it:

source <virtual-environment-name>/bin/activate

This section explains how to get started on Google Cloud, AWS and Azure.

This section provides instructions for executing a sample run of BWA on Google Cloud using Hummingbird.

This section provides information about the configuration file and how to edit it

This section provides information about how to execute Hummingbird

This section provides a guide to interpret the results provided by Hummingbird

This section provides a guide for users who want to leverage the downsampling step in Hummingbird but have input files in formats other than BAM or fastq/fastq.gz.

This section provides a guide to downsampling techniques other than the ones supported by Hummingbird.

Section 8: Workflow Parser

This section explains how Hummingbird parses workflows provided by the user

This section explains how Hummingbird takes advantage of the container technology for execution

Section 10: I/O Profiling

This section explains how future versions of Hummingbird will also profile I/O throughput.

Section 11: Fault Tolerance

This section describes the fault-tolerance capabilities of Hummingbird.

This section lists all required components for running Hummingbird on a cloud platform provider.

  • Logo Credit: Camille Berry