Released metaWRAP v=0.1

bxlab · Jul 21, 2017 · a338737 · a338737
1 parent b6b975b
commit a338737
Show file tree

Hide file tree

Showing 39 changed files with 3,234 additions and 0 deletions.
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,7 @@
+Copyright 2017 German Uritskiy
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
diff --git a/README.md b/README.md
@@ -0,0 +1,154 @@
+## Introducing metaWRAP v=0.1 - a Comprehensive Metagenome Analysis Pipeline for Beginners
+
+
+ metaWRAP aims to be an easy-to-use inclusive wrapper program that accomplishes the most basic tasks in metagenomic analysis: QC, assembly, binning, visualization, and taxonomic profiling. While there is no single best approach for processing metagenomic data, metaWRAP is meant to be a fast and simple first pass program before you delve deeper into parameterization of your approach.
+
+  There are 5 major parts to the pipeline:
+
+    1) Read QC
+
+    2) Assembly
+
+    3) Binning
+
+    4) Taxonomic profiling
+
+    5) Visualization with Blobplots
+
+
+  ![Detailed pipeline walkthrough](http://i.imgur.com/VZmKuoZ.png)
+
+
+## INSTALATION
+
+ The core of metaWRAP is a bash script, so the program does not need any installation. Just download or clone the repository into your desired location. Just know that the core scripts need to be in the same location as the config.sh file, which contains all the paths to databases that metaWRAP uses. Once you have the scripts, go into the config.sh file, and edit all the paths to their correct locations. Have a look at the “DEPENDENCIES” section for more detail. Once everything is configured, you can run metaWRAP or any of the individual module scripts to display the help message.
+
+
+## DEPENDENCIES
+
+  Since this is a wrapper program, the biggest challenge in installing metaWRAP will likely be configuring all the dependencies correctly. Firstly, the path to the folder "meta-scripts", containing numerous scripts required for running this pipeline, needs to be configured in config.sh. Next, the following programs need to be installed in your PATH. NOTE: the versions of the programs may or may not be important.
+
+1)	BLAST (tested on v=2.6.0)
+2)	bmtagger (tested on v=3.101)
+3)	Bowtie2 (tested on v=2.3.2)
+4)	bwa (tested on v=0.7.15-r1140)
+5)	Checkm (tested on v=v1.0.7)
+6)	FastQC (tested on v=v0.11.5)
+7)	kraken (tested on v=0.10.6)
+8)	kronatools (tested on v=2.7)
+9)	megahit (tested on v=1.1.1-2)
+10)	metabat2 (tested on v=2.9.1)
+11)	perl (tested on v=5.22.0)
+12)	python (tested on v=2.7.1)
+13)	quast (tested on v=4.5)
+14)	R (tested on v=3.3.2)
+15)	samtools (tested on v=1.3.1)
+16)	SPAdes (tested on v=3.10.1)
+17)	trim_galore (tested on v=0.4.3)
+
+ The installation of most of these dependencies should not be difficult even in non-sudo environments with the use of conda. Future versions of metaWRAP are expected to have more detailed installation instructions, but for now you are on your own.
+
+
+## DATABASES
+
+Finally, you will need to download several databases and configure their paths in the config.sh file. This may be the longest step of the installation. Here is a full list of the databases:
+
+1)	Checkm_DB (1.4GB) – CheckM should prompt you to download this when you run it for the first time
+2)	KRAKEN standard database (161GB) – look at the official KRAKEN support website for download instructions 
+3)	RefSeq NCBI_nt (71GB) – look at the config.sh for download instructions
+4)	RefSeq NCBI_tax (283MB) – look at the config.sh for download instructions
+5)	Indexed hg38 (20GB) – look at the bmtagger manual for instructions
+
+
+## USAGE
+
+Once all the dependencies are in place, running metaWRAP is relatively simple. You can chose to run metaWRAP itself, which will run all of the modules, or run individual modules as you wish. 
+
+metaWRAP:
+Usage: ./metaWRAP [options] -o output_folder raw_readsA_1.fastq raw_readsA_2.fastq
+Options:
+
+	-o STR          output directory
+	-t INT          number of threads (default=1)
+	-m INT          memory in GB (default=4)
+	-1 STR		forward read file
+	-2 STR		reverse read file
+
+
+read_qc.sh:
+Usage: ./read_qc.sh [options] -1 reads_1.fastq -2 reads_2.fastq -o output_dir
+Options:
+
+	-1 STR          forward fastq reads
+	-2 STR          reverse fastq reads
+	-o STR          output directory
+	-t INT          number of threads
+
+
+assembly.sh:
+Usage: ./assembly.sh [options] -1 reads_1.fastq -2 reads_2.fastq -o output_dir -t THREADS -m MEMORY
+Options:
+
+	-1 STR          forward fastq reads
+	-2 STR          reverse fastq reads
+	-o STR          output directory
+	-m INT          memory in GB
+	-t INT          number of threads
+
+
+blobology.sh:
+Usage: ./blobology.sh [options] -a assembly.fasta-1 reads_1.fastq -2 reads_2.fastq -o output_dir
+Options:
+
+	-a STR		assembly fasta file
+	-1 STR          forward fastq reads
+	-2 STR          reverse fastq reads
+	-o STR          output directory
+	-t INT          number of threads
+	-n INT		number of contigs to plot (default=ALL)
+
+
+binning.sh:
+Usage: ./binning.sh [options] -a assembly.fa -o output_dir readsA_1.fastq readsA_2.fastq ... [readsX_1.fastq readsX_2.fastq]
+Options:
+
+	-a STR          metagenomic assembly
+	-o STR          output directory
+	-t INT          number of threads (default=1)
+	-m INT          memory in GB (default=4)
+
+
+kraken.sh:
+./kraken.sh [options] -o output_dir assembly.fasta reads_1.fastq reads_2.fastq ...
+Options:
+
+	-o STR          output directory
+	-t INT          number of threads
+	-d INT		read sampling depth (default=all)
+
+
+
+
+Here are examples of running metaWRAP and its sub-modules:
+
+./metaWRAP –t 24 –m 120 -o metawrap_out -1 sample_1.fastq -2 sample_2.fastq 
+./read_qc.sh –t 100 -1 reads_1.fastq -2 reads_2.fastq -o output_dir
+./assembly.sh [options] -1 reads_1.fastq -2 reads_2.fastq -o assembly_dir -t 120 -m 800
+./blobology.sh –t 10-m 20 -a final_assembly.fasta-1 reads_1.fastq -2 reads_2.fastq –o blobology_out
+./binning.sh –t 48 –m 100 -a assembly.fa –o binning_out sampleA_1.fastq sampleA_2.fastq sampleX_1.fastq sampleX_2.fastq
+./kraken.sh –t 24 –o kraken_out assembly.fasta reads_1.fastq reads_2.fastq
+
+
+
+###  System requirements
+ The resource requirements for this pipeline will vary greatly based on the number of reads you are processing, but I would advise against attempting to run it on anything less than 10 cores and 100GB RAM. With the help of conda, installing metaWRAP and its dependencies on a cluster (even without sudo privileges) should be relatively easy even for beginners.
+
+
+### Acknowledgements
+Author of pipeline: German Uritskiy.
+Principal Investigators: [James Taylor](http://bio.jhu.edu/directory/james-taylor/) and [Jocelyne DiRuggiero](http://bio.jhu.edu/directory/jocelyne-diruggiero/)
+Institution: Johns Hopkins, [Department of Cell, Molecular, Developmental Biology, and Biophysics](http://cmdb.jhu.edu/) 
+
+I do not claim to have any authorship of the many programs this pipeline uses. For questions, bugs, and suggestions, contact me at guritsk1@jhu.edu, or leave a comment on this github page.
+
+
diff --git a/assembly.sh b/assembly.sh
@@ -0,0 +1,185 @@
+#!/bin/bash -l
+
+##############################################################################################################################################################
+#
+# This script is meant to be a comprehensive solution for producing the best metagenomic assembly given paired end reads from one or more samples.
+# Ideally it should take in fully QC'd reads. First, the reads are assembled with metaSPAdes3.10, then all the reads that did not map back to the
+# contigs are re-assembled with MEGAHIT (which works better on lower coverage contigs. The resulting assemblies are combined, sorted, and short 
+# contigs are removed. The finall assembly is then QCed by QUAST.
+#
+# Author of pipeline: German Uritskiy. I do not have any authorship of the many programs this pipeline uses.
+# For questions, bugs, and suggestions, contact me at guritsk1@jhu.edu.
+# 
+##############################################################################################################################################################
+
+
+
+help_message () {
+	echo "Usage: ./assembly.sh [options] -1 reads_1.fastq -2 reads_2.fastq -o output_dir -t THREADS -m MEMORY"
+	echo "Options:"
+	echo ""
+	echo "	-1 STR          forward fastq reads"
+	echo "	-2 STR          reverse fastq reads" 
+	echo "	-o STR          output directory"
+	echo "	-m INT          memory in GB"
+	echo "	-t INT          number of threads"
+	echo "";}
+# function to print out error messages
+error () {
+	echo ""; echo "************************************************************************************************************************"
+	echo "*****************************************            ERROR!           **************************************************"
+	echo $1
+	echo "************************************************************************************************************************"; echo ""; exit 1; }
+# function to print out warning messages
+warning () {
+	echo "************************************************************************************************************************"
+	echo "*****************************************           WARNING!          **************************************************"
+	echo " "; echo $1; echo " "
+	echo "************************************************************************************************************************";}
+# funciton to print out comments throught the pipeline
+comm () {
+	echo ""; echo "----------------------------------------------------------------------------------------------"
+	echo $1
+	echo "----------------------------------------------------------------------------------------------"; echo ""; }
+
+
+
+########################################################################################################
+########################               LOADING IN THE PARAMETERS                ########################
+########################################################################################################
+
+
+# setting scripts and databases from config file (should be in same folder as main script)
+source ${0%/*}/config.sh
+
+
+threads=1; out="false"; read_1="false"; read_2="false"
+# Load in options
+while getopts ht:m:o:1:2: option; do
+	case "${option}" in
+		h) help_message; exit 1;;
+		t) threads=${OPTARG};;
+		m) mem=$OPTARG;;
+		o) out=${OPTARG};;
+		1) reads_1=$OPTARG;;
+		2) reads_2=$OPTARG;;
+	esac
+done
+
+
+
+
+########################################################################################################
+########################           MAKING SURE EVERYTHING IS SET UP             ########################
+########################################################################################################
+
+# check if all parameters are entered
+if [ "$out" = "false" ] || [ "$reads_1" = "false" ] || [ "$reads_2" = "false" ] || [ "$threads" = "false" ] || [ "$mem" = "false" ]; then 
+	help_message; exit 1
+fi
+
+# Checks for correctly configures meta-scripts folder
+if [ ! -s $SOFT/sort_contigs.py ]; then
+	error "The folder $SOFT doesnt exist. Please make sure the meta-scripts path is configured correctly in config.sh file"
+fi
+
+
+########################################################################################################
+########################                    BEGIN PIPELINE!                     ########################
+########################################################################################################
+
+
+
+
+
+echo " "
+echo "########################################################################################################"
+echo "########################               ASSEMBLING WITH METASPADES               ########################"
+echo "########################################################################################################"
+echo " "
+
+mkdir $out
+
+comm "Using reads $reads_1 and $reads_2 for assembly,"
+if [ ! -f "$reads_1" ] || [ ! -f "$reads_2" ]; then error "Read files $reads_1 and/or $reads_2 dont exist. Exiting."; fi
+
+metaspades.py --tmp-dir ${out}/metaspades.tmp -t $threads -m $mem -o ${out}/metaspades -1 $reads_1 -2 $reads_2 
+rm -r ${out}/metaspades.tmp
+if [ ! -f "${out}/metaspades/scaffolds.fasta" ]; then error "Something went wrong with metaSPAdes assembly. Exiting."; fi
+
+
+
+echo " "
+echo "########################################################################################################"
+echo "########################              SORTING OUT UNASSEMBLED READS             ########################"
+echo "########################################################################################################"
+echo " "
+
+#take only the scaffolds over 1.5kb (the minimum needed for binning with metaBAT):
+${SOFT}/rm_short_contigs.py 1500 ${out}/metaspades/scaffolds.fasta > ${out}/metaspades/long_scaffolds.fasta
+if [ ! -s ${out}/metaspades/long_scaffolds.fasta ]; then
+	error "metaSPAdes produced no contigs over 1.5kb, so the rest of the pipeline wont work. Exiting."; fi
+
+
+bwa index ${out}/metaspades/long_scaffolds.fasta
+
+#sort out and store reads that dont map back to the assembly:
+bwa mem -t $threads ${out}/metaspades/long_scaffolds.fasta $reads_1 $reads_2 | grep -v NM:i: > ${out}/unused_by_metaspades.sam
+${SOFT}/sam_to_fastq.py ${out}/unused_by_metaspades.sam > ${out}/unused_by_metaspades.fastq
+#rm ${out}/unused_by_metaspades.sam
+
+if [[ ! -s ${out}/unused_by_metaspades.fastq ]]; then error "Something went wrong with pulling out unassembled reads. Exiting."; fi
+
+
+
+echo " "
+echo "########################################################################################################"
+echo "########################        ASSEMBLING REMAINDER READS WITH MEGAHIT         ########################"
+echo "########################################################################################################"
+echo " "
+
+rm -r ${out}/megahit
+megahit -r ${out}/unused_by_metaspades.fastq -o ${out}/megahit -t $threads -m ${mem}000000000
+if [ ! -f "${out}/megahit/final.contigs.fa" ]; then error "Something went wrong with reassembling with Megahit. Exiting."; fi
+mv ${out}/unused_by_metaspades.fastq ${out}/metaspades/
+
+
+echo " "
+echo "########################################################################################################"
+echo "########################         COMBINE AND FORMAT THE TWO ASSEMBLIES          ########################"
+echo "########################################################################################################"
+echo " "
+
+${SOFT}/fix_megahit_contig_naming.py ${out}/megahit/final.contigs.fa > ${out}/megahit/fixed.contigs.fa
+${SOFT}/rm_short_contigs.py 500 ${out}/megahit/fixed.contigs.fa > ${out}/megahit/long.contigs.fa
+
+cp ${out}/metaspades/long_scaffolds.fasta ${out}/combined_assembly.fasta
+cat ${out}/megahit/long.contigs.fa >> ${out}/combined_assembly.fasta
+${SOFT}/sort_contigs.py ${out}/combined_assembly.fasta > ${out}/final_assembly.fasta
+if [[ ! -s ${out}/final_assembly.fasta ]]; then error "Something went wrong with joining the assemblies. Exiting."; fi
+rm ${out}/combined_assembly.fasta
+
+
+echo " "
+echo "########################################################################################################"
+echo "########################             RUNNING ASSEMBLY QC WITH QUAST             ########################"
+echo "########################################################################################################"
+echo " "
+
+cp ${out}/metaspades/scaffolds.fasta ${out}/metaspades/metaspades_assembly.fasta
+quast -t $threads -o ${out}/QUAST_out -m 500 ${out}/final_assembly.fasta ${out}/metaspades/metaspades_assembly.fasta
+cp ${out}/QUAST_out/report.html ${out}/assembly_report.html
+
+if [[ ! -s ${out}/assembly_report.html ]]; then error "Something went wrong with running QUAST. Exiting."; fi
+
+
+echo " "
+echo "########################################################################################################"
+echo "########################      ASSEMBLY PIPELINE COMPLETED SUCCESSFULLY!!!       ########################"
+echo "########################################################################################################"
+echo " "
+
+
+
+
+