Skip to content

Commit

Permalink
Released metaWRAP v=0.1
Browse files Browse the repository at this point in the history
  • Loading branch information
German Uritskiy authored and German Uritskiy committed Jul 21, 2017
1 parent b6b975b commit a338737
Show file tree
Hide file tree
Showing 39 changed files with 3,234 additions and 0 deletions.
7 changes: 7 additions & 0 deletions LICENSE
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
Copyright 2017 German Uritskiy

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
154 changes: 154 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,154 @@
## Introducing metaWRAP v=0.1 - a Comprehensive Metagenome Analysis Pipeline for Beginners


metaWRAP aims to be an easy-to-use inclusive wrapper program that accomplishes the most basic tasks in metagenomic analysis: QC, assembly, binning, visualization, and taxonomic profiling. While there is no single best approach for processing metagenomic data, metaWRAP is meant to be a fast and simple first pass program before you delve deeper into parameterization of your approach.

There are 5 major parts to the pipeline:

1) Read QC

2) Assembly

3) Binning

4) Taxonomic profiling

5) Visualization with Blobplots


![Detailed pipeline walkthrough](http://i.imgur.com/VZmKuoZ.png)


## INSTALATION

The core of metaWRAP is a bash script, so the program does not need any installation. Just download or clone the repository into your desired location. Just know that the core scripts need to be in the same location as the config.sh file, which contains all the paths to databases that metaWRAP uses. Once you have the scripts, go into the config.sh file, and edit all the paths to their correct locations. Have a look at the “DEPENDENCIES” section for more detail. Once everything is configured, you can run metaWRAP or any of the individual module scripts to display the help message.


## DEPENDENCIES

Since this is a wrapper program, the biggest challenge in installing metaWRAP will likely be configuring all the dependencies correctly. Firstly, the path to the folder "meta-scripts", containing numerous scripts required for running this pipeline, needs to be configured in config.sh. Next, the following programs need to be installed in your PATH. NOTE: the versions of the programs may or may not be important.

1) BLAST (tested on v=2.6.0)
2) bmtagger (tested on v=3.101)
3) Bowtie2 (tested on v=2.3.2)
4) bwa (tested on v=0.7.15-r1140)
5) Checkm (tested on v=v1.0.7)
6) FastQC (tested on v=v0.11.5)
7) kraken (tested on v=0.10.6)
8) kronatools (tested on v=2.7)
9) megahit (tested on v=1.1.1-2)
10) metabat2 (tested on v=2.9.1)
11) perl (tested on v=5.22.0)
12) python (tested on v=2.7.1)
13) quast (tested on v=4.5)
14) R (tested on v=3.3.2)
15) samtools (tested on v=1.3.1)
16) SPAdes (tested on v=3.10.1)
17) trim_galore (tested on v=0.4.3)

The installation of most of these dependencies should not be difficult even in non-sudo environments with the use of conda. Future versions of metaWRAP are expected to have more detailed installation instructions, but for now you are on your own.


## DATABASES

Finally, you will need to download several databases and configure their paths in the config.sh file. This may be the longest step of the installation. Here is a full list of the databases:

1) Checkm_DB (1.4GB) – CheckM should prompt you to download this when you run it for the first time
2) KRAKEN standard database (161GB) – look at the official KRAKEN support website for download instructions
3) RefSeq NCBI_nt (71GB) – look at the config.sh for download instructions
4) RefSeq NCBI_tax (283MB) – look at the config.sh for download instructions
5) Indexed hg38 (20GB) – look at the bmtagger manual for instructions


## USAGE

Once all the dependencies are in place, running metaWRAP is relatively simple. You can chose to run metaWRAP itself, which will run all of the modules, or run individual modules as you wish.

metaWRAP:
Usage: ./metaWRAP [options] -o output_folder raw_readsA_1.fastq raw_readsA_2.fastq
Options:

-o STR output directory
-t INT number of threads (default=1)
-m INT memory in GB (default=4)
-1 STR forward read file
-2 STR reverse read file


read_qc.sh:
Usage: ./read_qc.sh [options] -1 reads_1.fastq -2 reads_2.fastq -o output_dir
Options:

-1 STR forward fastq reads
-2 STR reverse fastq reads
-o STR output directory
-t INT number of threads


assembly.sh:
Usage: ./assembly.sh [options] -1 reads_1.fastq -2 reads_2.fastq -o output_dir -t THREADS -m MEMORY
Options:

-1 STR forward fastq reads
-2 STR reverse fastq reads
-o STR output directory
-m INT memory in GB
-t INT number of threads


blobology.sh:
Usage: ./blobology.sh [options] -a assembly.fasta-1 reads_1.fastq -2 reads_2.fastq -o output_dir
Options:

-a STR assembly fasta file
-1 STR forward fastq reads
-2 STR reverse fastq reads
-o STR output directory
-t INT number of threads
-n INT number of contigs to plot (default=ALL)


binning.sh:
Usage: ./binning.sh [options] -a assembly.fa -o output_dir readsA_1.fastq readsA_2.fastq ... [readsX_1.fastq readsX_2.fastq]
Options:

-a STR metagenomic assembly
-o STR output directory
-t INT number of threads (default=1)
-m INT memory in GB (default=4)


kraken.sh:
./kraken.sh [options] -o output_dir assembly.fasta reads_1.fastq reads_2.fastq ...
Options:

-o STR output directory
-t INT number of threads
-d INT read sampling depth (default=all)




Here are examples of running metaWRAP and its sub-modules:

./metaWRAP –t 24 –m 120 -o metawrap_out -1 sample_1.fastq -2 sample_2.fastq
./read_qc.sh –t 100 -1 reads_1.fastq -2 reads_2.fastq -o output_dir
./assembly.sh [options] -1 reads_1.fastq -2 reads_2.fastq -o assembly_dir -t 120 -m 800
./blobology.sh –t 10-m 20 -a final_assembly.fasta-1 reads_1.fastq -2 reads_2.fastq –o blobology_out
./binning.sh –t 48 –m 100 -a assembly.fa –o binning_out sampleA_1.fastq sampleA_2.fastq sampleX_1.fastq sampleX_2.fastq
./kraken.sh –t 24 –o kraken_out assembly.fasta reads_1.fastq reads_2.fastq



### System requirements
The resource requirements for this pipeline will vary greatly based on the number of reads you are processing, but I would advise against attempting to run it on anything less than 10 cores and 100GB RAM. With the help of conda, installing metaWRAP and its dependencies on a cluster (even without sudo privileges) should be relatively easy even for beginners.


### Acknowledgements
Author of pipeline: German Uritskiy.
Principal Investigators: [James Taylor](http://bio.jhu.edu/directory/james-taylor/) and [Jocelyne DiRuggiero](http://bio.jhu.edu/directory/jocelyne-diruggiero/)
Institution: Johns Hopkins, [Department of Cell, Molecular, Developmental Biology, and Biophysics](http://cmdb.jhu.edu/)

I do not claim to have any authorship of the many programs this pipeline uses. For questions, bugs, and suggestions, contact me at guritsk1@jhu.edu, or leave a comment on this github page.


185 changes: 185 additions & 0 deletions assembly.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,185 @@
#!/bin/bash -l

##############################################################################################################################################################
#
# This script is meant to be a comprehensive solution for producing the best metagenomic assembly given paired end reads from one or more samples.
# Ideally it should take in fully QC'd reads. First, the reads are assembled with metaSPAdes3.10, then all the reads that did not map back to the
# contigs are re-assembled with MEGAHIT (which works better on lower coverage contigs. The resulting assemblies are combined, sorted, and short
# contigs are removed. The finall assembly is then QCed by QUAST.
#
# Author of pipeline: German Uritskiy. I do not have any authorship of the many programs this pipeline uses.
# For questions, bugs, and suggestions, contact me at guritsk1@jhu.edu.
#
##############################################################################################################################################################



help_message () {
echo "Usage: ./assembly.sh [options] -1 reads_1.fastq -2 reads_2.fastq -o output_dir -t THREADS -m MEMORY"
echo "Options:"
echo ""
echo " -1 STR forward fastq reads"
echo " -2 STR reverse fastq reads"
echo " -o STR output directory"
echo " -m INT memory in GB"
echo " -t INT number of threads"
echo "";}
# function to print out error messages
error () {
echo ""; echo "************************************************************************************************************************"
echo "***************************************** ERROR! **************************************************"
echo $1
echo "************************************************************************************************************************"; echo ""; exit 1; }
# function to print out warning messages
warning () {
echo "************************************************************************************************************************"
echo "***************************************** WARNING! **************************************************"
echo " "; echo $1; echo " "
echo "************************************************************************************************************************";}
# funciton to print out comments throught the pipeline
comm () {
echo ""; echo "----------------------------------------------------------------------------------------------"
echo $1
echo "----------------------------------------------------------------------------------------------"; echo ""; }



########################################################################################################
######################## LOADING IN THE PARAMETERS ########################
########################################################################################################


# setting scripts and databases from config file (should be in same folder as main script)
source ${0%/*}/config.sh


threads=1; out="false"; read_1="false"; read_2="false"
# Load in options
while getopts ht:m:o:1:2: option; do
case "${option}" in
h) help_message; exit 1;;
t) threads=${OPTARG};;
m) mem=$OPTARG;;
o) out=${OPTARG};;
1) reads_1=$OPTARG;;
2) reads_2=$OPTARG;;
esac
done




########################################################################################################
######################## MAKING SURE EVERYTHING IS SET UP ########################
########################################################################################################

# check if all parameters are entered
if [ "$out" = "false" ] || [ "$reads_1" = "false" ] || [ "$reads_2" = "false" ] || [ "$threads" = "false" ] || [ "$mem" = "false" ]; then
help_message; exit 1
fi

# Checks for correctly configures meta-scripts folder
if [ ! -s $SOFT/sort_contigs.py ]; then
error "The folder $SOFT doesnt exist. Please make sure the meta-scripts path is configured correctly in config.sh file"
fi


########################################################################################################
######################## BEGIN PIPELINE! ########################
########################################################################################################





echo " "
echo "########################################################################################################"
echo "######################## ASSEMBLING WITH METASPADES ########################"
echo "########################################################################################################"
echo " "

mkdir $out

comm "Using reads $reads_1 and $reads_2 for assembly,"
if [ ! -f "$reads_1" ] || [ ! -f "$reads_2" ]; then error "Read files $reads_1 and/or $reads_2 dont exist. Exiting."; fi

metaspades.py --tmp-dir ${out}/metaspades.tmp -t $threads -m $mem -o ${out}/metaspades -1 $reads_1 -2 $reads_2
rm -r ${out}/metaspades.tmp
if [ ! -f "${out}/metaspades/scaffolds.fasta" ]; then error "Something went wrong with metaSPAdes assembly. Exiting."; fi



echo " "
echo "########################################################################################################"
echo "######################## SORTING OUT UNASSEMBLED READS ########################"
echo "########################################################################################################"
echo " "

#take only the scaffolds over 1.5kb (the minimum needed for binning with metaBAT):
${SOFT}/rm_short_contigs.py 1500 ${out}/metaspades/scaffolds.fasta > ${out}/metaspades/long_scaffolds.fasta
if [ ! -s ${out}/metaspades/long_scaffolds.fasta ]; then
error "metaSPAdes produced no contigs over 1.5kb, so the rest of the pipeline wont work. Exiting."; fi


bwa index ${out}/metaspades/long_scaffolds.fasta

#sort out and store reads that dont map back to the assembly:
bwa mem -t $threads ${out}/metaspades/long_scaffolds.fasta $reads_1 $reads_2 | grep -v NM:i: > ${out}/unused_by_metaspades.sam
${SOFT}/sam_to_fastq.py ${out}/unused_by_metaspades.sam > ${out}/unused_by_metaspades.fastq
#rm ${out}/unused_by_metaspades.sam

if [[ ! -s ${out}/unused_by_metaspades.fastq ]]; then error "Something went wrong with pulling out unassembled reads. Exiting."; fi



echo " "
echo "########################################################################################################"
echo "######################## ASSEMBLING REMAINDER READS WITH MEGAHIT ########################"
echo "########################################################################################################"
echo " "

rm -r ${out}/megahit
megahit -r ${out}/unused_by_metaspades.fastq -o ${out}/megahit -t $threads -m ${mem}000000000
if [ ! -f "${out}/megahit/final.contigs.fa" ]; then error "Something went wrong with reassembling with Megahit. Exiting."; fi
mv ${out}/unused_by_metaspades.fastq ${out}/metaspades/


echo " "
echo "########################################################################################################"
echo "######################## COMBINE AND FORMAT THE TWO ASSEMBLIES ########################"
echo "########################################################################################################"
echo " "

${SOFT}/fix_megahit_contig_naming.py ${out}/megahit/final.contigs.fa > ${out}/megahit/fixed.contigs.fa
${SOFT}/rm_short_contigs.py 500 ${out}/megahit/fixed.contigs.fa > ${out}/megahit/long.contigs.fa

cp ${out}/metaspades/long_scaffolds.fasta ${out}/combined_assembly.fasta
cat ${out}/megahit/long.contigs.fa >> ${out}/combined_assembly.fasta
${SOFT}/sort_contigs.py ${out}/combined_assembly.fasta > ${out}/final_assembly.fasta
if [[ ! -s ${out}/final_assembly.fasta ]]; then error "Something went wrong with joining the assemblies. Exiting."; fi
rm ${out}/combined_assembly.fasta


echo " "
echo "########################################################################################################"
echo "######################## RUNNING ASSEMBLY QC WITH QUAST ########################"
echo "########################################################################################################"
echo " "

cp ${out}/metaspades/scaffolds.fasta ${out}/metaspades/metaspades_assembly.fasta
quast -t $threads -o ${out}/QUAST_out -m 500 ${out}/final_assembly.fasta ${out}/metaspades/metaspades_assembly.fasta
cp ${out}/QUAST_out/report.html ${out}/assembly_report.html

if [[ ! -s ${out}/assembly_report.html ]]; then error "Something went wrong with running QUAST. Exiting."; fi


echo " "
echo "########################################################################################################"
echo "######################## ASSEMBLY PIPELINE COMPLETED SUCCESSFULLY!!! ########################"
echo "########################################################################################################"
echo " "





Loading

0 comments on commit a338737

Please sign in to comment.