---
title: "Metagenomics workflow using RIVM-Toolbox"
subtitle: "Humann3 pipeline"
author: "Sudarshan A. Shetty"
date: "`r date()`"
output:
  html_document:
    toc: yes
    toc_depth: 2
    toc_float: true
editor_options:
  chunk_output_type: console
---
In this workflow, we demonstrate the Humann3 pipeline set up at RIVM using Singularity for whole-genome shotgun (WGS) metagenomics microbiome data. It is part of the RIVM-Toolbox. The Snakemake workflow, from raw reads to taxonomic and functional profiles, was developed by Jeroen Frank from the Bioinformatics department at RIVM.
The source code is located here: https://gitlab.rivm.nl/frankj/humann_pipeline
A copy is forked to the RIVM-Microbiome GitLab account here: https://gitlab.rivm.nl/rivm-microbiome/humann_pipeline
First, create a conda environment. This is a virtual environment that allows for convenient installation and management of different software packages. For an introduction, see this [blog](https://towardsdatascience.com/virtual-environments-for-absolute-beginners-what-is-it-and-how-to-create-one-examples-a48da8982d4b). More information on conda can be found on this [website](https://docs.conda.io/projects/conda/en/latest/).
Conda is installed on individual servers at RIVM. You can log in to the `Wetenschappelijke werkplek`.
# Conda
## Create Conda Env
```{bash eval=F}
# Humann3 pipeline from frankj: create an environment with snakemake
conda create --name humann3_frankj -c bioconda snakemake
```
Before the environment is activated, your terminal prompt will look like this:
`(base) [shettys@rivm-biohn-l05p humann_pipeline-master]$`
## Activate Conda Env
```{bash eval=F}
conda activate humann3_frankj
```
After activation, the prompt changes to:
`(humann3_frankj) [shettys@rivm-biohn-l05p humann_pipeline-master]$`
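To confirm that the environment is active and Snakemake is available, you can run a quick sanity check (not part of the original instructions):
```{bash eval=F}
# confirm snakemake resolves from the newly activated environment
which snakemake
snakemake --version
```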
# Project setup
```{bash eval=F}
# go to your working directory
cd /data/BioGrid/username/projects
# make a new directory
mkdir my_wgs_project
cd my_wgs_project
```
Now download the raw files into a folder called `my_input_fqs` within the `my_wgs_project` folder. Also download the humann_pipeline from https://gitlab.rivm.nl/frankj/humann_pipeline and unzip it.
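A minimal sketch of these steps is shown below; the archive URL is an assumption based on GitLab's standard archive layout, so adjust it if the repository is organised differently:
```{bash eval=F}
# inside my_wgs_project: create the input folder and copy your raw reads into it
mkdir my_input_fqs
# cp /path/to/raw/*.fastq.gz my_input_fqs/
# download and unzip the pipeline (archive URL assumed from GitLab conventions)
wget https://gitlab.rivm.nl/frankj/humann_pipeline/-/archive/master/humann_pipeline-master.zip
unzip humann_pipeline-master.zip
```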
```{bash eval=F}
# The folder structure looks something like this
my_wgs_project
-- my_input_fqs/
   -- myR1.fastq.gz
   -- myR2.fastq.gz
-- humann_pipeline-master/
   -- config            # config file where parameters can be changed
   -- data              # not applicable here
   -- envs              # software environment
   -- humann.py         # pipeline main function
   -- log.txt           # logs
   -- README.md         # info
   -- run_pipeline.sh   # internal
   -- scripts           # internal
   -- smk_profiles      # internal
   -- snakefiles        # internal
```
Go to the `humann_pipeline-master` folder.
```{bash eval=F}
cd humann_pipeline-master/
```
We can now submit our analysis (job) to the HPC with the following command.
# Run Pipeline HPC
```{bash eval=F}
bsub -J mytestWGS -q bio -n 1 -o log.txt python humann.py --fq_dir ../my_input_fqs/ --out_dir ../my_input_fqs/mytestWGS_Output
# Notes:
# HPC (LSF) specific:
# -J  name of the job
# -q  submit to the bio queue; there is also a bio-prio queue for priority jobs
# -n  number of cores (here one)
# -o  log file for the cluster job
# Pipeline specific:
# --fq_dir   location of the fastq.gz files; here ../my_input_fqs/, one folder above humann_pipeline-master
# --out_dir  location for storing all output from the pipeline
```
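While the job runs, you can monitor it with standard LSF commands (a quick sketch assuming a default LSF setup):
```{bash eval=F}
# list your jobs, filtered by the name given with -J
bjobs -J mytestWGS
# follow the cluster log written via -o
tail -f log.txt
```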
More information on the HPC at RIVM:
- https://gitlab.rivm.nl/bioinformatics/docs/wikis/HPC/hosts_and_queues
- https://gitlab.rivm.nl/bioinformatics/docs
**Introduction**
The HUMAnN 3.0 pipeline performs the following tasks, using these software tools:
- Sequence data quality control reports - [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
- Read length and quality filtering, adapter trimming - [BBDuk (BBMap suite)](https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/)
- Contaminant (human) sequence removal - [KneadData](https://huttenhower.sph.harvard.edu/kneaddata/)
- Estimate microbial composition - [MetaPhlAn 3.0](https://huttenhower.sph.harvard.edu/metaphlan/)
- Estimate microbial metabolic pathway abundance - [HUMAnN](https://huttenhower.sph.harvard.edu/human)
**Input data**
A folder containing paired-end metagenome sequencing data.
**Output data**
The main results are stored in the `results/` folder and include:
- metaphlan_bugs_list_merged.tsv
- [humann_genefamilies_KO_cpm_join.tsv](https://github.com/biobakery/humann#1-gene-families-file): UniRef90 converted to KO, counts normalised to CPM, tables joined
- [humann_pathabundance_cpm_join.tsv](https://github.com/biobakery/humann#2-pathway-abundance-file): counts normalised to CPM, tables joined
- [humann_pathcoverage_join.tsv](https://github.com/biobakery/humann#3-pathway-coverage-file): tables joined
Intermediate and temporary data are automatically removed.
The output folder will have the following structure:
```{bash eval=F}
my_wgs_project
-- my_input_fqs/
   -- myR1.fastq.gz
   -- myR2.fastq.gz
   -- mytestWGS_Output/
      -- fastqc/
         -- pre_qc
         -- post-qc
      -- fq_input/    # symbolic links to my_input_fqs
      -- humann/      # per-sample humann output
      -- kneaddata/
      -- logs/
      -- results/
         -- humann_genefamilies_KO_cpm_join.tsv
         -- humann_pathabundance_cpm_join.tsv
         -- humann_pathcoverage_join.tsv
         -- humann_genefamilies_KO_join.tsv
         -- humann_pathabundance_join.tsv
         -- metaphlan_bugs_list_merged.tsv
```
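To quickly verify that the final tables were produced before moving to R, you can inspect the `results/` folder from the command line (paths assume the example layout above):
```{bash eval=F}
# list the joined result tables
ls -lh ../my_input_fqs/mytestWGS_Output/results/
# peek at the first lines of the merged MetaPhlAn table
head -n 3 ../my_input_fqs/mytestWGS_Output/results/metaphlan_bugs_list_merged.tsv
```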
The `metaphlan_bugs_list_merged.tsv` file can be read into R using the `biomeUtils::readMergedMetaphlan()` function.
# Hand-off to RIVM-toolbox
```{r eval=FALSE}
# load biomeUtils (part of the RIVM-Toolbox)
library(biomeUtils)
path_to_file <- "my_wgs_project/my_input_fqs/mytestWGS_Output/results/metaphlan_bugs_list_merged.tsv"
# read the merged table; strip the pipeline suffix from the sample names
mpa.ps.lab8 <- readMergedMetaphlan(input_file_path = path_to_file,
                                   find_sample_name_pattern = "_kneaddata_concat_metaphlan_bugs_list",
                                   replace_sample_name_pattern = "")
```
**For more information contact:**
Sudarshan A. Shetty: sudarshan.shetty[@]rivm[dot]nl
Susana Fuentes: susana.fuentes[@]rivm[dot]nl