-
Notifications
You must be signed in to change notification settings - Fork 11
/
Copy pathgenome-mapping.qmd
473 lines (304 loc) · 19.1 KB
/
genome-mapping.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
---
title: Genome Mapping
author: Alexander Herbig, Alina Hiß, and Teresa Zeibig
bibliography: assets/references/genome-mapping.bib
---
:::{.callout-note collapse="true" title="Self guided: chapter environment setup"}
For this chapter's exercises, if not already performed, you will need to download the chapter's dataset, decompress the archive, and create and activate the conda environment.
Do this, use `wget` or right click and save to download this Zenodo archive: [10.5281/zenodo.13759321](https://doi.org/10.5281/zenodo.13759321), and unpack.
```bash
tar xvf genome-mapping.tar.gz
cd genome-mapping/
```
You can then create the subsequently activate environment with the following.
```bash
conda env create -f genome-mapping.yml
conda activate genome-mapping
```
:::
Mapping/aligning to a reference genome is one way of reconstructing genomic information from DNA sequencing reads.
This allows for identification of differences between the genome from your sample and the reference genome.
This information can be used for example for comparative analyses such as in phylogenetics. For a detailed explanation of the read alignment problem and an overview of concepts for solving it, please see [@Reinert2015].
In this session we will map two samples to the _Yersinia pestis_ (plague) genome using different parameter sets.
We will do this "manually" in the sense that we will use all necessary commands one by one in the terminal.
These commands usually run in the background when you apply DNA sequencing data processing pipelines.
We will be using the Burrows-Wheeler Aligner [@Li2010, [http://bio-bwa.sourceforge.net](http://bio-bwa.sourceforge.net)].
There are different algorithms implemented for different types of data (e.g. different read lengths).
Here, we use BWA backtrack (`bwa aln`), which is suitable for Illumina sequences up to 100bp.
Other algorithms are `bwa mem` and `bwa sw` for longer reads.
Your learning objectives:
1. **Understand the Basics**: You will be able to define mapping and describe the basic principles of metagenomic mapping and the different parameters used.
2. **Apply Mapping Techniques**: You will be able to apply metagenomic mapping techniques to align raw sequence data to a reference genome in a step-by-step manner.
3. **Use Bioinformatics Tools**: You will be able to use the command line to apply different metagenomic mappers and perform genotype analysis via multivcfanalyzer in the standard settings. You will be able to inspect results in the IGV viewer.
4. **Interpret Results**: You will be able to interpret the results of a mapping experiment and discuss their implications. You will also be able to understand the genotyping tool multiVCFanalycer.
5. **Be Aware and Able to Read Up**: You will know about the existence of multiple mapping algorithms and the importance of parameter research and adjustment. You will know that the IGV viewer is one option to inspect mapping results but not the only one.
## Reference Genome
For mapping we need a reference genome in FASTA format. Ideally we use a genome from the same species that our data relates to or, if not available, a closely related species.
The selection of the correct reference genome is highly relevant. E.g. if the chosen genome differs too much from the organism the data relates to, it might not be possible to map most of the reads.
Reference genomes can be retrieved from comprehensive databases such provided by the NCBI ([https://www.ncbi.nlm.nih.gov/](https://www.ncbi.nlm.nih.gov/)).
In your directory, you can find 2 samples and your reference.
As a first step we will index our reference genome (make sure you are inside your directory).
The first index we will generate is for `bwa`.
```bash
bwa index YpestisCO92.fa
```
The second index will be used by the genome browser we will apply to our results later on.
```bash
samtools faidx YpestisCO92.fa
```
We need to build a third index that is necessary for the genotyping step, which comes later after mapping.
```bash
picard CreateSequenceDictionary R=YpestisCO92.fa
```
## Mapping Parameters
We will be using `bwa aln`, but we need to specify parameters.
For now we will concentrate on the "seed length" and the "maximum edit distance". We will use the default setting for all other parameters during this session. The choice of the right parameters depend on many factors such as the type of data and the specific use case. One aspect is the mapping sensitivity, i.e. how different a read can be from the chosen reference and still be mapped. In this context we generally differentiate between _strict_ and _lenient_ mapping parameters.
As many other mapping algorithms `bwa` uses a so-called "seed-and-extend" approach. I.e. it initially maps the first _N_ nucleotides of each read to the genome with relatively few mismatches and thereby determines candidate positions for the more time-intensive full alignment.
A short seed length will generate more such candidate positions and therefore mapping will take longer, but it will also be more sensitive, i.e. there can be more differences between the read and the genome. Long seeds are less sensitive but the mapping procedure is faster.
In this session we will use the following two parameter sets:
**Lenient**
Allow for more mismatches → `-n 0.01`
Short seed length → `-l 16`
**Strict**
Allow for less mismatches → `-n 0.1`
Long seed length → `-l 32`
We will be working with pre-processed files (`sample1.fastq.gz`, `sample2.fastq.gz`), i.e. any quality filtering and removal of sequencing adapters is already done.
We will map each file once with lenient and once with strict parameters.
For this, we will make 4 separate directories, to avoid mixing up files.
```bash
mkdir sample1_lenient sample2_lenient sample1_strict sample2_strict
```
## Mapping Sample1
Let’s begin with a lenient mapping of sample1.
Go into the corresponding folder.
```bash
cd sample1_lenient
```
Perform the `bwa` alignment, here for sample1, and specify lenient mapping parameters.
```bash
bwa aln -n 0.01 -l 16 ../YpestisCO92.fa ../sample1.fastq.gz > reads_file.sai
```
Proceed with writing the mapping in `sam` format [@Li2009, [https://en.wikipedia.org/wiki/SAM\_(file_format)](<https://en.wikipedia.org/wiki/SAM_(file_format)>)].
```bash
bwa samse -r '@RG\tID:all\tLB:NA\tPL:illumina\tPU:NA\tSM:NA' ../YpestisCO92.fa reads_file.sai ../sample1.fastq.gz > reads_mapped.sam
```
Note that we have specified the sequencing platform (Illumina) by creating a so-called "Read Group" (`-r`).
This information is used later during the genotyping step.
Convert SAM file to binary format (BAM file).
```bash
samtools view -b -S reads_mapped.sam > reads_mapped.bam
```
For processing of `sam` and `bam` files we use `samtools` [@Li2009, [https://github.com/samtools/samtools](https://github.com/samtools/samtools)].
`-b` specifies to output in BAM format.
(`-S` specifies input is SAM, can be omitted in recent versions).
Now we sort the `bam` file → Sort alignments by leftmost coordinates.
```bash
samtools sort reads_mapped.bam > reads_mapped_sorted.bam
```
The sorted bam file needs to be indexed → more efficient for further processing.
```bash
samtools index reads_mapped_sorted.bam
```
Deduplication → Removal of reads from duplicated fragments.
```bash
samtools rmdup -s reads_mapped_sorted.bam reads_mapped_sorted_dedup.bam
```
```bash
samtools index reads_mapped_sorted_dedup.bam
```
Duplicated reads are usually a consequence of amplification of the DNA fragments in the lab. Therefore, they are not biologically meaningful.
We have now completed the mapping procedure.
Let's have a look at our mapping results.
```bash
samtools view reads_mapped_sorted_dedup.bam | less -S
```
(exit by pressing <kbd>q</kbd>)
We can also get a summary about the number of mapped reads. For this we use the `samtools idxstats` command ([http://www.htslib.org/doc/samtools-idxstats.html](http://www.htslib.org/doc/samtools-idxstats.html)).
```bash
samtools idxstats reads_mapped_sorted_dedup.bam
```
## Genotyping
The next step we need to perform is genotyping, i.e. the identification of all SNPs that differentiate the sample from the reference.
For this we use the 'Genome Analysis Toolkit' (`gatk`) [@DePristo2011, [http://www.broadinstitute.org/gatk/](http://www.broadinstitute.org/gatk/)]
It uses the reference genome and the mapping as input and produces an output in 'Variant Call Format (VCF)' ([https://en.wikipedia.org/wiki/Variant_Call_Format](https://en.wikipedia.org/wiki/Variant_Call_Format)).
Perform genotyping on the mapping file.
```bash
gatk3 -T UnifiedGenotyper -R ../YpestisCO92.fa -I reads_mapped_sorted_dedup.bam --output_mode EMIT_ALL_SITES -o mysnps.vcf
```
Let’s have a look…
```bash
cat mysnps.vcf | less -S
```
(exit by pressing <kbd>q</kbd>)
## Mapping and Genotyping for the other Samples/Parameters
Let's now continue with mapping and genotyping for the other samples and parameter settings.
### Sample2 Lenient
::: {.callout-note}
This is a larger file and lenient mapping takes longer so this file will likely take a few minutes. If you are short on time, proceed with the other sample/parameter settings first and come back to this later if there is time.
The entire code block can be copied as it is and executed. It’s composed of all the steps we executed individually earlier.
:::
```bash
cd ..
cd sample2_lenient
bwa aln -n 0.01 -l 16 ../YpestisCO92.fa ../sample2.fastq.gz > reads_file.sai
bwa samse -r '@RG\tID:all\tLB:NA\tPL:illumina\tPU:NA\tSM:NA' ../YpestisCO92.fa reads_file.sai ../sample2.fastq.gz > reads_mapped.sam
samtools view -b -S reads_mapped.sam > reads_mapped.bam
samtools sort reads_mapped.bam > reads_mapped_sorted.bam
samtools index reads_mapped_sorted.bam
samtools rmdup -s reads_mapped_sorted.bam reads_mapped_sorted_dedup.bam
samtools index reads_mapped_sorted_dedup.bam
gatk3 -T UnifiedGenotyper -R ../YpestisCO92.fa -I reads_mapped_sorted_dedup.bam --output_mode EMIT_ALL_SITES -o mysnps.vcf
```
### Sample1 Strict
```bash
cd ..
cd sample1_strict
bwa aln -n 0.1 -l 32 ../YpestisCO92.fa ../sample1.fastq.gz > reads_file.sai
bwa samse -r '@RG\tID:all\tLB:NA\tPL:illumina\tPU:NA\tSM:NA' ../YpestisCO92.fa reads_file.sai ../sample1.fastq.gz > reads_mapped.sam
samtools view -b -S reads_mapped.sam > reads_mapped.bam
samtools sort reads_mapped.bam > reads_mapped_sorted.bam
samtools index reads_mapped_sorted.bam
samtools rmdup -s reads_mapped_sorted.bam reads_mapped_sorted_dedup.bam
samtools index reads_mapped_sorted_dedup.bam
gatk3 -T UnifiedGenotyper -R ../YpestisCO92.fa -I reads_mapped_sorted_dedup.bam --output_mode EMIT_ALL_SITES -o mysnps.vcf
```
### Sample2 Strict
```bash
cd ..
cd sample2_strict
bwa aln -n 0.1 -l 32 ../YpestisCO92.fa ../sample2.fastq.gz > reads_file.sai
bwa samse -r '@RG\tID:all\tLB:NA\tPL:illumina\tPU:NA\tSM:NA' ../YpestisCO92.fa reads_file.sai ../sample2.fastq.gz > reads_mapped.sam
samtools view -b -S reads_mapped.sam > reads_mapped.bam
samtools sort reads_mapped.bam > reads_mapped_sorted.bam
samtools index reads_mapped_sorted.bam
samtools rmdup -s reads_mapped_sorted.bam reads_mapped_sorted_dedup.bam
samtools index reads_mapped_sorted_dedup.bam
gatk3 -T UnifiedGenotyper -R ../YpestisCO92.fa -I reads_mapped_sorted_dedup.bam --output_mode EMIT_ALL_SITES -o mysnps.vcf
```
## Comparing Genotypes
In order to combine the results from multiple samples and parameter settings we need to agregate and comparatively analyse the information from all the _vcf_ files.
For this we will use the software
`multivcfanalyzer` [@Bos2014-xe, [https://github.com/alexherbig/MultiVCFAnalyzer](https://github.com/alexherbig/MultiVCFAnalyzer)].
It produces various output files and summary statistics and can integrate gene annotations for SNP effect analysis as done by the program `SnpEff` [@Cingolani2012, [https://github.com/pcingola/SnpEff](https://github.com/pcingola/SnpEff)].
Run `multivcfanalyzer` on all 4 files at once.
First `cd` one level up (if you type `ls` you should see your 4 directories, reference, etc.).
```bash
cd ..
```
Then make a new directory.
```bash
mkdir vcf_out
```
And run the programme.
```bash
multivcfanalyzer NA YpestisCO92.fa NA vcf_out F 30 3 0.9 0.9 NA sample1_lenient/mysnps.vcf sample1_strict/mysnps.vcf sample2_lenient/mysnps.vcf sample2_strict/mysnps.vcf
```
Let’s have a look in the `vcf_out/` directory (`cd` into it).
```bash
cd vcf_out
```
Check the parameters we set earlier.
```bash
less -S info.txt
```
(exit by pressing <kbd>q</kbd>)
Check results:
```bash
less -S snpStatistics.tsv
```
(exit by pressing <kbd>q</kbd>)
The file content should look like this.
```bash
SNP statistics for 4 samples.
Quality Threshold: 30.0
Coverage Threshold: 3
Minimum SNP allele frequency: 0.9
sample SNP Calls (all) SNP Calls (het) coverage(fold) coverage(percent)
refCall allPos noCall discardedRefCall discardedVarCall filteredVarCall unhandledGenotype
sample1_lenient 213 0 16.38 92.69
4313387 4653728 293297 46103 728 0 0
sample1_strict 207 0 16.33 92.71
4314060 4653728 293403 45633 425 0 0
sample2_lenient 1274 0 9.01 83.69
3893600 4653728 453550 297471 7829 0 4
sample2_strict 1218 0 8.94 83.76
3896970 4653728 455450 295275 4815 0 0
```
First we find the most important parameter settings and then the table of results.
The first column contains the dataset name and the second column the number of called SNPs. The genome coverage and the fraction of the genome covered with the used threshold can be found in columns 4 and 5, respectively. For example, sample1 had 207 SNP calls with strict parameters. The coverage is about 16-fold and about 93% of the genome are covered 3 fold or higher (The coverage threshold we set was 3).
## Exploring the Results
For visual exploration of mapping results so-called "Genome Browsers" are used.
Here we will use the 'Integrative Genomics Viewer' (`igv`)' ([https://software.broadinstitute.org/software/igv/](https://software.broadinstitute.org/software/igv/)).
To open `igv`, simply type the following command and the app will open.
```bash
igv
```
Note that you cannot use the terminal while `igv` is open. If you want to use it anyways, open a second terminal via the bar on the bottom.
Load your reference (`YpestisCO92.fa`):
_→ Genomes → Load Genome from File_
![](assets/images/chapters/genome-mapping/IGV_load_genome.png)
Load your sorted and dedupp'ed _bam_ files (do this 4 times, once for each mapping):
_→ File → Load from File_
![](assets/images/chapters/genome-mapping/IGV_load_bam.png)
Try to explore the mapping results yourself. Here are some questions to guide you. Please also have a look at the examples below.
::: {.callout-tip title="Question" appearance="simple"}
- What differences do you observe between the samples and parameters?
- Differences in number of mapped reads, coverage, number of SNPs
- Do you see any global patterns?
- Which sample is more affected by changing the parameters?
- Which of the two samples might be ancient, which is modern?
:::
Let’s examine some SNPs.
Have a look at `snpTable.tsv`.
::: {.callout-tip title="Question" appearance="simple"}
Can you identify SNPs that were called with lenient but not with strict parameters or vice versa?
:::
Let’s check out some of these in `igv`.
::: {.callout-tip title="Question" appearance="simple"}
Do you observe certain patterns in these genomic regions?
:::
## Examples
Please find here a few examples for exploration. To get a better visualisation we only loaded `sample2_lenient` (top track) and `sample2_strict` (bottom track):
![](assets/images/chapters/genome-mapping/IGV_example_intro.png)
You can see all aligned reads in the current genomic region as stacks of grey arrows. In the middle of the image you see brown dashes in all of the reads. This is a SNP.
You also see sporadically green or red dashes in some reads but not all of them at a given position. These sporadic differences are DNA damage such as we typically find it for ancient DNA.
For jumping to a specific coordinate you need to enter it into the coordinate field at the top:
![](assets/images/chapters/genome-mapping/IGV_coordinate_example.png)
E.g. if you enter `12326942` after the colon in the coordinate field and hit enter, you will jump to the same position as in the screenshot above.
Let's have a look at some positions.
For example position `36472`:
![](assets/images/chapters/genome-mapping/IGV_SNP_36472.png)
In the middle of the image you see a SNP (`T`) that was called with strict parameters (bottom) but not with lenient parameters (top). But why would it not be called in the top track? It is not called because there are three reads that cover the same position, but do not contain the `T`. We can see that these reads have other difference to the reference at other positions. That's why they are not mapped with strict parameters. It is quite likely that they originate from a different species. This example demonstrates that sensitive mapping parameters might actually lead to a loss of certain SNP calls.
Does this mean that stricter parameters will always give us a clean mapping? Let's have a look at position `219200`:
![](assets/images/chapters/genome-mapping/IGV_SNP_219200.png)
You might need to zoom out a bit using the slider in the upper right corner.
So, what is going on here? We see a lot of variation in most of the reads. This is reduced a bit with strict mapping parameters (bottom track) but the effect is still quite pronounced. Here, we see a region that seems to be conserved in other species as well, so we have a lot of mapping from other organisms. We can't compensate that with stricter mapping parameters and we would have to apply some filtering on genotype level to remove this variation from our genotyping. Removing false positive SNP calls is important as it would interfere with downstream analyses such as phylogenomics.
Such regions can be fairly large. For example, see this 20 kb region around position `224750`:
![](assets/images/chapters/genome-mapping/IGV_SNP_224750.png)
## (Optional) clean-up
Let's clean up our working directory by removing all the data and output from this chapter.
The command below will remove the `/<PATH>/<TO>/genome-mapping` _as well as all of its contents_.
::: {.callout-tip}
## Pro Tip
Always be VERY careful when using `rm -r`.
Check 3x that the path you are specifying is exactly what you want to delete and nothing more before pressing ENTER!
:::
```bash
rm -r /<PATH>/<TO>/genome-mapping*
```
Once deleted we can move elsewhere (e.g. `cd ~`).
We can also get out of the `conda` environment with.
```bash
conda deactivate
```
Then to delete the conda environment.
```bash
conda remove --name genome-mapping --all -y
```
## Summary
- Mapping DNA sequencing reads to a reference genome is a complex procedure that requires multiple steps
- Mapping results are the basis for genotyping, i.e. the detection of differences to the reference
- The genotyping results can be aggregated from multiple samples and comparatively analysed e.g. in the context of phylogenomics
- The chosen mapping parameters can have a strong influence on the results of any downstream analysis
- This is particularly true when dealing with ancient DNA samples as they tend to contain DNA from multiple organisms.This can lead to mismapped reads and therefore incorrect genotypes, which can further influence downstream analyses
## References