diff --git a/docs/lesson04-taxonomic-annotation/01-community-structure.qmd b/docs/lesson04-taxonomic-annotation/01-community-structure.qmd index fa0cacc..fc6b085 100644 --- a/docs/lesson04-taxonomic-annotation/01-community-structure.qmd +++ b/docs/lesson04-taxonomic-annotation/01-community-structure.qmd @@ -2,602 +2,34 @@ title: "Extracting a Community Profile" --- -In the [previous episode](/docs/lesson03-qc-assembly/03-assembly.qmd) we generated a draft assembly using Flye from our long read Nanopore data. +In the [previous episode](/docs/lesson03-qc-assembly/03-rna-filtering.qmd) we filtered our samples to separate out rRNA and mRNA sequences, however for our taxonomic analysis we shall be using our unfiltered data. -Long reads can span regions which make them difficult to assemble with short reads such as regions with large repeats. Despite this, some long reads will be misassembled. In addition, the base accuracy of long reads is lower than that of short reads and some bases will be incorrectly assigned. Consequently it is common to correct these errors by "polishing" an assembly. We will use two polishing strategies: +We shall be using a tool called Metaphlan, which aligns sequences with reference genomes in order to estimate which taxon those sequences originate from. Because rRNA is often used as a taxonomic marker, we will want to use our unfiltered data. The filtered data will be used instead for our functional analysis next lesson. -1. Polishing with long reads using [Medaka](https://github.com/nanoporetech/medaka). This maps the raw long reads to the assembly to identify contigs that have been incorrectly joined. -2. Polishing with short reads using [Pilon](https://github.com/broadinstitute/pilon) which uses the more accurate short read data to correct incorrectly called bases in the assembly. More detail on the advantages and disadvantages of short and long read sequencing is covered in our [Statistically useful experimental design](https://cloud-span.github.io/experimental_design00-overview/) workshop in [Platform choice](https://cloud-span.github.io/experimental_design01-principles/01-platform/index.html). +### Taxonomic analysis methods -### Polishing with short reads +Often a metatranscriptomic analysis will also include data on the community structure of the microbiome being analysed. This data can be obtained in a few different ways: - +(For more on different methodologies see the [introduction to lesson 3](/docs/lesson03-qc-assembly/01-introduction-meta.qmd)). -![Diagram showing overlap of reads for polishing.](/images/04_polishing_diagram_v1.png){width="525" height="307" fig-alt="Diagram showing overlap of reads for polishing."} +### What is Metaphlan? -In the diagram, the long read assembly is shown at the top. The four short reads shown below the assembly have been aligned to it. The third position in the assembly is `A` but the three short reads that cover this region contain a `T`. The assembly probably contains a miscalled base but it can be corrected - or polished - at this position with the higher accuracy short read information. Typically short read polishing would be carried out three times with the base accuracy increasing each time. However, to reduce the compute requirements and the time required to finish the assembly, we will be performing it just once. - -## Why bother polishing? - -How important polishing is to your analysis will depend on what you need it for. Usually we generate metagenome assemblies so we can compare the sequences to a database and find out what taxa they belong to. - -You might **NOT** need to polish your assembly if: - -- you only need to taxa to the genus level (meaning single incorrect bases are not important) - -You **DO** need to polish your assembly if: - -- you want to identify taxa to the species level (if possible). This is a common requirement since one of the main advantages of whole genome sequencing over amplicon sequencing is that you can assign annotations to the species level. We will cover [Taxonomic annotations](/docs/lesson06-taxonomic-annotations/index.qmd) later in the course. -- you want to generate protein predictions or identify protein structure domains to determine the functionality of metagenomes. This is discussed in more detail in Watson and Warr (2019): [Errors in long-read assemblies can critically affect protein prediction](https://www.nature.com/articles/s41587-018-0004-z). - -## Polishing an assembly with long reads - -First we will polish the draft Flye assembly using the filtered raw long reads. As with the assembly, we need to use polishing software that is especially written for long read raw reads. - -[Medaka](https://github.com/nanoporetech/medaka) is a command line tool built by Oxford Nanopore Technologies which will polish an assembly by generating a consensus from raw Nanopore sequences using a recurrent neural network. - -We will be using one Medaka command, `medaka_consensus`. This pipeline will first align the raw reads to the draft assembly, then process this alignment to generate a pileup. The pileup is presented to a recurrent neural network in order to produce a consensus sequence. - -Medaka is installed on the AWS instance. Let's take a look at the help page for `medaka_consensus`: - -``` {.bash filename="Code"} -medaka_consensus -h -``` - -::: {.callout-note collapse="true"} -## `Output --- medaka_consensus` Help - -``` default -medaka 1.7.2 - -Assembly polishing via neural networks. Medaka is optimized -to work with the Flye assembler. - -medaka_consensus [-h] -i -d - - -h show this help text. - -i fastx input basecalls (required). - -d fasta input assembly (required). - -o output folder (default: medaka). - -g don't fill gaps in consensus with draft sequence. - -r use gap-filling character instead of draft sequence (default: None) - -m medaka model, (default: r941_min_hac_g507). - Choices: r103_fast_g507 r103_hac_g507 r103_min_high_g345 r103_min_high_g360 r103_prom_high_g360 r103_sup_g507 r1041_e82_400bps_fast_g615 r1041_e82_400bps_hac_g615 r1041_e82_400bps_sup_g615 r104_e81_fast_g5015 r104_e81_hac_g5015 r104_e81_sup_g5015 r104_e81_sup_g610 r10_min_high_g303 r10_min_high_g340 r941_e81_fast_g514 r941_e81_hac_g514 r941_e81_sup_g514 r941_min_fast_g303 r941_min_fast_g507 r941_min_hac_g507 r941_min_high_g303 r941_min_high_g330 r941_min_high_g340_rle r941_min_high_g344 r941_min_high_g351 r941_min_high_g360 r941_min_sup_g507 r941_prom_fast_g303 r941_prom_fast_g507 r941_prom_hac_g507 r941_prom_high_g303 r941_prom_high_g330 r941_prom_high_g344 r941_prom_high_g360 r941_prom_high_g4011 r941_prom_sup_g507 r941_sup_plant_g610 - Alternatively a .tar.gz/.hdf file from 'medaka train'. - -f Force overwrite of outputs (default will reuse existing outputs). - -x Force recreation of alignment index. - -t number of threads with which to create features (default: 1). - -b batchsize, controls memory use (default: 100). -``` -::: - -To use Medaka we need to specify certain parameters in the command, like we did when we ran Flye last session. The help page tells us that the basic format for medaka is `medaka_consensus [-h] -i -d 1`, indicating that the `-i` and `-d` flags are mandatory. - -Let's have a look at the flags and options we're going to use: - -| Flag/option | Meaning | Our input | -|-------------------|----------------------------------|-------------------| -| `-i` | Input basecalls (i.e. what we are polishing with) | `-i data/nano_fastq/ERR5000342_sub12_filtered` | -| `-d` | Input assembly (i.e. what is being polished) | `-d results/assembly/assembly.fasta` | -| `-m` | Neural network model to use (described in [the documentation](https://github.com/nanoporetech/medaka#models)) | `-m r941_min_hac_g507` | -| `-o` | Output directory | `-o results/medaka` | -| `-t` | Number of threads | `-t 8` | - -Once again we will run this command in the background and redirect the output to a file. This means we add `&>` to redirect the output and `&` to the very end to make it run in the background. - -Before we start let's make sure we're in the `cs_course` folder and create a directory called `medaka` for our output in the `results` folder. - -``` {.bash filename="Code"} -cd ~/cs_course -mkdir results/medaka -``` - -Now it's time to run Medaka! - -``` {.bash filename="Code"} -medaka_consensus \ --i data/nano_fastq/ERR5000342_sub12_filtered.fastq \ --d results/assembly/assembly.fasta \ --m r941_min_hac_g507 \ --o results/medaka \ --t 8 \ -&> results/medaka/medaka.out & -``` - -We can check the command is running using `jobs`: - -``` {.bash filename="Code"} -jobs -``` - -If it is successfully running you should see an output like: - -``` {.default filename="Output"} -[1]+ Running medaka_consensus -i data/nano_fastq/ERR3152367_sub5_filtered.fastq -d results/assembly/assembly.fasta -m r941_min_hac_g507 -o results/medaka -t 8 &> results/medaka/medaka.out & -``` - -We can also look in the output file (`medaka.out`) to check the progress of the command. In this case Medaka will take about three hours to run. - -``` {.bash filename="Code"} -less results/medaka/medaka.out -``` - -If the Medaka command has been run correctly you will see something like this near the start of the output: - -``` {.default filename="Output"} -Checking program versions -This is medaka 1.7.2 -This is medaka 1.7.2 -Program Version Required Pass -bcftools 1.16-20-g6f4732b 1.11 True -bgzip 1.16-16-g3c6f83f 1.11 True -minimap2 2.24 2.11 True -samtools 1.16.1-33-gbd942a0 1.11 True -tabix 1.16-16-g3c6f83f 1.11 True -Aligning basecalls to draft -Constructing minimap index. -[M::mm_idx_gen::0.515*0.99] collected minimizers -[M::mm_idx_gen::0.648*1.40] sorted minimizers -[M::main::0.877*1.29] loaded/built the index for 146 target sequence(s) -[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 146 -[M::mm_idx_stat::0.910*1.28] distinct minimizers: 2598367 (94.68% are singletons); average occurrences: 1.076; average spacing: 5.350; total length: 14953273 -``` - -::: callout-note -## Help! - -Medaka may give you a warning along the lines of: - -``` {.default filename="Output"} -2023-03-31 12:22:21.572954: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory -2023-03-31 12:22:21.573012: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. -2023-03-31 12:22:24.690162: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Couldnot load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory -2023-03-31 12:22:24.690198: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. -``` - -Don't worry, you can safely ignore this warning wherever it pops up. It is telling us that it couldn't load a library required for parallel computing using GPUs. We are not using a GPU setup and so this warning is irrelevant. -::: - -::: callout-note -## Help! - -If you have entered the command incorrectly you will usually find out quite quickly! An easy way to tell is if you get an output in your terminal that starts with 'Exit' rather than 'Done': - -``` {.default filename="Output"} -[1]+ Exit 1 medaka_consensus -i data/nano_fastq/ERR5000342_sub12_filtered.fastq -d results/assembly/assembly.fasta -m r941_min_hac_g507 -o results/medaka -t 8 & results/medaka/medaka.out -``` - -Another way to tell is by looking inside the `medaka.out` log file. You will get some of the usual text output as Medaka loads its dependencies but once it starts processing the command it will tell you that something is wrong, e.g. - -``` {.default filename="Output"} -Creating fai index file /home/csuser/cs_course/results/assembly/assembly.fasta.fai -[E::fai_build3_core] Failed to open the file /home/csuser/cs_course/results/assembly/assembly.fasta -[faidx] Could not build fai index /home/csuser/cs_course/results/assembly/assembly.fasta.fai -Failed to run alignment of reads to draft. -``` - -Things to check if your command fails: - -- spelling/typos --- have you spelled the command correctly and typed the directory names without error? -- paths --- are all of your absolute/relative paths complete and accurate? (tip: using tab completion can help with peace of mind here, as you can check that your path definitely follows the right trail) -- flags --- have you included all the mandatory flags, including the one (`-`) or two (`--`) dashes that usually accompany them? -- output --- does the output in your log/.out file give you any clues as to where your error lies? e.g. in the example above the line "Failed to open the file /home/csuser/cs_course/results/assembly/assembly.fasta" suggests that this path might be wrong somehow. - -If you really can't find your mistake then the your instructors and/or other participants will be able to help you sort it out, so there is no need to panic! -::: - -Medaka first looks for the other programs that it needs (known as dependencies) and their versions. These dependencies are installed on the AWS instance. Once it confirms they are present, it begins by aligning the raw reads (basecalls) to the assembly using minimap. - -Once Medaka has completed the end of the file (which you can skip to by typing G) will contain something like: +Metaphlan (Metagenomic Phylogenetic Analysis) is one tool from the Biobakery suite of software, which all have bakery related names. Whilst this is a fun naming convention it can be somewhat confusing as to what each tool does. ``` {.bash filename="Code"} -less results/medaka/medaka.out -``` +metaphlan data/trimmedreads/SRR6820491_merged.fastq \ + --input_type fastq \ + --bowtie2db data/chocophlandb/ \ + --bowtie2out analysis/metaphlan/metaphlan_bowtie2.bz2 \ + --nproc 8 \ + --add_viruses \ + --tax_lev 's' \ + --min_cu_len 2000 \ + --read_min_len 70 \ + --stat_q 0.1 \ + --output_file analysis/metaphlan/metaphlan_output.txt \ + --min_ab 0.1 \ + &> analysis/metaphlan/log.txt & -``` {.default filename="Output"} -[20:51:16 - DataIndx] Loaded 1/1 (100.00%) sample files. -[20:51:16 - DataIndx] Loaded 1/1 (100.00%) sample files. -[20:51:16 - DataIndx] Loaded 1/1 (100.00%) sample files. -[20:51:16 - DataIndx] Loaded 1/1 (100.00%) sample files. -[20:51:16 - DataIndx] Loaded 1/1 (100.00%) sample files. -[20:51:16 - DataIndx] Loaded 1/1 (100.00%) sample files. -[20:51:16 - DataIndx] Loaded 1/1 (100.00%) sample files. -[20:51:17 - DataIndx] Loaded 1/1 (100.00%) sample files. -[20:51:17 - DataIndx] Loaded 1/1 (100.00%) sample files. -[20:51:17 - DataIndx] Loaded 1/1 (100.00%) sample files. -Polished assembly written to medaka/consensus.fasta, have a nice day. +##note if you want to re-run metaphlan you'll have to delete the original outputs ``` - -Once Medaka has completed (\~ 3 hours on average) we can take a look at the files it has generated. - -``` {.bash filename="Code"} -cd results/medaka -ls -``` - -``` {.default filename="Output"} -calls_to_draft.bam calls_to_draft.bam.bai consensus.fasta consensus.fasta.gaps_in_draft_coords.bed consensus_probs.hdf -``` - -Medaka has created multiple files: - -- `calls_to_draft.bam` - a BAM file containing the alignment of the raw reads (basecalls) to the draft assembly -- `calls_to_draft.bam.bai` --- an index file of the above BAM file -- `consensus.fasta` --- the consensus sequence, or polished assembly in our case in FASTA format -- `consensus.fasta.gaps_in_draft_coords.bed` --- a BED file containing information about the location of any gaps in the consensus sequence which can be used when visualising the assembly -- `consensus_probs.hdf` --- a file that contains the output of the neural network calculations and is not an output for end-users, so we don't need to worry about this file - -In our case we're interested in the polished assembly, so we want the `consensus.fasta` file. - -::: callout-note -## BAM and SAM Files - -A [SAM file](https://genome.sph.umich.edu/wiki/SAM), is a tab-delimited text file that contains information for each individual read and its alignment to the genome. The paper by [Heng Li et al.](http://bioinformatics.oxfordjournals.org/content/25/16/2078.full) provides the full specification. - -The compressed binary version of SAM is called a BAM file. We use this version to reduce size and to allow for indexing, which enables efficient random access of the data contained within the file. - -The file begins with a header, which is optional. The header describes the source of data, reference sequence, method of alignment etc. - these will change depending on the aligner being used. Following the header is the alignment section. Each line that follows corresponds to alignment information for a single read. There are 11 mandatory fields for essential mapping information and a variable number of other fields for aligner specific information. - -See [Genomics - Variant Calling](https://cloud-span.github.io/04genomics/01-variant_calling/index.html) for a deeper dive. -::: - -## Polishing with short reads - -We will be using the program [Pilon](https://github.com/broadinstitute/pilon) to further polish the draft assembly using the raw short reads. Pilon will improve a draft assembly by filling gaps, fixing misassemblies and correcting bases. You can read more about how it works in the paper [Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0112963). - -Bioinformatics programs are not built equally. Some programs, like Flye or Medaka, require very few input files as they will generate any that they need within the pipeline. Some programs however, require a lot of user input to generate the input files that are needed. - -Pilon is in the latter group of bioinformatics software, so we will need to do some pre-processing using other programs to create some of the inputs needed. - -### Generating the Pilon input files - -We will first use the program [BWA](https://github.com/lh3/bwa) to generate an alignment of the raw short reads against the draft genome in consensus.fasta. The steps will be: - -1. indexing the polished assembly, consensus.fasta with `bwa index`. Indexing allows the aligner to quickly find potential alignment sites for query sequences in a genome, which saves time during alignment. -2. creating a directory for the outputs of Pilon -3. aligning the short reads (the illumina data) to the assembly, consensus.fasta with `bwa mem` -4. converting the short read alignment to a BAM file `samtools view` -5. sorting the short read alignment with `samtools sort` -6. indexing the short read alignment with `samtools index` - -Make sure you are in the `cs_course` folder and use `bwa index` to index the consensus assembly: - -``` {.bash filename="Code"} -cd ~/cs_course -bwa index results/medaka/consensus.fasta -``` - -This should only take a few seconds to complete so we don't need to run the job in the background. Once the indexing is complete you should see an output like: - -``` {.default filename="Output"} -[bwa_index] Pack FASTA... 0.51 sec -[bwa_index] Construct BWT for the packed sequence... -[bwa_index] 5.86 seconds elapse. -[bwa_index] Update BWT... 0.10 sec -[bwa_index] Pack forward-only FASTA... 0.10 sec -[bwa_index] Construct SA from BWT and Occ... 1.81 sec -[main] Version: 0.7.17-r1188 -[main] CMD: bwa index consensus.fasta -[main] Real time: 8.704 sec; CPU: 8.395 sec -``` - -This will generate five additional files in the `medaka` directory with the file extensions `.pac`, `.bwt`, `.ann`, `.amb` and `.sa`. These files are used by BWA in step 3. - -Next we will make a directory in the results directory for our pilon polishing outputs: - -``` {.bash filename="Code"} -mkdir results/pilon -``` - -We will now do steps 3, 4 and 5 in one go by chaining them together with pipes. - -::: callout-note -## Chaining together commands with a pipe - -It is possible to chain together commands in Linux (and Unix) using a command known as "pipe". This allows the output from one command to be directly passed as input to other command without the need for using intermediate files. This is useful when the intermediate file is not needed and keeps your workspace tidy. The pipe command is the character `|` which is obtained with ⇧ Shift + \\ on most keyboards. - -You can use multiple pipes in a single line of commands but data will only go from the left to the right: - -`command1 | command2 | command3 | .... |` -::: - -We will be using two pipes to join three separate steps. First we will align the raw reads to the draft assembly, then convert the output to BAM format, before finally sorting this alignment to generate a sorted BAM file. Chaining the steps together together will only generate one final output file, avoiding the need to generate large intermediary files we don't need again between the other two steps. - -1. we will align the short reads (the illumina data) to the assembly, consensus.fasta with `bwa mem`: `bwa mem -t 8 results/medaka/consensus.fasta data/illumina_fastq/ERR4998593_1.fastq data/illumina_fastq/ERR4998593_2.fastq` -2. convert the short read alignment alignment to a BAM file with `samtools view`: `samtools view - -Sb` -3. sort the short read alignment with `samtools sort`: `samtools sort - -@4 -o pilon/short_read_alignment.bam` - -Here are the various flags/options used in these commands and what they mean: - - - -| Command | Flag/option | Meaning | -|-----------------------------|-------------------|------------------------| -| `bwa mem -t 8 [input assembly] [input short read file(s)]` | -t 8 | Number of threads (8) | -| `samtools view - -Sb` | \- | Take piped output from `bwa mem` as input | -| | -Sb | Convert from SAM to BAM format | -| `samtools sort - -@8 -o [filename]` | \- | Take piped output from `samtools view` as input | -| | \-@8 | Number of threads (8) | -| | -o \[filename\] | Output a file with name \[filename\] | - -This will take around 60 minutes so we will use redirection and `&` to print the output to a file and run the command in the background (just like in the previous lesson). We will also wrap our whole three-fold command in brackets so we run all three steps in the background. - -Add the pipes between these commands and run: - -``` {.bash filename="Code"} -(bwa mem -t 8 results/medaka/consensus.fasta data/illumina_fastq/ERR4998593_1.fastq data/illumina_fastq/ERR4998593_2.fastq | samtools view - -Sb | samtools sort - -@8 -o results/pilon/short_read_alignment.bam) &> results/pilon/alignment.out & -``` - -Once the command is running, you can check the process of this job by looking at the `alignment.out` file. - -``` {.bash filename="Code"} -less results/pilon/alignment.out -``` - -``` {.default filename="Output"} -[M::bwa_idx_load_from_disk] read 0 ALT contigs -[M::process] read 529802 sequences (80000102 bp)... -[M::process] read 529802 sequences (80000102 bp)... -[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (2, 14035, 29, 8) -[M::mem_pestat] skip orientation FF as there are not enough pairs -[M::mem_pestat] analyzing insert size distribution for orientation FR... -[M::mem_pestat] (25, 50, 75) percentile: (333, 416, 538) -[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 948) -[M::mem_pestat] mean and std.dev: (435.53, 162.74) -[M::mem_pestat] low and high boundaries for proper pairs: (1, 1153) -[M::mem_pestat] analyzing insert size distribution for orientation RF... -[M::mem_pestat] (25, 50, 75) percentile: (891, 2304, 6802) -[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 18624) -[M::mem_pestat] mean and std.dev: (3479.45, 2918.28) -``` - -The command should take around 45 minutes to run. Once completed, the end of the `alignment.out` file should contain something like: - -``` {.default filename="Output"} -low and high boundaries for proper pairs: (1, 1222) -[M::mem_pestat] skip orientation RF as there are not enough pairs -[M::mem_pestat] skip orientation RR as there are not enough pairs -[M::mem_process_seqs] Processed 182762 reads in 37.961 CPU sec, 5.397 real sec -[main] Version: 0.7.17-r1188 -[main] CMD: bwa mem -t 8 ~/cs_course/results/medaka/consensus.fasta ~/cs_course/data/illumina_fastq/ERR4998593_1.fastq ~/cs_course/data/illumina_fastq/ERR4998593_2.fastq -[main] Real time: 2207.540 sec; CPU: 14851.383 sec -[bam_sort_core] merging from 7 files and 4 in-memory blocks... -``` - -We have now generated the `short_read_alignment.bam` file - this is a binary file (meaning it's not human readable) so we won't be checking its contents. - -Now carry out step 6, index the alignment: - -``` {.bash filename="Code"} -samtools index results/pilon/short_read_alignment.bam -``` - -::: callout-note -## Something to think about - -Why didn't we include this command in the sequence of pipes in the previous step? - -The answer is that we will need access to the BAM file produced for further analysis. If we included this step as part of a pipe the intermediate BAM file would not be saved. -::: - -This command will take around one or two minutes so we don't need to run it in the background. - -Once your prompt has returned you should also have a file named `short_read_alignment.bam.bai` in your pilon directory, which is the index. - -### Running Pilon - -Now we have generated the necessary input files we can finally run Pilon. - -Pilon is installed on the AWS instance and we can view the help documentation using: - -``` {.bash filename="Code"} -pilon --help -``` - -::: {.callout-note collapse="true"} -## Output --- Pilon help Documentation - -``` default -Pilon version 1.24 Thu Jan 28 13:00:45 2021 -0500 - - Usage: pilon --genome genome.fasta [--frags frags.bam] [--jumps jumps.bam] [--unpaired - unpaired.bam] [...other options...] - pilon --help for option details - - - INPUTS: - --genome genome.fasta - The input genome we are trying to improve, which must be the reference used - for the bam alignments. At least one of --frags or --jumps must also be given. - --frags frags.bam - A bam file consisting of fragment paired-end alignments, aligned to the --genome - argument using bwa or bowtie2. This argument may be specified more than once. - --jumps jumps.bam - A bam file consisting of jump (mate pair) paired-end alignments, aligned to the - --genome argument using bwa or bowtie2. This argument may be specified more than once. - --unpaired unpaired.bam - A bam file consisting of unpaired alignments, aligned to the --genome argument - using bwa or bowtie2. This argument may be specified more than once. - --bam any.bam - A bam file of unknown type; Pilon will scan it and attempt to classify it as one - of the above bam types. - --nanopore ont.bam - A bam file containing Oxford Nanopore read alignments. Experimental. - --pacbio pb.bam - A bam file containing Pacific Biosciences read alignments. Experimental. - OUTPUTS: - --output prefix - Prefix for output files - --outdir directory - Use this directory for all output files. - --changes - If specified, a file listing changes in the .fasta will be generated. - --vcf - If specified, a vcf file will be generated - --vcfqe - If specified, the VCF will contain a QE (quality-weighted evidence) field rather - than the default QP (quality-weighted percentage of evidence) field. - --tracks - This options will cause many track files (*.bed, *.wig) suitable for viewing in - a genome browser to be written. - CONTROL: - --variant - Sets up heuristics for variant calling, as opposed to assembly improvement; - equivalent to "--vcf --fix all,breaks". - --chunksize - Input FASTA elements larger than this will be processed in smaller pieces not to - exceed this size (default 10000000). - --diploid - Sample is from diploid organism; will eventually affect calling of heterozygous SNPs - --fix fixlist - A comma-separated list of categories of issues to try to fix: - "snps": try to fix individual base errors; - "indels": try to fix small indels; - "gaps": try to fill gaps; - "local": try to detect and fix local misassemblies; - "all": all of the above (default); - "bases": shorthand for "snps" and "indels" (for back compatibility); - "none": none of the above; new fasta file will not be written. - The following are experimental fix types: - "amb": fix ambiguous bases in fasta output (to most likely alternative); - "breaks": allow local reassembly to open new gaps (with "local"); - "circles": try to close circular elements when used with long corrected reads; - "novel": assemble novel sequence from unaligned non-jump reads. - --dumpreads - Dump reads for local re-assemblies. - --duplicates - Use reads marked as duplicates in the input BAMs (ignored by default). - --iupac - Output IUPAC ambiguous base codes in the output FASTA file when appropriate. - --nonpf - Use reads which failed sequencer quality filtering (ignored by default). - --targets targetlist - Only process the specified target(s). Targets are comma-separated, and each target - is a fasta element name optionally followed by a base range. - Example: "scaffold00001,scaffold00002:10000-20000" would result in processing all of - scaffold00001 and coordinates 10000-20000 of scaffold00002. - If "targetlist" is the name of a file, each line will be treated as a target - specification. - --verbose - More verbose output. - --debug - Debugging output (implies verbose). - --version - Print version string and exit. - HEURISTICS: - --defaultqual qual - Assumes bases are of this quality if quals are no present in input BAMs (default 10). - --flank nbases - Controls how much of the well-aligned reads will be used; this many bases at each - end of the good reads will be ignored (default 10). - --gapmargin - Closed gaps must be within this number of bases of true size to be closed (100000) - --K - Kmer size used by internal assembler (default 47). - --mindepth depth - Variants (snps and indels) will only be called if there is coverage of good pairs - at this depth or more; if this value is >= 1, it is an absolute depth, if it is a - fraction < 1, then minimum depth is computed by multiplying this value by the mean - coverage for the region, with a minumum value of 5 (default 0.1: min depth to call - is 10% of mean coverage or 5, whichever is greater). - --mingap - Minimum size for unclosed gaps (default 10) - --minmq - Minimum alignment mapping quality for a read to count in pileups (default 0) - --minqual - Minimum base quality to consider for pileups (default 0) - --nostrays - Skip making a pass through the input BAM files to identify stray pairs, that is, - those pairs in which both reads are aligned but not marked valid because they have - inconsistent orientation or separation. Identifying stray pairs can help fill gaps - and assemble larger insertions, especially of repeat content. However, doing so - sometimes consumes considerable memory. -``` -::: - -You can read more about the possible outputs Pilon can produce in the [Wiki](https://github.com/broadinstitute/pilon/wiki/Output-File-Descriptions). - -We can see there are many different options for pilon. We will be using the defaults for our assembly. - -- `--genome` --- this will be the output assembly from Medaka -- `--frags` --- the short reads we used to create the BAM alignment were paired-end fragments, so we need to specify this using this flag -- `--outdir` --- this specifies a directory for all the output - -Check you are in the `cs_course` folder and run Pilon: - -``` {.bash filename="Code"} -cd ~/cs_course -pilon --genome results/medaka/consensus.fasta --frags results/pilon/short_read_alignment.bam --outdir results/pilon &> results/pilon/pilon.out & -``` - -We can again keep track of the analysis by looking at the `pilon.out` file with `less`. - -``` {.bash filename="Code"} -less results/pilon/pilon.out -``` - -The top of the file: - -``` {.default filename="Output"} -Pilon version 1.24 Thu Jan 28 13:00:45 2021 -0500 -Genome: medaka/consensus.fasta -Fixing snps, indels, gaps, local -Input genome size: 18930486 -Scanning BAMs -short_read_alignment.bam: 68579742 reads, 0 filtered, 7344428 mapped, 6036898 proper, 167134 stray, FR 100% 401+/-206, max 1018 -Processing contig_1057:1-17080 -frags short_read_alignment.bam: coverage 36 -Total Reads: 9147, Coverage: 36, minDepth: 5 -Confirmed 10586 of 17080 bases (61.98%) -Corrected 234 snps; 40 ambiguous bases; corrected 22 small insertions totaling 38 bases, 101 small deletions totaling 141 bases -# Attempting to fix local continuity breaks -# fix break: contig_1057:780-2830 0 -0 +0 NoSolution -# fix break: contig_1057:3081-4075 0 -0 +0 NoSolution -# fix break: contig_1057:4312-5132 0 -0 +0 NoSolution -# fix break: contig_1057:5367-5452 0 -0 +0 NoSolution -# fix break: contig_1057:5798-6336 0 -0 +0 NoSolution -# fix break: contig_1057:6798-8683 0 -0 +0 NoSolution -# fix break: contig_1057:8886-9132 0 -0 +0 NoSolution -# fix break: contig_1057:9394-12244 0 -0 +0 NoSolution -``` - -When Pilon finishes (around 1.5 hours) the end of the file will contain something like: - -``` {.default filename="Output"} -Writing updated contig_1125_pilon to ./pilon.fasta -Writing updated contig_885_pilon to ./pilon.fasta -Writing updated contig_1130_pilon to ./pilon.fasta -Writing updated contig_1121_pilon to ./pilon.fasta -Writing updated contig_697_pilon to ./pilon.fasta -Mean frags coverage: 30 -Mean total coverage: 30 -``` - -Navigate into the `pilon` directory and have a look at the output files Pilon has produced. - -``` {.bash filename="Code"} -cd results/pilon -ls -``` - -``` {.default filename="Output"} -alignment.out pilon.fasta short_read_alignment.bam short_read_alignment.bam.bai -``` - -We can see pilon has produced a fasta file `pilon.fasta`, which is the newly polished assembly. This file is now our assembly. - -In the next episode we will assess the quality of this assembly and compare its quality to that of the unpolished assembly. - -::: callout-tip -## Recommended reading: - -While you're waiting for the polishing to finish, here's some things you might want to read about: - -- Comparison of combined assembly and polishing methods [Trycycler: consensus long-read assemblies for bacterial genomes](https://link.springer.com/article/10.1186/s13059-021-02483-z) -- Polishing strategy for ONT and Pacbio Hifi reads [Polishing high-quality genome assemblies](https://www.nature.com/articles/s41592-022-01515-1) -- Comparison of polishing of ONT data with alignment free tool Jasper compared to POLCA, NextPolish and ntEdit [JASPER: a fast genome polishing tool that improves accuracy and creates population-specific reference genomes](https://www.biorxiv.org/content/10.1101/2022.06.14.496115v1.full) -- Comparison of short read polishers including pilon to the polisher Polypolish [Polypolish: Short-read polishing of long-read bacterial genome assemblies](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009802#) -- Pilon short read polisher paper [Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0112963) -- Accuracy of polishers including medaka for nanopore data [Nanpore consensus quality](https://github.com/rrwick/August-2019-consensus-accuracy-update#racon) -- Comparison of nanopore polishing tools [Comparative evaluation of Nanopore polishing tools for microbial genome assembly and polishing strategies for downstream analysis](https://www.nature.com/articles/s41598-021-00178-w) -:::