Skip to content

Latest commit

 

History

History
884 lines (651 loc) · 42 KB

README.md

File metadata and controls

884 lines (651 loc) · 42 KB

Polap: Plant Mitochondrial DNA Assembly Pipeline

Polap is a specialized pipeline designed to assemble plant mitochondrial DNA (mtDNA) using the Flye long-read assembler, supplemented by organellar gene annotation to guide contig selection. Inspired by ptGAUL (a chloroplast assembly pipeline), Polap integrates Flye's capabilities with gene annotation to identify and assemble mtDNA contigs from whole-genome data.

Description

Polap is a specialized meta-assembly pipeline that enhances the functionality of the Flye assembler. Serving as a Flye helper pipeline, Polap incorporates organellar gene annotation, aiding in the educated guess of selection of organelle-origin contigs from Flye's whole-genome assembly. The name "Polap" is inspired by the ptGAUL chloroplast genome assembly pipeline created by Wenbin Zhou, as Polap similarly focuses on organellar genomes, specifically plant mitochondrial DNA (mtDNA).

The pipeline combines the Flye assembler with FMLRC (a sequence polishing tool by Matt Holt) and applies sequencing depth analysis to identify contigs belonging to the mitochondrial genome. Leveraging both Flye-generated assemblies and organellar gene annotations, Polap helps users identify candidate mtDNA contigs based on sequencing depth and gene presence, enabling targeted selection and assembly of mtDNA.

Polap originated as an extension of ptGAUL, with a specific focus on assembling plant mtDNA from long-read sequencing data. The workflow begins by conducting a Flye whole-genome assembly, generating contigs without a reference genome. These contigs are then analyzed with BLAST to confirm the presence of mtDNA or chloroplast (ptDNA) genes. Using depth and copy number metrics, Polap allows users to identify likely mtDNA contigs, collects them, and performs a final targeted mtDNA assembly. The pipeline concludes with polishing using FMLRC, resulting in an accurate, high-quality mtDNA assembly.

Key Features

  • Rough organellar gene annotation: Enables Polap to select contigs from mitochondrial and chloroplast origins, focusing specifically on the target mtDNA.
  • Organellar seed contig selection: (Not tested yet!) Polap adds custom logic to filter and identify organelle-specific contigs.

How It Works

  1. Initial whole-genome assembly: Flye performs a whole-genome assembly on the input long reads. We need paired-end short-read data to estimate the genome size using JellyFish. The current version of Flye does not require a genome size for the genome assembly, but we have not test it yet. Because Polap still needs short-read data for a final polishing using FMLRC, it uses short-read data to estimate the genome size that might help Flye perform a whole-genome assembly with less memory and computing time.
  2. Organellar seed contig selection: Using BLAST, Polap identifies mtDNA-related contigs based on the presence of organelle genes. Users need to manually select mitochondrial-origin contigs using organelle gene annotation tables that are made by Flye and modified by Polap.
  3. Organelle-gename assembly: Selected mtDNA seed contigs and the input long-read sequencing data are aligned using Minimap2 and those reads that are well mapped on the input long reads are fed into Flye to assembly the organellar genome. We borrow the approach of ptGAUL in this long-read selection.
  4. Final Polishing: The selected mtDNA contigs undergo polishing with FMLRC for accurate genome assembly. Short-read sequence polishing uses FMLRC for high-quality, error-corrected assemblies of the target mitochondrial genome as suggested by ptGAUL.

Requirements

  • Operating System: Linux (not compatible with macOS or Windows)
  • Dependencies: Requires Bash (>= 3.0) and Miniconda
  • Installation: Can be installed via Bioconda's Polap conda package or manually from the Github source available at Polap

Quick Start

1. Install Miniconda:

mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh

After installing, close and reopen your terminal application.

2. Install the Bioconda package, Polap:

source ~/miniconda3/bin/activate
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
conda create --name polap bioconda::polap
wget https://github.com/goshng/polap/archive/refs/tags/0.3.7.2.zip
unzip 0.3.7.2.zip
cd polap-0.3.7.2
conda env create -f src/polap-conda-environment-fmlrc.yaml
cd test
polap assemble --test

If you see this output, then you are ready to use the long-read assembly pipeline.

output: the assembly graph: o/1/assembly_graph.gfa

3. Test the short-read polishing:

Then, polish the provided sequence file, mt.0.fasta, which is a test file for the short-read polishing.

polap prepare-polishing
polap polish
diff mt.0.fa mt.1.fa

If you see no output from the command, diff mt.0.fa mt.0.fa, then you are ready to use the polishing procedure as well.

4. Use your data to assemble a plant mtDNA sequence:

Now, if you have your Illumina paired-end short- and Oxford Nanopore long-read data files, put those in a folder and rename them as follows. Assume that you have three fastq files:

SRR30757340_1.fastq <- the first short-read FASTQ file
SRR30757340_2.fastq <- the second short-read FASTQ file
SRR30757341.fastq   <- the long-read FASTQ file

Rename the three files as follows:

mv SRR30757340_1.fastq s1.fq
mv SRR30757340_2.fastq s2.fq
mv SRR30757341.fastq l.fq

Then, execute the Polap command as you do in the provided test data:

polap assemble1

Once it completes a whole-genome assembly, then try to find mtDNA seed contigs by displaying a plant organelle gene annotation table for the genes that were roughly annotated by Polap using the NCBI organelle protein sequences that were downloaded May, 2023 using the following command:

polap annotate view table
Contig Length Depth Copy MT PT Edge
edge_47 286811 69 10 25 1 47
edge_729 50873 64 9 9 1 729
edge_46 1532068 14 2 3 0 46
edge_75 4781975 12 2 3 1 75
edge_55 4374598 13 2 1 0 55
edge_718 27942 78 11 1 0 718
edge_732 17165 74 11 1 0 732

Second, open the whole-genome assembly file named o/0/30-contigger/graph_final.gfa using Bandage and search the genome assembly for those that were annotated with more mtDNA than ptDNA genes. In the plant organelle gene annotation table, each row represents a contig sequence annotated with at least one plant organelle gene. Although Polap does not automatically select mtDNA-related contigs, it could help you guess which ones might originate from plant mtDNA. The table has several columns for a row:

  1. Contig: the name of a contig sequence
  2. Length: the length of the contig sequence
  3. Depth: the read coverage of the contig sequence
  4. Copy: the depth value of the contig divided by the median of all the depth values
  5. MT: the number of rough mitochondrial genes annotated
  6. PT: the number of rough plastid genes annotated
  7. Edge: the edge number of the contig sequence (the same number of the contig name in the row)

You would choose contig names with the following properties:

  • MT > PT
  • Depth or Copy values are roughly in between possible depth values from nuclear and those plastid.

Now, go back to the Bandage genome assembly graph to locate one or some of your choice of candidate seed mtDNA contigs. Watch YouTube to see how you could copy contig names using the Bandage software. Here, you would select contigs with a depth range that you think mtDNA contigs should have.

edge_264, edge_266, edge_47, edge_718, edge_729, edge_732

In the example of YouTube, all depth values of the selected contigs are inclusivly in between 61x and 78x. Come back to the terminal where you have executed polap assemble1, and execute the following:

polap seeds bandage

Then, paste or type in your copy of contig names to the question:

Enter edges using Bandage (e.g., edge_265, edge_520, edge_425):

This would create o/0/mt.contig.name-1 with following content.

edge_47
edge_729
edge_718
edge_732
edge_264
edge_266

You could see seed contigs that were added to the previous annotation table:

polap seeds view

Under the column Seed of the annotation table, A denotes that the contig was part of the initial annotation, G that the contig was added to the seed set via the genome assembly graph, and X the contig was not part of the seed set though it was in the initial annotation.

Contig Length Depth Copy MT PT Seed
edge_28 385224 7 2 11 2 A
edge_30 129617 8 3 11 1 A
edge_33 100315 7 2 7 0 A
edge_2864 11463 284 95 6 4 X
edge_31 60195 8 3 4 1 A
edge_2803 79961 8 3 4 0 A
edge_29 129920 6 2 3 2 A
edge_2 384522 3 1 2 0 X
edge_1282 3250 517 172 2 1 X
edge_2817 47958 8 3 2 1 A
edge_1888 231323 3 1 1 0 X
edge_1510 123019 3 1 0 0 G
edge_2802 20149 13 4 0 0 G
edge_32 8681 18 6 0 0 G

You have prepared a list of seed contigs for an organelle-genome assembly. Execute the following:

polap assemble2

If you are very lucky, you might already have a good candidate for your mtDNA from the Flye whole-genome assembly. You might be able to locate a circular mtDNA assembly graph or a graph that you could traverse to generate a circular mtDNA sequence. In that case, you may not need the Polap procedure, and proceed to polish the mtDNA from the whole-genome assembly using your input short-read data. If you are not that lucky to have a good mtDNA candidate in the whole-genome assembly but are lucky enough to have contigs clustered together in the genome assembly graph like the one example of the YouTube.

After the Polap's organelle-genome assembly step like above, you have two paths to follow. One is to stop there and extract an mtDNA sequence by following a path in the organelle-genome assembly graph using the Bandage software. The other is to consider the organelle-genome assembly as a whole-genome assembly and repeat the same procedure of seed contig selection and one more organelle-genome assembly step. We will consider the other scenario later and assume that you are lucky to have an mtDNA sequence from the organelle-genome assembly.

Flye creates your long-read polished assembly graph as a file named o/1/assembly_graph.gfa. Extract the mitochondrial genome sequence by opening o/1/assembly_graph.gfa in Bandage. Save this sequence as mt.0.fasta for the long-read mitochondrial genome assembly. You should be able to extract a draft organelle genome sequence from the assembly by watching YouTube that I would use in an undergraduate lecture in Korean. The audio does not matter, so just watch the YouTube if you have never used the Bandage software before.

Now, you have prepared a mtDNA draft sequence in the FASTA file named mt.0.fasta. From here to the finished mtDNA genome, the credit is due to ptGAUL, which uses FMLRC for the short-read polishing step. Polap has a handy menu called polap prepare-polishing and polap polish for the procedure. The former command converts your short-read data to a file in a format FMLRC could use to polish your draft mtDNA genome sequence in the latter command. You could have executed the first command polap prepare-polishing before the Flye whole-genome assembly. Execute the following:

polap prepare-polishing
polap polish

Your final mtDNA genome assembly is in the file named mt.1.fa, that you use GeSeq to actually annotate. For more detailed installation and usage instructions, please read along with this README.

Detailed installation

Using Bioconda package

  1. Install Miniconda by following the installation procedure:

    $ curl -OL https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
    $ bash Miniconda3-latest-Linux-x86_64.sh
  2. Log out and log in back to activate a conda environment. The following prompt with base indicates that you are ready to use conda environments:

    (base) $
  3. Configure Conda so that it would allow to install Bioconda packages:

    (base) $ conda update -y -n base conda
    (base) $ conda config --prepend channels bioconda
    (base) $ conda config --prepend channels conda-forge
  4. Check your default channels and the order of the three channels:

    (base) $ conda config --show channels
    channels:
       - conda-forge
       - bioconda
       - defaults
  5. Install Polap:

    (base) $ conda create --name polap bioconda::polap
  6. Test Polap:

    (base) $ conda activate polap
    (polap) $ git clone https://github.com/goshng/polap.git
    (polap) $ cd polap/test
    (polap) $ polap assemble --test
  7. Check the output:

    output: the assembly graph: o/1/assembly_graph.gfa

We need one more step for the installation. Although it is not actually a part of Polap, we need a short-read polishing step with FMLRC, which required a separate conda environment. It may be complicated, but read along with this README where I explain why currently we have to do this one more installation step.

Short-read polishing Requirement: FMLRC

FMLRC required a python version 2, which was too old to be installed along with other Bioconda packages for Polap. So, we could not create a single Bioconda package for Polap. Instead, users could install one more conda environment for the short-read polishing step. Here's how to create a separate Conda environment for FMLRC while maintaining compatibility with Polap's environment. This setup ensures that FMLRC and Polap can function independently without conflicts. Following the previous setup, you'll need to create a new environment specifically for FMLRC. This will prevent any compatibility issues with Polap's main environment.

  1. Create the FMLRC Environment (note: we need the specific conda environment name of polap-fmlrc for Polap to use it in the short-read polishing step):

    (polap) $ conda deactivate
    (base) $ git clone https://github.com/goshng/polap.git
    (base) $ cd polap
    (base) $ conda env create -f src/polap-conda-environment-fmlrc.yaml
  2. Activate the FMLRC Environment:

    (base) $ conda activate polap-fmlrc
  3. Test-run of FMLRC for Sequence Polishing: Execute the FMLRC polishing commands as needed. Once complete, you can switch back to the Polap environment:

    (polap-fmlrc) $ cd test
    (polap-fmlrc) $ ../src/polap.sh prepare-polishing
    (polap-fmlrc) $ ../src/polap.sh polish
    (polap-fmlrc) $ conda deactivate
    (base) $

This way, you can seamlessly transition between the two environments, keeping dependencies separate for both Polap and FMLRC.

Using github source

If you're unable to install Polap via the Bioconda package, you can install it directly from the GitHub source. Ensure that Miniconda is installed beforehand, as Polap relies on tools, including Flye, Minimap2, Jellyfish, and others.

  1. Make sure that you have Miniconda ready before following next steps.

  2. Download the source code of Polap available at Polap's github website:

    (base) $ git clone https://github.com/goshng/polap.git
    (base) $ cd polap
  3. Setup the conda environments for Polap.

    (base) $ conda env create -f src/polap-conda-environment.yaml
  4. Setup the conda environment for FMLRC.

    (base) $ conda env create -f src/polap-conda-environment-fmlrc.yaml
  5. Activate polap conda environment.

    (base) $ conda activate polap
  6. Run Polap with a test dataset.

    (polap) $ cd test
    (polap) $ ../src/polap.sh assemble --test
  7. If you see a screen output ending with the something like this:

    output: the assembly graph: o/1/assembly_graph.gfa
  8. Polishing with a short-read data:

    (polap) $ conda deactivate
    (base) $ conda activate polap-fmlrc
    (polap-fmlrc) $ ../src/polap.sh prepare-polishing
    (polap-fmlrc) $ ../src/polap.sh polish

Your final mitochondrial genome sequence is mt.1.fa.

Usage with main Polap menus

After successfully running the test dataset with Polap, you have two options for executing Polap, depending on your installation method.

  1. If you installed Polap via Bioconda:
    Simply type polap in the command line, provided you have the Polap Conda environment activated:

    (polap) $ polap

    Be sure the prompt shows (polap), indicating the correct environment is active.

  2. If you installed Polap from the GitHub source:
    Clone the Polap repository into your data folder and run the main script. Ensure the Polap Conda environment is still active.

    (polap) $ git clone https://github.com/goshng/polap.git
    (polap) $ cd polap
    (polap) $ src/polap.sh

For simplicity, the instructions below assume you installed Polap using Bioconda. In cases where you clone the GitHub repository, replace polap with src/polap.sh as needed. If you're experienced with the command line, this setup will feel straightforward. Otherwise, follow the steps in this README closely.

Help Message

Polap operates primarily through subcommands, also called "menus." Key menus include assemble1, annotate, and assemble2, each with its own help message.

To view the general help message, run Polap without any options:

(polap) $ polap

To see help for a specific menu, such as assemble1, use:

(polap) $ polap assemble1 help

Input Data

  1. Set Up Your Input Folder:
    Create a folder and place your input dataset within it. This dataset should consist of:

    • One Oxford Nanopore long-read FASTQ file
    • Two short-read FASTQ files

    A set of sample FASTQ files is available in the test folder if you've cloned the Polap repository. Currently, only configurations with two short-read files are supported.

  2. File Requirements:
    If you don't specify input file paths with options, Polap will expect three files named l.fq, s1.fq, and s2.fq in the current working directory where you execute the Polap script. Make sure these files are uncompressed. You can also specify the files using the options -l, -a, and -b.

Whole-Genome Assembly

With your three FASTQ files prepared, you’re ready for the first step in Polap: running a whole-genome assembly using Flye.

  1. Menu Command Structure:
    In Polap, the primary command argument is referred to as the "menu," which specifies the task. You can view help for any menu by adding help as a second argument. To see the options for the assemble1 menu, use:

    (polap) $ polap assemble1 help
  2. Run Assembly:
    Start the whole-genome assembly with:

    (polap) $ polap assemble1

    Assembly times vary based on the dataset, so please be patient.

Plant Organelle Annotation

After completing the whole-genome assembly, you can visualize the genome assembly graph in o/0/30-contigger/graph_final.gfa using Bandage. This graph shows contigs, some originating from mitochondrial or plastid DNA, alongside nuclear DNA. You could use BLAST your assembly against organelle genes and provide those BLAST results to the Bandage software to locate potential mtDNA-related contigs. Polap can facilitate such process with the command named polap annotate, which is part of polap assemble1. So, you could display the annotation table that focuses on mtDNA-related contigs.

(polap) $ polap annotate view table
Contig Length Depth Copy MT PT Edge
edge_47 286811 69 10 25 1 47
edge_729 50873 64 9 9 1 729
edge_46 1532068 14 2 3 0 46
edge_75 4781975 12 2 3 1 75
edge_55 4374598 13 2 1 0 55
edge_718 27942 78 11 1 0 718
edge_732 17165 74 11 1 0 732

If you can identify mitochondrial DNA contigs manually, you can proceed with those to the organelle-genome assembly using polap assemble2. But, take your time to choose seed contigs in the following section.

Organelle Contig Selection

After completing the polap annotate step, you'll find organelle gene annotations in text files within the o/0/ directory:

  • contig-annotation-depth-table.txt
  • contig-annotation-table.txt
  • assembly_info_organelle_annotation_count-all.txt
  • assembly_info_organelle_annotation_count.txt

You could inspect these files to identify contigs with a higher number of mitochondrial (MT) genes compared to plastid (PT) genes. Execute the following command to display organelle gene annotation with a particular focus on mtDNA genes:

polap annotate view table

It displays a table that you could inspect to determine mtDNA seed contigs that you could try to use later in the Polap's organelle-genome assembly.

This seed contig selection step is subjective, so you may need to carefully review the annotation table files to choose contigs that likely originate from plant mitochondrial DNA. These annotation tables are derived from o/0/30-contigger/graph_final.gfa that Flye has generated. We modify contigs_stats.txt from Flye to create a similar file and provide information about each contig assembled from the whole-genome assembly.

You’ll use both the graph visualization of the genome assembly o/0/30-contigger/graph_final.gfa, and the annotation table files to select candidate contigs. Once identified, prepare a text file (o/0/mt.contig.name-1) with each selected edge sequence name on a new line. Note that edge sequence names should start with edge_, not contig_. An example file might look like this:

edge_47
edge_729
edge_718
edge_732
edge_264
edge_266

Your could visualize the whole-genome assembly graph using the Bandage software to select contigs. You would choose contig names with the following properties:

  • MT > PT
  • Depth or Copy values are roughly in between possible depth values from nuclear and those plastid.

Watch YouTube to see how you could copy contig names using the Bandage software. Here, you would select and copy contig names with a depth range that you think mtDNA contigs should have. Come back to the terminal where you have executed polap assemble1, and execute the following:

polap seeds bandage

Then, paste or type in your copy of contig names to the question:

Enter edges using Bandage (e.g., edge_265, edge_520, edge_425):

It generates a text file (o/0/mt.contig.name-1) with edge names that you paste. For more information on preparing a mitochondrial contig file, refer to the MT contig name section below.

Organelle-Genome Assembly

With the contig candidates selected, you are ready to proceed with the organelle-genome assembly:

(polap) $ polap assemble2

This command will initiate the second Flye assembly, generating an assembly graph file o/1/assembly_graph.gfa. You can view this organelle-genome assembly graph using Bandage.

Extracting a mitochondrial DNA sequence using Bandage

Extract the mitochondrial genome sequence by opening o/1/assembly_graph.gfa in Bandage. Save this sequence as mt.0.fasta for the long-read mitochondrial genome assembly. Your long-read polished assembly graph is o/1/assembly_graph.gfa You could extract a draft organelle genome sequence from the assembly graph by using Bandage. You could watch YouTube (in Korean) to see how you could extract a DNA sequence through a circular path.

Polishing

Polap follows the ptGAUL approach, which uses FMLRC to polish long-read assemblies with short-read data. To polish your assembly, activate the FMLRC environment and run the following commands:

$ conda activate polap-fmlrc
(polap-fmlrc) $ src/polap.sh prepare-polishing
(polap-fmlrc) $ src/polap.sh polish

Make sure that you have your draft genome sequence file ready before calling the polap polish command. You could use -p option to specify your draft sequence FASTA file.

The final polished mitochondrial genome sequence will be saved as mt.1.fa.

Uninstalling Polap

To completely remove Polap and its associated environments, follow these steps:

  1. Deactivate the Polap environment:

    (polap) $ conda deactivate
  2. Remove the Polap environment:

    (base) $ conda remove -n polap --all
  3. Remove the Polap-FMLRC environment:

    (base) $ conda remove -n polap-fmlrc --all

More

Sample Datasets

A sample dataset demonstrating Polap's capabilities is available on Figshare: Polap data analysis of 11 datasets.

MT Contig Name File

To prepare an MT contig name file, list the contig names you suspect are of mitochondrial origin. Select these contigs based on three key features:

  1. Genome assembly graph
  2. Presence of organelle genes
  3. Contig copy numbers

Contigs are labeled as edges, each suffixed with a number. To examine the whole-genome assembly graph, use Bandage to open o/0/30-contigger/graph_final.gfa. You can also view organelle annotations and copy numbers with the following command:

polap annotate view table

For additional guidance on identifying mtDNA contigs, a YouTube tutorial in Korean is available. The following example shows a Flye assembly table with gene counts, where the Copy column indicates the copy number, MT represents mitochondrial genes, and PT represents plastid genes. The Edge column lists edge numbers that make up each contig.

Example table:

Contig Length Depth Copy MT PT Edge
edge_47 286811 69 10 25 1 47
edge_729 50873 64 9 9 1 729
edge_46 1532068 14 2 3 0 46
edge_75 4781975 12 2 3 1 75
edge_55 4374598 13 2 1 0 55
edge_718 27942 78 11 1 0 718
edge_732 17165 74 11 1 0 732

After reviewing the table, edit o/0/mt.contig.name-1 to include the names of candidate mtDNA contigs. Each line should list an edge, without any empty lines. Here’s an example:

edge_47
edge_729
edge_718
edge_732
edge_264
edge_266

Genome assembly numbers

Polap generates all results in an output folder that you can change from the default o using option -o or --outdir. In the output folder, Polap uses number 0 to represent a whole-genome assembly and positive integer numbers for organelle-genome assemblies. You always have number 0 folder in your folder whenever you run the whole-genome assembly. Because a genome assembly number folder is the base folder that Flye creates, it has Flye's output folders including 10-assembly, 20-consensus, 30-contigger, and 40-polishing. If you have used Flye before, then the folder structure will be familiar to you. Polap adds more folders by following Flye's output folder structure; e.g., 01-contig.

Polap's has a few options that are often used or implicitly assumed when you execute. If you execute Polap with whole-genome and organelle-genome assemblies, then you do not need to specify them. But, if you need one more organelle-genome assembly, then you need to specify these options. If you create another mtDNA seed contig file, its name should following accordingly as well.

  • --inum 0 --jnum 1: default option values, 0 is the whole-genome assembly, and 1 is the organelle-genome assembly. You must have 0/mt.contig.name-1 for the seed contig name file.
  • --inum 0 --jnum 2: 0 is the whole-genome assembly, and 1 is the organelle-genome assembly. You must have '0/mt.contig.name-2' for the seed file.
  • --inum 1 --jnum 2: 1 is the whole-genome assembly, and 2 is the organelle-genome assembly. You must have '1/mt.contig.name-2' for the seed file.

The whole-genome assembly number should be less than the organelle-genome assembly although Polap would allow the case of the other way around.

Options

Menu options

  • assemble1: Flye whole-genome assembly
  • annotate: Organelle gene annotation. It is part of assemble1 menu.
  • assemble2: Flye organelle-genome assembly
  • prepare-polishing: FMLRC short-read polishing preparation
  • polish: FMLRC short-read polishing

General options

POLAP - Plant organelle DNA long-read assembly pipeline.
version v0.3.7-eadc43f

Usage: polap [<menu> [<menu2> [<menu3>]]] [-o|--outdir <arg>]
    [-i|--inum <arg>] [-j|--jnum <arg>] [-w|--single-min <arg>]
    [-l|--long-reads <arg>] [-a|--short-read1 <arg>] [-b|--short-read2 <arg>]
    [-m|--min-read-length <arg>] [-t|--threads <arg>] [--test] [--log <arg>]
    [--random-seed <arg>] [--version] [-h|--help]

Assemble mitochondrial DNA (mtDNA) in a single command (not tested yet):
    polap -l <arg> -a <arg> [-b <arg>]
    polap assemble -l <arg> -a <arg> [-b <arg>]

Perform a polishing of the mtDNA sequence utilizing the FMLRC protocol:
    polap prepare-polishing -a <arg> [-b <arg>]
    polap polish -p <arg> -f <arg>

To assemble mitochondrial DNA (mtDNA), follow a two-step process involving
manual seed contig selection:
    polap assemble1 -l <arg> -a <arg> [-b <arg>] [-m <arg>]
    polap assemble2 -i <arg> -j <arg> [-w <arg>] [-c <arg>]

To assemble mitochondrial DNA (mtDNA), follow a three-step process
that utilizes semi-automatic seed contig selection (not working yet):
    polap assemble1 -l <arg> -a <arg> [-b <arg>] [-m <arg>]
    polap seeds -i <arg> -j <arg>
    polap assemble2 -i <arg> -j <arg> [-w <arg>] [-c <arg>]

To assemble mitochondrial DNA (mtDNA), follow a series of sequential steps:
    polap init -o <arg>
    polap total-length-long -l <arg>
    polap find-genome-size -a <arg> [-b <arg>]
    polap reduce-data -l <arg> [-m <arg>]
    polap summary-reads -a <arg> [-b <arg>]
    polap flye1 [-t <arg>]
    polap edges-stats -i <arg>
    polap annotate -i <arg>
    polap seeds [-i <arg>] -j <arg>
    polap map-reads [-i <arg>] -j <arg>
    polap test-reads [-i <arg>] -j <arg> -w <arg> [-c <arg>]
    polap select-reads [-i <arg>] -j <arg> -w <arg> [-c <arg>]
    polap flye2 [-i <arg>] -j <arg>

Others menus:
    polap blast-genome -i <arg>
    polap count-genes -i <arg>
    polap flye-polishing -j <arg>
    polap make-menus
    polap clean-menus
    polap list

BioProject menus:
    polap get-bioproject --bioproject <arg>
    polap bioproject-prepare -o <arg>
    polap get-bioproject-sra --sra <arg>
    polap get-mtdna --species <arg>

Other options:
    [-p|--unpolished-fasta <arg>] [-f|--final-assembly <arg>]
    [-c|--coverage <arg>] [--flye-asm-coverage <arg>]
    [--bioproject <arg>] [--species <arg>] [--accession <arg>]
    [--query <arg>] [--subject <arg>]
    [--no-reduction-reads] [--no-coverage-check]
    [--plastid]
    [--archive <arg>]
    [--sra <arg>] [-g|--genomesize <arg>]

menu: assemble, assemble1, annotate, assemble2, flye-polishing,
    make-menus, list, clean-menus, cleanup, init,
    summary-reads, total-length-long, find-genome-size, reduce-data, flye1
    blast-genome, count-gene, seeds,
    prepare-seeds, map-reads, test-reads, select-reads, flye2,
    flye-polishing, prepare-polishing, polish,
    version

Options:
  -o, --outdir: output folder name (default: o)
    The option '-o' or '--outdir' specifies the output folder name,
    with a default value of 'o'. The output folder typically contains input
    files that are long-read and short-read data files. Input data files can
    be specified using the options provided by -l, -a, and -b.

  -l, --long-reads: long-reads data file in fastq format (default: l.fq)
    The option '-l' or '--long-reads' specifies the location of a long-reads
    data file in fastq format, with a default filename of 'l.fq'.

  -a, --short-read1: short-read fastq file 1 (default: s1.fq)
    The option '-a' or '--short-read1' specifies the first short-read fastq
    file to be used, with a default value of "s1.fq".

  -b, --short-read2: short-read fastq file 2 (default: s2.fq)
    The option '-b' or '--short-read2' specifies a short-read fastq file 2,
    with a default value of 's2.fq'. The second short-read data file,
    if provided, is considered optional.
    (Note: not tested yet; -a & -b are required.)

  -i, --inum: previous output number of organelle-genome assembly (default: 0)
    The option '-i' or '--inum' specifies the previous output number
    of an organelle-genome assembly, with a default value of '0'.
    The zero for this option specifies the whole-genome assembly.

  -j, --jnum: current output number of organelle-genome assembly (default: 1)
    The option '-j' or '--jnum' allows users to specify the current output number
    for an organelle-genome assembly, with a default value of '1'.

  -m, --min-read-length: minimum length of long reads (default: 3000)
    The option '-m' or '--min-read-length' specifies the minimum length of
    long reads, with a default value of 3000.

  -t, --threads: number of CPUs (default: maximum number of cores)
    The option '-t' or '--threads' specifies the number of CPU threads to
    utilize, with a default value equal to the maximum number of available
    cores.

  -c, --coverage: coverage for the organelle-genome assembly (default: 50)
    The option '-c' or '--coverage' specifies the coverage percentage for the
    organelle-genome assembly for controlling the data size.

  -w, --single-min: minimum mapped bases or PAF 11th column (default: 3000)
    This parameter ensures that the alignment level between a long-read and a
    seed contig is properly controlled. For plant mitochondrial DNAs, a DNA
    fragment size of approximately 3 kilobases appears to be more effective
    than the smaller 1-kilobase fragment. In the case of plastid DNAs, a
    fragment size of 1 kilobase (kb) might be more suitable, requiring an
    adjustment to the -m option accordingly.

  -g, --genomesize: expected genome size (default: estimated with a short-read dataset)
    Users can assemble an organelle genome when they have a genome size
    estimate. But, we require a short-read dataset to determine the genome size
    for the whole-genome assembly process. Polishing a long-read assembly
    necessitates the use of a short-read dataset.

  -p, --unpolished-fasta: polishing sequence in fasta format (default: mt.0.fasta)
    The option enables the polishing of sequences in a FASTA format,
    with the default output file being named 'mt.0.fasta'.

  -f, --final-assembly: final assembly in fasta format (default: mt.1.fa)
    The final assembly in FASTA format, with a default file name of 'mt.1.fa'.

  --no-reduction-reads: reduction of long-read data before assemble1
    In the process of whole-genome assembly, we utilize a reduced amount of
    long-read data. By default, we reduce the size of a long-read dataset prior
    to performing a whole-genome assembly.
    Note: The size of coverage is set by --coverage (default: 50)

  --no-coverage-check: coverage check before assemble2 step
    By default, in the process of assembling organelle genomes, we reduce
    the size of selected seed reads.
    Note: The size of coverage is set by --coverage (default: 50)

  --yes: always yes for a question or deletes output completely (off by default)

  --redo: redo a POLAP pipeline (off by default)
    The command specifies that any previously generated intermediate results
    should be disregarded and new calculations performed from scratch.

  --select-read-range: start,end,number for the range of read selection (default: 3000,39000,7)
    It specifies the values for ptGAUL read-selection minimum number of
    bases or ratios. For the start and end values of a ratio, real numbers must
    fall within the range of 0 to 1.
    Note: refer to the menu "test-reads" for help.

  --start-index: used by test-reads

  --random-seed: 5-digit number (default automatically assigned)
    To ensure reproducibility, you can supply a random number seed
    to facilitate sampling of reads.
    seqkit sample random seed; 11 used in seqkit sample.

  --flye-asm-coverage: Flye --asm-coverage (default: 30)
    Flye --asm-coverage is a parameter used with the assembly coverage of Flye.

  --no-flye-asm-coverage: no use of Flye --asm-coverage
    The flag '--no-flye-asm-coverage' indicates that we use Flye option
    neither --asm-coverage nor --genome-size in flye execution.
    This option is the same as --flye-asm-coverage set to 0.
    Note: not tested yet!

  --polap-reads: use intra- and inter-contig read selection (default: off)
    The default read selection is ptGAUL's approach.
    This option allows long reads that are mapped within a seed contig and
    between two contigs.

  --bridge-same-strand: (default: off)
    When linking two inverted repeats, enabling this feature ensures that
    the strands are equal for the two mapped IR contigs.
    Note: currently only plus strand is used.

  --log: log file (default: <output>/polap.log)
    The log file option allows users to specify a custom log file location,
    with a default setting of '<output>/polap.log'.

  --clock: display the start and ending time (default: off)
    The clock option allows users to display both the start and end times.

  --species: Species scientific name (no default)
  --sra: SRA data (no default)

  -v, --verbose: use multiple times to increase the verbose level
  --version: Prints version
  -h: Prints polap global help
  --help: Prints menu help

Example:
  git clone https://github.com/goshng/polap.git
  cd polap/test
  polap --test

Place your long-read and short-read files at a folder:
  long-read file: l.fq
  short-read file: s1.fq, s2.fq

Execute: polap init

Detailed description of some of the options

-o DIRECTORY, --outdir DIRECTORY : Write output files to DIRECTORY. The default output directory is o.

-l FILE, --long-reads FILE : Specify the long-read FASTQ file. If this option is not specified, the default long-read file will be used. It will be the l.fq at the current directory. The FASTQ file must be a regular file not compressed one.

-a FILE, --short-read1 FILE : Specify the first short-read FASTQ file. If this option is not specified, the default long-read file will be used. It will be the s1.fq at the current directory. The FASTQ file must be a regular file not compressed one. Two short-read data files are required.

-b FILE, --short-read2 FILE : Specify the second short-read FASTQ file. If this option is not specified, the default 1st short-read file will be used. It will be the s2.fq at the current directory. The FASTQ file must be a regular file not compressed one. Two short-read data files are required.

-w NUMBER, --pair-min NUMBER : Specify the minimum mapped bases or PAF 11th column (default: 3000).

-i NUMBER, --inum NUMBER : Specify the previous output number of organelle-genome assembly (default: 0).

-j NUMBER, --jnum NUMBER : Specify the current output number of organelle-genome assembly (default: 1).

-m NUMBER, --min-read-length NUMBER : Specify the minimum length of long-read sequences. If this option is not specified, the default minimum length will be used. It will be the 3000.

-t NUMBER, --threads NUMBER : Specify the numbre CPU cores to use (default: 0).

-c NUMBER, --coverage NUMBER : Specify coverage for the 2nd assembly (default: 30).

-p FILE, --unpolished-fasta FILE : Specify the unpolished FASTA file. If this option is not specified, the default 2nd short-read file will be used. It will be the mt.0.fasta at the current directory.

-f FILE, --final-assembly FILE : Specify the final genome assembly FASTA file. If this option is not specified, the default long-read file will be used. It will be the mt.1.fa at the current directory.

--version : Print version.

--help : Show usage message of a menu.

-h : Show usage message.

Authors

Copyright 2024 Sungshin Women's University. Released under the GNU General Public License version 3, this software carries no warranty of any kind.