A DNA barcoding analysis pipeline designed to perform phylogenetic analysis on raw Sanger sequencing data.
- Requirements
- Installation
- Example commands
- Full usage
- Method
- Interpreting output
- Acknowledgements
- License
DBA is designed to run on Linux-based systems. The pipeline was specifically tested on Ubuntu 22.04.2 LTS. Some tools and most dependencies used during analysis are installed automatically when the conda environment is created.
The following prerequisites should be installed prior to using DBA:
- Anaconda3
- MEGA11
- Python 3.8+
- Java
- Jalview
- Note: Jalview should be a system-wide executable as: jalview
git clone https://github.com/mdcjansen/DBA
cd path/to/DBA
conda env create -f environment.yml
chmod a+x DBA.py
ln -s /path/to/DBA.py /path/to/anaconda3/envs/DBA/bin/DBA
conda activate DBA
Running DBA without arguments, or with -h or --help, displays all required and optional arguments in the format presented below:
DBA
DBA -h
DBA --help
DBA -i <input folder> -n <genbank reference NC_ID> -y <genbank reference YP_ID> -g <genbank outgroup NC_ID>
The input below was used to produce the case data available within this repository:
DBA -i case_data -n NC_009065 -y YP_001054869 -g NC_02842
usage: DBA -i <inputfolder> -n <genbank reference NC_ID> -y <genbank reference YP_ID> -g <genbank outgroup NC_ID> [options]
DBA is designed to automate the process of DNA barcoding by utilising standard Sanger sequencing data and user-provided reference gene and protein accession numbers.
This pipeline obtains nucleotide and protein sequences from GenBank, followed by chromatogram production in PDF format using sangerseq_viewer.
Next, BLASTx and BLASTn are run to assess query coverage against the reference gene.
MUSCLE is used to align the sequences, followed by manual review in Jalview and phylogenetic analysis with MEGA.
List of arguments:
-h, --help show this help message and exit
-v, --version Prints program version
-i [input folder] Input folder containing per species folders, a single concatenated fasta file for analysis and a MEGA mao file for phylogenetic analysis
-n [genbank reference nucl acc no.]
genbank nucleotide accession number for the gene to be used as reference, usually starts with the identifier NC_
-y [genbank reference prot acc no.]
genbank protein accession number for the protein to be used as reference, usually starts with the identifier YP_
-g [genbank nucl acc no. for out group]
genbank nucleotide accession number for outgroup gene used during phylogenetic analysis, usually starts with the identifier NC_
-o [output folder] Output directory.
If no output folder is specified, a default one is created with a name in the following format:
barcoding_output_current-date_current-time
-t [cpu threads] Maximum number of threads to be utilised during analysis.
Default: 20
-keep [] Keep all files produced during analysis.
Files are stored in the 'workdir' folder at the location where this script has been executed.
Default: False
-rev [] Reverse complement the fasta input.
Reverse complemented file is saved as an additional file.
Default: False
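For readers who want to see how an interface like this could be declared, the following is a minimal sketch using Python's argparse. It is not taken from DBA.py: the function name build_parser, the dest names, and the version placeholder are assumptions made for illustration. Only the flags and defaults come from the usage text above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented DBA options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="DBA",
        description="Hypothetical re-creation of the documented DBA command line.",
    )
    # The real version string is not stated in this document, so a placeholder is used.
    parser.add_argument("-v", "--version", action="version",
                        version="DBA (version placeholder)")
    # Required arguments, as listed in the usage line.
    parser.add_argument("-i", dest="input_folder", required=True,
                        help="input folder with species folders, FASTA and mao file")
    parser.add_argument("-n", dest="ref_nucl", required=True,
                        help="GenBank nucleotide accession of the reference (NC_)")
    parser.add_argument("-y", dest="ref_prot", required=True,
                        help="GenBank protein accession of the reference (YP_)")
    parser.add_argument("-g", dest="outgroup", required=True,
                        help="GenBank nucleotide accession of the outgroup (NC_)")
    # Optional arguments with the documented defaults.
    parser.add_argument("-o", dest="output_folder", default=None,
                        help="output directory (a default name is used when omitted)")
    parser.add_argument("-t", dest="threads", type=int, default=20,
                        help="maximum number of CPU threads")
    parser.add_argument("-keep", dest="keep", action="store_true",
                        help="keep all files produced during analysis")
    parser.add_argument("-rev", dest="rev", action="store_true",
                        help="reverse complement the FASTA input")
    return parser


# Parse the case-data command shown in the examples, passed as an explicit list.
args = build_parser().parse_args(
    ["-i", "case_data", "-n", "NC_009065", "-y", "YP_001054869", "-g", "NC_02842"]
)
print(args.threads, args.keep, args.rev)
```

When the optional flags are omitted, as in the case-data command, the parsed values fall back to the documented defaults.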
DBA requires a specific folder structure in order to perform the analysis.
A visualisation of this folder structure is given below, where case_data is passed as the -i argument:
.
├── ...
├── case_data
│ ├── sanger.fasta # FASTA file containing all sanger sequences to be used during analysis
│ ├── phylo.mao # MEGA11 generated file used to perform phylogenetic analysis
│ ├── species_01 # Folder with sequencing data of first species to be analysed
│ │ ├── sample_01 # Folder containing sanger sequencing data of the first sample
│ │ │ ├── sanger_data.ab1
│ │ │ └── ...
│ │ └── ...
│ ├── species_02 # Folder with sequencing data of second species to be analysed
│ │ ├── sample_01 # Folder containing sanger sequencing data of the first sample
│ │ │ ├── sanger_data.ab1
│ │ │ └── ...
│ │ └── ...
│ └── ...
└── ...
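Before starting a run, it can help to confirm that the input folder matches this layout. The sketch below is a hypothetical helper and not part of DBA. The name check_input_layout is invented, and it assumes, based on the example tree, that the FASTA file ends in .fasta, the MEGA file ends in .mao, and the .ab1 files sit exactly one level below each species folder.

```python
from pathlib import Path


def check_input_layout(input_dir):
    """Return a list of problems found in a DBA-style input folder (empty if none)."""
    root = Path(input_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]

    problems = []

    # Assumption: exactly one .fasta and exactly one .mao file sit at the top level.
    for suffix, label in ((".fasta", "FASTA"), (".mao", "MEGA .mao")):
        matches = [p for p in root.iterdir()
                   if p.is_file() and p.suffix.lower() == suffix]
        if len(matches) != 1:
            problems.append(
                f"expected exactly one {label} file, found {len(matches)}"
            )

    # Every top-level directory is treated as a species folder.
    species_dirs = sorted(p for p in root.iterdir() if p.is_dir())
    if not species_dirs:
        problems.append("no species folders found")

    # Each species folder should hold sample folders that contain .ab1 traces.
    for species in species_dirs:
        if not any(species.glob("*/*.ab1")):
            problems.append(
                f"{species.name}: no .ab1 files found in any sample folder"
            )

    return problems
```

Calling check_input_layout("case_data") returns an empty list when the folder follows the structure shown above, and one message per detected problem otherwise.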
DBA will perform phylogenetic analysis as presented below:
- If specified, DBA will reverse complement the input fasta file
- DBA will obtain GenBank information for the input GenBank accession numbers
- sangerseq_viewer is utilised to produce a chromatogram for each of the input species found
- BLAST is performed on the input fasta sequence against the input reference sequences
- Multiple sequence alignment is performed with MUSCLE
- DBA initiates Jalview for manual review and trimming of alignment data
- MEGA11 is used to perform phylogenetic analysis and produce a consensus tree
- If specified, supplemental data used during analysis is moved to the output folder and the analysis ends.
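The optional first step, reverse complementing the input FASTA, can be illustrated with a small pure-Python sketch. This is not DBA's own code. The function names are hypothetical, and the RC_ prefix added to record names is an assumption inferred from the sequence identifiers in the case-data BLAST output.

```python
# IUPAC nucleotide codes and their complements (S, W and N are self-complementary).
_IUPAC = "ACGTRYKMBVDHSWN"
_IUPAC_COMP = "TGCAYRMKVBHDSWN"
_COMPLEMENT = str.maketrans(
    _IUPAC + _IUPAC.lower(), _IUPAC_COMP + _IUPAC_COMP.lower()
)


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def reverse_complement_fasta(text, prefix="RC_"):
    """Reverse complement every record in FASTA-formatted text.

    Record names get `prefix` prepended (an assumption, see the lead-in).
    Sequences are written on a single line per record.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))

    return "".join(
        f">{prefix}{name}\n{reverse_complement(seq)}\n" for name, seq in records
    )
```

Writing the returned text to a new file, rather than overwriting the input, matches the documented behaviour of saving the reverse complemented sequences as an additional file.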
DBA produces a single output folder where all results can be found. All procedures during analysis are logged and stored in the barcoding.log file.
BLAST results are placed into two TSV files named blastn.tsv and blastx.tsv for BLASTn and BLASTx results respectively.
Standard BLAST tabular output format 6 is used to generate the output.
An example with case data and manually added header is presented below:
qseqid | sseqid | pident | length | mismatch | gapopen | qstart | qend | sstart | send | evalue | bitscore |
---|---|---|---|---|---|---|---|---|---|---|---|
RC_EF71414350_EF71414350 | NC_009065.1 | 98.214 | 672 | 8 | 4 | 13 | 680 | 5513 | 6184 | 0.0 | 1171 |
RC_EF71414351_EF71414351 | NC_009065.1 | 99.519 | 624 | 3 | 0 | 5 | 628 | 5507 | 6130 | 0.0 | 1136 |
RC_EF71414352_EF71414352 | NC_009065.1 | 97.765 | 671 | 9 | 6 | 5 | 670 | 5507 | 6176 | 0.0 | 1151 |
RC_EF71414353_EF71414353 | NC_009065.1 | 98.336 | 661 | 6 | 5 | 10 | 670 | 5513 | 6168 | 0.0 | 1155 |
RC_EF71414354_EF71414354 | NC_009065.1 | 99.522 | 628 | 3 | 0 | 6 | 633 | 5507 | 6134 | 0.0 | 1144 |
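Because the files themselves carry no header row, the column names have to be supplied when loading them. A minimal sketch using only Python's standard library follows. The function name read_blast6 and the BLAST6_COLUMNS constant are hypothetical, and the sample row is copied from the table above.

```python
import csv
import io

# Default column order of BLAST tabular output format 6.
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
_INT_COLUMNS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
_FLOAT_COLUMNS = ("pident", "evalue", "bitscore")


def read_blast6(handle):
    """Read headerless BLAST outfmt 6 rows from a file-like object into dicts."""
    reader = csv.DictReader(handle, fieldnames=BLAST6_COLUMNS, delimiter="\t")
    rows = []
    for row in reader:
        # Convert numeric columns so they can be filtered and compared.
        for key in _INT_COLUMNS:
            row[key] = int(row[key])
        for key in _FLOAT_COLUMNS:
            row[key] = float(row[key])
        rows.append(row)
    return rows


# One row from the case data; a real run would pass open("blastn.tsv") instead.
sample = "RC_EF71414350_EF71414350\tNC_009065.1\t98.214\t672\t8\t4\t13\t680\t5513\t6184\t0.0\t1171\n"
hits = read_blast6(io.StringIO(sample))
# Number of query bases covered by the alignment (qend - qstart + 1).
query_span = hits[0]["qend"] - hits[0]["qstart"] + 1
print(hits[0]["qseqid"], hits[0]["pident"], query_span)
```

Dividing the query span by the query sequence length gives the query coverage. The sequence length is not part of format 6, so it has to come from the input FASTA.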
sangerseq_viewer produces a chromatogram in PDF format for each species folder present in the input folder. These chromatograms display all sequences found inside a particular species folder and show the mapped GenBank genes. An example produced from case data is presented below:
MUSCLE produces multiple alignment files, all with the .fa suffix. The alignment with the highest column confidence is extracted from the diversified ensemble and opened in Jalview for review by the user. Additionally, an HTML file containing the letter confidence values is produced for the diversified ensemble.
MEGA generates phylogenetic trees in Newick format using the parameters the user specified in the mao file. The Newick trees and the consensus tree are both placed in the output folder. Summary and partition text files are also saved in the output folder.
DBA uses the following tools in the pipeline: