This software converts the results of a PacBio assembly produced by FALCON into a FASTG graph that can be visualized using Bandage.
python Falcon2Fastg.py [--only-output=reads|contigs]
The script can be run in the output directory of a FALCON assembly (2-asm-falcon). Please make sure to copy the preads4falcon.fasta file from the intermediate directory (1-preads_ovl) to the output directory (2-asm-falcon).
Falcon2Fastg needs the following 6 input files:
- preads4falcon.fasta
- sg_edges_list
- utg_data (if --only-output is unset, or set to contigs)
- ctg_paths (if --only-output is unset, or set to contigs)
- p_ctg.fa (if --only-output is unset, or set to contigs)
- p_ctg_tiling_path (if --only-output is unset, or set to contigs)
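Before running the tool, it can help to confirm that the required files are present. The following is a minimal sketch based only on the file list above; the helper `missing_inputs` is hypothetical and is not part of Falcon2Fastg.

```python
import os

# File names come from the list above. The helper itself is a
# hypothetical illustration and is not part of Falcon2Fastg.
ALWAYS_REQUIRED = ["preads4falcon.fasta", "sg_edges_list"]
CONTIG_INPUTS = ["utg_data", "ctg_paths", "p_ctg.fa", "p_ctg_tiling_path"]


def missing_inputs(directory, only_output=None):
    """Return the required input files that are absent from `directory`.

    `only_output` mirrors the --only-output option:
    None (unset), "reads" or "contigs".
    """
    if only_output not in (None, "reads", "contigs"):
        raise ValueError("only_output must be None, 'reads' or 'contigs'")
    required = list(ALWAYS_REQUIRED)
    # The contig-related files are needed unless only the reads graph is requested.
    if only_output != "reads":
        required += CONTIG_INPUTS
    return [name for name in required
            if not os.path.isfile(os.path.join(directory, name))]
```

For example, `missing_inputs("2-asm-falcon")` returns an empty list when all six files are in place.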
Biopython (available at http://biopython.org/wiki/Download)
pyfaidx (available at https://github.com/mdshw5/pyfaidx)
Quick installation of dependencies:
pip install biopython pyfaidx # add --user if you don't have root
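To confirm that both dependencies are importable after installation, a short check such as the one below can be used. This is a sketch rather than part of the tool; it relies on the fact that Biopython is imported as `Bio` and pyfaidx as `pyfaidx`, and the helper name `missing_modules` is made up for illustration.

```python
import importlib.util


def missing_modules(names=("Bio", "pyfaidx")):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing modules:", ", ".join(missing) if missing else "none")
```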
The output of the tool is two FASTG files (reads.fastg and contigs.fastg) that can be opened with Bandage.
Additionally, the tool produces a CSV file, ReadsInContigs.csv, that can be loaded into Bandage. It labels each read with the contig it is a part of, along with its mapping position within that contig.
Above is a sample Bandage visualization of a reads.fastg file generated by Falcon2Fastg from a FALCON assembly (a plant mitochondrial genome).
- Each node is a read, and each node is represented as a colored strip (colors are random)
- Edges represent the overlaps between reads found by FALCON (better viewed in the zoomed-in image below)
- Only the edges used in the string graph ("G" flagged in sg_edges_list) are used by Falcon2Fastg to produce the output file.
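The edge selection described in the last point can be sketched as follows. This is not the tool's actual code. It assumes that sg_edges_list is whitespace-separated, that the first two fields are the source and target nodes, and that the flag is the last field; the function name and the sample line in the usage note are made up for illustration.

```python
def string_graph_edges(lines):
    """Yield (source, target) node pairs for edges flagged "G".

    Assumption: fields are whitespace-separated, the first two are the
    source and target nodes, and the flag is the last field.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        if fields[-1] == "G":
            yield fields[0], fields[1]
```

With a made-up line such as `000000001:B 000000002:E 000000002 0 1200 1200 99.80 G`, the generator yields the pair `("000000001:B", "000000002:E")`. Lines carrying any other flag are skipped.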
Zooming in on a smaller set of nodes shows the edges in black, connecting the colored nodes:
For benchmarking, Falcon2Fastg was run on the preads4falcon.fasta and sg_edges_list files produced from the E. coli test dataset provided with the FALCON install. Instructions on obtaining the dataset are here: https://github.com/PacificBiosciences/FALCON/wiki/Setup:-Complete-example
Execution of Falcon2Fastg took 2 minutes on a desktop computer (size of preads4falcon.fasta: 449 MB).
The figure below represents a visualization of this E. coli data.
Falcon2Fastg can also be used to visualize the contigs produced by FALCON and the overlaps between them. The contig graph is written to contigs.fastg. By default, Falcon2Fastg outputs this file. To output only the reads graph, use the --only-output=reads parameter.
To test this visualization mode, we assembled Drosophila melanogaster reads available at:
https://github.com/PacificBiosciences/DevNet/wiki/Drosophila-sequence-and-assembly
The input file was 2.2 GB in size (dmel_FALCON_preassembled_reads.fasta).
FALCON assembly parameters were not optimized, and were as follows:
length_cutoff = 3000, length_cutoff_pr = 6000, overlap_filtering_setting = --max_diff 100 --max_cov 100 --min_cov 20
The final p_ctg.fa file had 642 contigs with a total length of ~27 Mbp.
Execution of Falcon2Fastg took 5 minutes on a desktop computer (size of preads4falcon.fasta: 2.2 GB).
The figure below is the visualization of these D. melanogaster contigs (colors are random).
Bandage provides a way to visualize k-mer coverage, as reported by the assembler. As FALCON is a string graph assembler, it does not report such information. Ideally, to compute the coverage of a contig, one would need to re-map the reads back to the assembled contigs. Here, we report a simpler metric that is easy to compute from the output of FALCON.
Read density is calculated as (sum of the lengths of all reads used by FALCON to construct the contig) / (length of the contig). We believe that variation in read density reflects variation in coverage.
The figure below is a schematic of read density. The blue arrows represent reads that were used by FALCON to create the black (resp. red) contig. The contig above (black) has fewer reads within it; its read density is around 2.0. The contig below (red) has more reads within it; its read density is around 5.0.
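The read density formula can be written as a small worked example. This is a sketch of the stated formula only, not the tool's own code; the function name and the read and contig lengths used below are made up for illustration.

```python
def read_density(read_lengths, contig_length):
    """Sum of read lengths divided by contig length, as defined above."""
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    # float() keeps the division exact-valued under Python 2 as well.
    return sum(read_lengths) / float(contig_length)
```

With these hypothetical numbers, four reads of 5,000 bp in a 10,000 bp contig give `read_density([5000] * 4, 10000) == 2.0`, and ten such reads give `5.0`.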
The figure below is the visualization of the same D. mel. contigs, colored by read density.
Zooming in shows that bright red represents higher read density (6.0x), while contigs colored black have a lower read density (2.0x).
The pyfaidx module is used to read an entire FASTA file into memory. If the size of your preads4falcon.fasta is greater than the amount of available RAM, it is advisable to run this computation on a server with greater available memory.
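A rough way to compare the size of the FASTA file against a memory budget is sketched below. Both helper names are hypothetical. `physical_memory_bytes` reports total physical RAM rather than currently available memory, and it depends on `os.sysconf` names that are not defined on every platform, so it returns `None` where they are missing.

```python
import os


def fasta_exceeds_memory(fasta_path, available_bytes):
    """Return True if the FASTA file is larger than the given memory budget."""
    return os.path.getsize(fasta_path) > available_bytes


def physical_memory_bytes():
    """Total physical RAM in bytes, or None where os.sysconf lacks these names.

    Assumption: this is total RAM, which only approximates what is available.
    """
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None
```

The file's on-disk size is only a lower bound on what is needed, so a file that fits by a small margin may still exhaust memory.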
- Reads within "contained" unitigs are not used in the calculation of read density.
- Read density is calculated by dividing the total length of all reads in the contig by the length of the contig (obtained from ctg_paths). Depending on the orientation, FALCON ignores either the first read or the last read while reporting a contig. Because of this, the forward and rev_comp entries in the contigs.fastg file might have different read densities and different lengths. Any large differences are mostly restricted to short contigs, where one very long read at either extremity can affect the length of the contig.
- Read density is set to "1" for entries in reads.fastg, as this measure is only relevant for contigs.fastg.
Please see the test/ directory for a small example dataset and output.
FALCON can be installed by following the instructions here: https://github.com/PacificBiosciences/FALCON/wiki/Setup:-Complete-example
Additional tools for visualizing read overlap can be found in the utils directory. Please consult utils/README.md for details.
This content is released under the MIT License. Please see LICENSE.md for details.
Primary author: Samarth Rangavittal, The Pennsylvania State University (szr165@psu.edu)
Rayan Chikhi, University of Lille 1
Jean-Stéphane Varré, University of Lille 1