-
Notifications
You must be signed in to change notification settings - Fork 5
Long read mapping
ecc_finder Version: v1.0.0
To identify eccDNA loci by mapping to a reference genome.
usage: ecc_finder.py map-ont <reference.idx> <query.fq> -r <reference.fa> (option)
A tool to detect eccDNA loci using ONT sequencing
positional arguments:
<reference.idx> index file of reference genome
<query.fq> query fastq/fasta file (uncompressed or bgzipped)
optional arguments:
-h, --help show this help message and exit
-r <query.fasta> reference genome fasta file (uncompressed or bgzipped)
map options:
-t INT number of CPU threads for mapping mode
-g STR reference genome size larger than 4Gb [yes]
-q INT minimum query length [200]
-a INT minimum alignment length [200]
--five-prime STR 5' adapter sequence (sense strand) [NULL]
--three-prime STR 3' adapter sequence (anti-sense strand) [NULL]
peak-calling options:
-l INT minimum length of a peak [200]
-d INT maximum distance between signif. sites [1000]
-p FLT maximum p-value [0.05]
validation options:
-n INT minimum copy number of tandem repeat in a long read [2]
-e FLT maximum allowed divergence rate between two consecutive repeats [0.25]
-s INT minimum period size of tandem repeat (>=2) [30]
--min-read INT filter locus by unique mapped read number [3]
--min-bound FLT filter locus at regions by boundary coverage (# aligned bases / boundary bases)[0.8]
--min-cov FLT minimum coverage of detected eccDNA loci [10]
output options:
-o PATH output directory [./eccFinder_output]
-w overwrite intermediate files
-x X add prefix to output [ecc.ont]
** The reference fasta, idx and query files are required **
ecc_finder maps sequences in <query.fa>
to the index file of reference <reference.idx>
. The <query.fa>
files can be uncompressed or bgzipped.
Please call -g yes
for the reference genome whose genome size is greater than 4G. The --split-prefix=tmp
parameter will be used in minimap2 to generate the proper SAM file with a header.
According to minimap2, "By default, minimap2 indexes 4 billion reference bases (4Gb) in a batch and map all reads against each reference batch. Given a reference longer than 4Gb, minimap2 is unable to see all the sequences and thus can't produce a correct SAM header. In this case, minimap2 doesn't output any SAM header." And therefore one solution is to increase index more reference bases in a batch, which is preferred if your machine has enough memory. Another solution is to use the --split-prefix option in a command line.
Use -a
to set minimum alignment length for query to remove satellites, 200bp as default.
Except for long reads of satellites, when performing self-alignment, linear reads will not repeat itself while circular reads will be repeated two or more times because it goes through the rolling circle amplification experimentally.
Therefore, the circular sequence will have a sub-read alignment in the same direction, and this sub-read alignment will be repeated two or more times on the same boundary.
Increase the value in --min-read
will produce locus with more supported number of tandemly repeated long reads, 3 as default.
By default, ecc_finder places all output and intermediate files in a directory named eccFinder_output
, but this can be changed with -o
. ecc_finder will not overwrite intermediate files that already exist in the output directory. This is to save time producing expensive alignment files. Users can set -w
to overwrite any preexisting files.
Use the -x
option to add the "ecc.ont" prefix to each sequence in the output.
All output is in eccFinder_output
, or whichever directory -o
specifies.
Overview
Col | Type | Description |
---|---|---|
1 | folder | Folder containing alignment files (align_files) |
2 | folder | Folder containing peak calling files (peak_files) |
3 | file | CSV file of the eccDNA loci: ecc.{ont,sr}.csv |
4 | file | FASTA file of the eccDNA sequence: ecc.{ont,sr}.fasta |
5 | file | BED file of genomic enriched sites: ecc.{ont,sr}.site.bed |
6 | file | Plot of the size distribution: ecc.{ont,sr}.distribution.png |
**ecc.{ont,sr}.fasta **
The eccDNA sequence in FASTA format.
**ecc.{ont,sr}.csv **
The eccDNA locus in csv format.
Col | Type | Description |
---|---|---|
1 | string | Reference sequence name |
2 | int | Reference start on original strand |
3 | int | Reference end on original strand |
4 | int | Circular read number at the locus |
5 | int | Repeat units of all circular reads |
6 | int | Read coverage at the locus |
7 | int | EccDNA sequence length |
ecc.{ont,sr}.distribution.png
The Size distribution of detected eccDNA in png format.