Skip to content

Long read mapping

Ivy edited this page Oct 6, 2021 · 12 revisions

ecc_finder Version: v1.0.0

To identify eccDNA loci by mapping to a reference genome.

Algorithm overview

Long read pipeline algorithm overview

Usage

usage: ecc_finder.py map-ont <reference.idx> <query.fq> -r <reference.fa> (option)

A tool to detect eccDNA loci using ONT sequencing

positional arguments:
  <reference.idx>    index file of reference genome
  <query.fq>         query fastq/fasta file (uncompressed or bgzipped)

optional arguments:
  -h, --help         show this help message and exit
  -r <query.fasta>   reference genome fasta file (uncompressed or bgzipped)

map options:
  -t INT             number of CPU threads for mapping mode
  -g STR             reference genome size larger than 4Gb [yes]
  -q INT             minimum query length [200]
  -a INT             minimum alignment length [200]
  --five-prime STR   5' adapter sequence (sense strand) [NULL]
  --three-prime STR  3' adapter sequence (anti-sense strand) [NULL]

peak-calling options:
  -l INT             minimum length of a peak [200]
  -d INT             maximum distance between signif. sites [1000]
  -p FLT             maximum p-value [0.05]

validation options:
  -n INT             minimum copy number of tandem repeat in a long read [2]
  -e FLT             maximum allowed divergence rate between two consecutive repeats [0.25]
  -s INT             minimum period size of tandem repeat (>=2) [30]
  --min-read INT     filter locus by unique mapped read number [3]
  --min-bound FLT    filter locus at regions by boundary coverage (# aligned bases / boundary bases)[0.8]
  --min-cov FLT      minimum coverage of detected eccDNA loci [10]

output options:
  -o PATH            output directory [./eccFinder_output]
  -w                 overwrite intermediate files
  -x X               add prefix to output [ecc.ont]

** The reference fasta, idx and query files are required **

mapping options

ecc_finder maps sequences in <query.fa> to the index file of reference <reference.idx>. The <query.fa> files can be uncompressed or bgzipped.

Please call -g yes for the reference genome whose genome size is greater than 4G. The --split-prefix=tmp parameter will be used in minimap2 to generate the proper SAM file with a header.

According to minimap2, "By default, minimap2 indexes 4 billion reference bases (4Gb) in a batch and map all reads against each reference batch. Given a reference longer than 4Gb, minimap2 is unable to see all the sequences and thus can't produce a correct SAM header. In this case, minimap2 doesn't output any SAM header." And therefore one solution is to increase index more reference bases in a batch, which is preferred if your machine has enough memory. Another solution is to use the --split-prefix option in a command line.

Use -a to set minimum alignment length for query to remove satellites, 200bp as default.

validating options

Except for long reads of satellites, when performing self-alignment, linear reads will not repeat itself while circular reads will be repeated two or more times because it goes through the rolling circle amplification experimentally.

Therefore, the circular sequence will have a sub-read alignment in the same direction, and this sub-read alignment will be repeated two or more times on the same boundary.

Increase the value in --min-read will produce locus with more supported number of tandemly repeated long reads, 3 as default.

Output

By default, ecc_finder places all output and intermediate files in a directory named eccFinder_output , but this can be changed with -o. ecc_finder will not overwrite intermediate files that already exist in the output directory. This is to save time producing expensive alignment files.

Use the -x option to add the "ecc.ont" prefix to each sequence in the output.

Overview

Col Type Description
1 folder Folder containing alignment files (align_files)
2 folder Folder containing peak calling files (peak_files)
3 file CSV file of the eccDNA loci: ecc.{ont,sr}.csv
4 file FASTA file of the eccDNA sequence: ecc.{ont,sr}.fasta
5 file BED file of genomic enriched sites: ecc.{ont,sr}.site.bed
6 file Plot of the size distribution: ecc.{ont,sr}.distribution.png

**ecc.{ont,sr}.fasta **

The eccDNA sequence in FASTA format.

**ecc.{ont,sr}.csv **

The eccDNA locus in csv format.

Col Type Description
1 string Reference sequence name
2 int Reference start on original strand
3 int Reference end on original strand
4 int Circular read number at the locus
5 int Repeat units of all circular reads
6 int Read coverage at the locus
7 int EccDNA sequence length

ecc.{ont,sr}.distribution.png

The Size distribution of detected eccDNA in png format.

Size_distribution

Video

Run Example1: watch the video Long-read-mapping_Video_example using the Arabidopsis eccDNA sequencing subsample in the folder test_samples.

Clone this wiki locally