Merge pull request #15 from Runsheng/dev
Dev pull
Showing 16 changed files with 256 additions and 257 deletions.
@@ -0,0 +1,27 @@
# TrackCluster
![PyPI](https://img.shields.io/pypi/v/trackcluster?color=green)

Trackcluster is an isoform calling and quantification pipeline for long RNA/cDNA reads.

Walkthrough for using trackcluster for isoform calling in worm/mouse/human/virus datasets.

## preprocessing
1. Long read QC

Some basic characteristics of the data should be known before further analysis: the read length and mapped
read length, the estimated read quality, the mapping quality and the read mapping rate.

We recommend the following tool for long read QC:

Giraffe: https://github.com/lrslab/Giraffe_View

2. Mapping Nanopore long reads to the genome

Mapping is the first step of isoform calling.
We recommend minimap2 for mapping the long reads to the genome.
- Mapping the reads for eukaryotic genomes
```bash
minimap2 -ax splice -uf -k14 --secondary=no --MD -t 8 ref.fa read.fq.gz | samtools view -bS - | samtools sort -o read.bam -
```
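
Once the reads are mapped, the read mapping rate listed in the QC step can be estimated from the alignments. The following is a minimal sketch, assuming the alignments are available as SAM text records (for example the output of `samtools view read.bam`); the function name `mapping_rate` is hypothetical and is not part of trackcluster.

```python
def mapping_rate(sam_lines):
    """Return (mapped_reads, total_reads, rate) from SAM text records.

    Only primary records are counted, so each read contributes once:
    secondary (0x100) and supplementary (0x800) alignments are skipped.
    """
    total = 0
    mapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):  # skip blanks and header lines
            continue
        flag = int(line.split("\t")[1])  # the FLAG is the second SAM column
        if flag & 0x100 or flag & 0x800:
            continue
        total += 1
        if not flag & 0x4:  # 0x4 marks an unmapped read
            mapped += 1
    rate = mapped / total if total else 0.0
    return mapped, total, rate


# toy records: only the FLAG column matters for this calculation
records = [
    "read1\t0\tchrI\t100\t60\t50M\t*\t0\t0\t*\t*",
    "read2\t16\tchrI\t500\t60\t50M\t*\t0\t0\t*\t*",
    "read2\t2064\tchrI\t900\t60\t20M\t*\t0\t0\t*\t*",  # supplementary, skipped
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",  # unmapped
]
print(mapping_rate(records))
```

On real data the same function could be fed `sys.stdin` while piping in the output of `samtools view`.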
@@ -0,0 +1,33 @@
# long read QC for RNA reads
The read length and the ratio of read length to real transcript length (full length ratio) are essential for the downstream analysis.
For instance, we do not suggest isoform calling when the full length ratio is lower than 10%. However, read counting can still
be done.

### Basic statistics

Some basic characteristics of the data should be known before further analysis: the read length and mapped
read length, the estimated read quality, the mapping quality and the read mapping rate.

We recommend the following tool for long read QC:

Giraffe: https://github.com/lrslab/Giraffe_View
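
For a quick look at read lengths without a dedicated tool, the basic statistics can also be computed directly. This is a minimal sketch, assuming an uncompressed FASTQ with four lines per record; the helper names `read_lengths_from_fastq` and `length_summary` are hypothetical and do not come from Giraffe or trackcluster.

```python
import io


def read_lengths_from_fastq(handle):
    """Yield the sequence length of every record in a 4-line-per-record FASTQ."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of each record holds the sequence
            yield len(line.strip())


def length_summary(lengths):
    """Return the read count, mean length and N50 of the given read lengths."""
    lengths = sorted(lengths, reverse=True)
    if not lengths:
        return {"count": 0, "mean": 0.0, "n50": 0}
    total = sum(lengths)
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        if running * 2 >= total:  # first length at which half of all bases are covered
            n50 = length
            break
    return {"count": len(lengths), "mean": total / len(lengths), "n50": n50}


# toy FASTQ with three reads of length 8, 4 and 2
toy_fastq = io.StringIO("@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGT\n+\nIIII\n@r3\nAC\n+\nII\n")
summary = length_summary(read_lengths_from_fastq(toy_fastq))
print(summary)
```

For a gzipped file, the handle could be opened with `gzip.open(path, "rt")` from the standard library instead of `io.StringIO`.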

### Full length read ratio estimation

1. Common case, with no 5' indicator.

Sequencing of most long reads starts from the 3' end (polyA site), so a 5' indicator such as the splicing leader
or an artificial sequence added after de-capping can indicate whether a read is likely to be full length. For most
sequencing reads, however, this resource is not available. As a result, we use a >95% cutoff on the relative read length to estimate the full length ratio.

2. Special case, with 5' indicator.
- Splicing leaders: Some species, like C. elegans, use both cis- and trans-splicing. The 80% of transcripts that are trans-spliced have the splicing leader sequence at the 5' end of
the mRNA reads. In this case, we can use the splicing leader sequence to estimate the full length ratio. The remaining splicing
leader sequences are short (~22 nt) sequences hanging at the 5' end of the reads.
- Artificial sequence: Some long reads have an artificial sequence added after de-capping.
This artificial sequence can be used to indicate a full length read. For example, Cappable-seq reads are generated by
removing the 5' G cap and adding a 5' artificial sequence. Older methods like 5' RACE also give users
some 5' sequence that indicates the start of a transcript.
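
Both cases can be sketched in a few lines. The example below is illustrative only and makes several assumptions: the real transcript length of each read is taken from an annotation, the 95% cutoff is applied to read length divided by real length, and the 5' indicator is detected by an exact match of its last bases (real reads carry errors, so a tolerant match would be needed in practice). The function names and the toy indicator sequence are hypothetical.

```python
def full_length_ratio(read_lengths, real_lengths, cutoff=0.95):
    """Fraction of reads whose length is above `cutoff` of the real transcript length."""
    if len(read_lengths) != len(real_lengths):
        raise ValueError("one real length is needed per read")
    if not read_lengths:
        return 0.0
    full = sum(1 for read, real in zip(read_lengths, real_lengths)
               if real > 0 and read / real > cutoff)
    return full / len(read_lengths)


def has_5prime_indicator(read_seq, indicator, min_match=10, window=30):
    """Check whether the 3' end of `indicator` occurs near the 5' end of the read.

    Reads often lose some 5' bases, so only the last `min_match` bases of the
    indicator are searched, as an exact match, in the first `window` read bases.
    """
    tail = indicator[-min_match:]
    return tail in read_seq[:window]


# case 1: no 5' indicator, compare read length with the annotated length
ratio = full_length_ratio([980, 500, 1200, 300], [1000, 1000, 1250, 1500])
print(ratio)

# case 2: toy 22 nt indicator (not a real splicing leader sequence)
toy_indicator = "AAAACCCCGGGGTTTTACGTAC"
read_with = "GGGGTTTTACGTAC" + "ATGGCTAGCTAGGATCC"
read_without = "ATGGCTAGCTAGGATCCATGGCTAGCTAGGATCC"
print(has_5prime_indicator(read_with, toy_indicator),
      has_5prime_indicator(read_without, toy_indicator))
```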
@@ -0,0 +1,19 @@
# Trackcluster design
## History of ONT read accuracy and trackcluster
Direct-RNA sequencing from ONT used to have very low quality, especially for non-human samples. The raw read accuracy
is roughly 85% for the RNA001 kit with the R9.4.1 flowcell. With a 15% error rate, the read junctions are also error prone, which
makes isoform calling and quantification using junctions very difficult. To accommodate the low quality reads, we
designed trackcluster to compare the intersection of exon/intron regions to determine the distance between reads,
and to correct the junctions after clustering all similar reads from one isoform.

The regional intersection method (original trackcluster) worked well for RNA001/002 data in model organisms like _C. elegans_ and _Arabidopsis thaliana_,
as the read count for each isoform is limited. The total yield of one flowcell is around 1-2 Gb. For one experiment, the
overall yield is usually below 10 Gb, and the read count for highly expressed genes is generally less than 50,000.
However, the method is not fast enough, as we have to calculate the intersection for every pair of overlapping reads. The
time complexity is O(n^c), which is not acceptable for genes with more than 50,000 reads (the computation may take 24 h).
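
The cost of the regional intersection idea can be seen in a small sketch. The code below is not trackcluster's implementation: it assumes each read is a sorted list of half-open `(start, end)` exon intervals and uses an intersection-over-union distance chosen only for illustration. The point is the all-vs-all loop at the end, which grows quickly with the number of reads per gene.

```python
from itertools import combinations


def overlap_length(exons_a, exons_b):
    """Total number of bases shared by two sorted, non-overlapping exon lists."""
    i = j = shared = 0
    while i < len(exons_a) and j < len(exons_b):
        start = max(exons_a[i][0], exons_b[j][0])
        end = min(exons_a[i][1], exons_b[j][1])
        if start < end:
            shared += end - start
        # advance the interval that ends first
        if exons_a[i][1] < exons_b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def exon_distance(exons_a, exons_b):
    """1 - intersection/union of exonic bases; 0 means identical exon coverage."""
    len_a = sum(end - start for start, end in exons_a)
    len_b = sum(end - start for start, end in exons_b)
    shared = overlap_length(exons_a, exons_b)
    union = len_a + len_b - shared
    return 1.0 - shared / union if union else 0.0


reads = {
    "r1": [(100, 200), (300, 400)],
    "r2": [(100, 200), (300, 400)],
    "r3": [(100, 200), (350, 450)],
}
# every pair of reads is compared, which is what becomes slow for large genes
pairs = {(a, b): exon_distance(reads[a], reads[b])
         for a, b in combinations(sorted(reads), 2)}
print(pairs)
```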

With the new RNA002 kit and new basecalling models, the read quality has improved to 92% for most samples. An error rate of 8% (instead of >15%)
allows junction self-correction before clustering. We therefore also included the junction self-correction
methods, together with clustering methods that use junctions (the trackclusterj method). The time complexity of trackclusterj is roughly O(nlogn) in most
cases, which is acceptable for highly expressed genes.
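
Why reliable junctions help can be illustrated with a short sketch. It is not the trackclusterj algorithm: it only assumes that, after self-correction, reads from the same isoform share an identical chain of junctions, so they can be grouped by a dictionary lookup in one pass instead of comparing every pair of reads. The function names are hypothetical.

```python
from collections import defaultdict


def junction_chain(exons):
    """Intron chain of a read: (donor, acceptor) pairs between consecutive exons."""
    return tuple((exons[k][1], exons[k + 1][0]) for k in range(len(exons) - 1))


def group_by_junctions(reads):
    """Group read names that share exactly the same junction chain."""
    groups = defaultdict(list)
    for name, exons in reads.items():
        groups[junction_chain(exons)].append(name)
    return dict(groups)


reads = {
    "r1": [(100, 200), (300, 400)],
    "r2": [(120, 200), (300, 380)],  # different ends, same junction
    "r3": [(100, 200), (350, 450)],  # different acceptor site
}
groups = group_by_junctions(reads)
print(groups)
```

Reads r1 and r2 differ at their ends but share the junction (200, 300), so they fall into one group, while r3 forms its own group.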
@@ -0,0 +1,58 @@
#!/usr/bin/env python
#-*- coding: utf-8 -*-
# @Time : 19/2/2024 2:14pm
# @Author : Runsheng
# @File : bigg2seq.py

"""
This script reads the exon information from a bigg file and
can be used to output the
transcript sequence, CDS sequence and Ensembl-like EXON(uppercase)+intron(lowercase or N) sequence.
"""
import argparse
import os,sys,inspect

# put the repository root on sys.path before importing trackcluster,
# so that a source checkout is found without installation
currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(os.path.dirname(currentdir))
sys.path.insert(0,parentdir)

from trackcluster.utils import fasta2dic
from trackcluster.tracklist import read_bigg

parser=argparse.ArgumentParser()
parser.add_argument("-b", "--biggfile",
                    help="the bigg bed file")
parser.add_argument("-r", "--reference",
                    help="the genome reference")
parser.add_argument("-o", "--out", default="bigg.fasta",
                    help="the output file name, default is bigg.fasta")
parser.add_argument("-m", "--mode", default="exon",
                    help="the format of the output, can be 'exon', 'cds', "
                         "or 'ensembl': Ensembl-like EXON(uppercase)+intron(lowercase) sequence, "
                         "default is 'exon' mode")

args = parser.parse_args(args=None if sys.argv[1:] else ['--help'])

# the 'ensembl' mode is described in the help text but has no branch in this script yet,
# so reject it (and any unknown mode) before the output file is created
if args.mode not in ("exon", "cds"):
    parser.error("mode '%s' is not implemented, use 'exon' or 'cds'" % args.mode)

# read the reference and the bigg records, then write one FASTA record per bigg
outfile=args.out
refdic=fasta2dic(args.reference)

bigg_l=read_bigg(args.biggfile)
with open(outfile, "w") as fw:
    if args.mode=="exon":
        for bigg in bigg_l:
            bigg.get_exon()
            bigg.bind_chroseq(refdic, gap=0, intron=False)
            name=bigg.name
            seq=bigg.seq_chro
            fw.write(">"+name+"\n"+str(seq)+"\n")
    elif args.mode=="cds":
        for bigg in bigg_l:
            bigg.get_exon()
            bigg.get_cds()
            name=bigg.name
            seq=bigg.seq_cds
            fw.write(">"+name+"\n"+str(seq)+"\n")

# The original trailing loop wrote name, exon length, gene name and type as
# tab-separated lines into the same file, which would break the FASTA output:
# for bigg in bigg_l:
#     bigg.get_exon()
#     fw.write(bigg.name+"\t"+str(bigg.exonlen)+"\t"+bigg.geneName+"\t"+bigg.ttype+"\n")
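
The 'ensembl' output described in the help text (exons in uppercase, introns in lowercase) has no branch in the script above. As an illustration of the intended format only, here is a minimal sketch that works on plain Python data rather than on bigg objects; the helpers `ensembl_like_seq` and `fasta_record` are hypothetical, and the exon coordinates are assumed to be sorted, 0-based and half-open on the plus strand.

```python
def ensembl_like_seq(chrom_seq, exons):
    """Return exons in uppercase joined by lowercase introns.

    `exons` holds sorted, 0-based, half-open (start, end) pairs on `chrom_seq`.
    Plus strand only; reverse-complement the result for minus-strand records.
    """
    parts = []
    for k, (start, end) in enumerate(exons):
        parts.append(chrom_seq[start:end].upper())
        if k + 1 < len(exons):
            intron_end = exons[k + 1][0]
            parts.append(chrom_seq[end:intron_end].lower())
    return "".join(parts)


def fasta_record(name, seq, width=60):
    """Format one FASTA record with lines wrapped at `width` characters."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# toy chromosome: two 4 bp exons separated by a 6 bp intron
chrom = "ttACGTggtaagCCGGaa"
record = fasta_record("toy_tx", ensembl_like_seq(chrom, [(2, 6), (12, 16)]))
print(record)
```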