
Commit

Merge pull request #15 from Runsheng/dev
Dev pull
Runsheng authored Oct 15, 2024
2 parents 62fccba + 45fcb93 commit fcb2de4
Showing 16 changed files with 256 additions and 257 deletions.
5 changes: 4 additions & 1 deletion Readme.md
@@ -11,7 +11,10 @@ Trackcluster is an isoform calling and quantification pipeline for long RNA/cDNA
- [Walkthrough](#walkthrough)

####
Hint: the ongoing development can be found in the ["dev"](https://github.com/Runsheng/trackcluster/tree/dev) branch.
The ongoing development can be found in the ["dev"](https://github.com/Runsheng/trackcluster/tree/dev) branch.
#### TODO:
1. Fix the "fusion" classification.
2. Improve the speed of clusterj.
3. Splicing leader/5' indicator finding using pyssw.
4. Add a function to get the CDS start (maybe the ATG) and the CDS end.

## <a name="overview"></a>Overview
A pipeline for reference-based isoform identification and quantification using long reads. This pipeline was designed to use **only** long and noisy reads to build a valid transcriptome. An indicator of an intact 5' end, e.g. the splicing leader in nematode mRNAs, is very helpful to the pipeline.
27 changes: 27 additions & 0 deletions doc/index.md
@@ -0,0 +1,27 @@
# TrackCluster
![PyPI](https://img.shields.io/pypi/v/trackcluster?color=green)

Trackcluster is an isoform calling and quantification pipeline for long RNA/cDNA reads.


A walkthrough for using trackcluster for isoform calling on worm/mouse/human/virus datasets.

## Preprocessing
1. Long read QC

Some basic characteristics of the reads should be checked before further analysis: the read length and mapped read length, the estimated read quality, the mapping quality, and the read mapping rate (the mapping rate and mapped length can also be checked with the pysam sketch after the mapping step below).

We recommend the following tool for long-read QC:

Giraffe: https://github.com/lrslab/Giraffe_View


2. Mapping Nanopore long reads to the genome

Mapping is the first step of isoform calling. We recommend minimap2 for mapping long reads to the genome.
- Mapping reads to a eukaryotic genome
```bash
minimap2 -ax splice -uf -k14 --secondary=no --MD -t 8 ref.fa read.fq.gz | samtools view -bS - | samtools sort -o read.bam -
```
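
As a quick check of the mapping rate and mapped read length mentioned in step 1, the sorted BAM from the command above can be summarised with pysam. This is only a minimal sketch; pysam, the `read.bam` file name, and the exact statistics printed are assumptions for illustration, not part of trackcluster:

```python
import pysam  # assumed to be installed; not a trackcluster module

def mapping_stats(bam_path="read.bam"):
    """Report the read count, mapping rate, and mean mapped read length."""
    total, mapped, mapped_bases = 0, 0, 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(until_eof=True):
            # count each read once: skip secondary/supplementary records
            if read.is_secondary or read.is_supplementary:
                continue
            total += 1
            if not read.is_unmapped:
                mapped += 1
                mapped_bases += read.query_alignment_length
    print("reads:", total)
    print("mapping rate: %.3f" % (mapped / total if total else 0))
    print("mean mapped read length: %.1f" % (mapped_bases / mapped if mapped else 0))

if __name__ == "__main__":
    mapping_stats("read.bam")
```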
33 changes: 33 additions & 0 deletions doc/source/QC.md
@@ -0,0 +1,33 @@
# Long read QC for RNA reads
The read length and the ratio of read length to the real transcript length (the full-length ratio) are essential for downstream analysis. For instance, we do not suggest isoform calling when the full-length ratio is lower than 10%; read counting, however, can still be done.

### Basic statistics

Some basic characteristics of the reads should be checked before further analysis: the read length and mapped read length, the estimated read quality, the mapping quality, and the read mapping rate.

We recommend the following tool for long-read QC:

Giraffe: https://github.com/lrslab/Giraffe_View

### Full length read ratio estimation

1. Common case, with no 5' indicator.

Sequencing of most long reads starts from the 3' end (the polyA site), so a 5' indicator, such as a splicing leader or an artificial sequence added after de-capping, can tell whether a read is likely to be full length. For most sequencing data no such indicator is available; in that case we call a read full length when its length exceeds 95% of the length of its reference transcript, and use the fraction of such reads as the full-length ratio (see the first sketch after this list).

2. Special case, with a 5' indicator.
- Splicing leaders: some species, such as C. elegans, use both cis- and trans-splicing. About 80% of transcripts are trans-spliced and carry a splicing leader at the 5' end of the mRNA read; the retained splicing leader is a ~22 nt sequence hanging at the 5' end of the read. In this case the splicing leader can be used to estimate the full-length ratio (see the second sketch after this list).
- Artificial sequences: some library preparations add an artificial sequence after de-capping, and this sequence can mark a full-length read. For example, Cappable-seq reads are generated by removing the 5' G cap and ligating a 5' artificial sequence. Older approaches such as 5' RACE also give the user a 5' sequence that marks the start of a transcript.
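
A minimal sketch of the >95% cutoff from case 1, assuming the reads have been mapped to a reference transcriptome (so that each BAM reference sequence is one transcript) and that pysam is available; the file name `reads_vs_transcripts.bam` and the helper function are hypothetical:

```python
import pysam  # assumed dependency

def full_length_ratio(bam_path, cutoff=0.95):
    """Fraction of primary alignments covering more than `cutoff` of their reference transcript."""
    total, full = 0, 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(until_eof=True):
            if read.is_unmapped or read.is_secondary or read.is_supplementary:
                continue
            total += 1
            transcript_len = bam.get_reference_length(read.reference_name)
            if read.query_alignment_length / float(transcript_len) > cutoff:
                full += 1
    return full / total if total else 0.0

print(full_length_ratio("reads_vs_transcripts.bam"))  # hypothetical file name
```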


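For case 2, a toy sketch of splicing-leader detection by suffix matching near the 5' end of oriented reads; the SL1 sequence, the window and match-length cutoffs, and the example reads are assumptions for illustration (the TODO in the Readme mentions pyssw for a proper Smith-Waterman search):

```python
# Commonly cited C. elegans SL1 sequence (~22 nt); treated as an assumption here.
SL1 = "GGTTTAATTACCCAAGTTTGAG"

def has_splice_leader(read_seq, leader=SL1, window=60, min_match=15):
    """True if a 3' suffix of the leader (>= min_match nt) occurs near the read 5' end.

    Assumes the read is already oriented 5'->3'; truncated leaders keep only
    the 3' part of the SL, hence the suffix matching.
    """
    head = read_seq[:window].upper()
    for i in range(len(leader) - min_match + 1):
        if leader[i:] in head:
            return True
    return False

# Hypothetical reads, for illustration only
reads = {
    "read1": "GGTTTAATTACCCAAGTTTGAGATGTCTAAAGGTTACGGA",  # leader present
    "read2": "ATGTCTAAAGGTTACGGACCGTTGAAACTTGGTGGATTCA",  # no leader
}
n_sl = sum(has_splice_leader(seq) for seq in reads.values())
print("reads with a splicing leader at the 5' end:", n_sl, "/", len(reads))
```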

19 changes: 19 additions & 0 deletions doc/source/design.md
@@ -0,0 +1,19 @@
# Trackcluster design
## History of ONT read accuracy and trackcluster
Direct-RNA sequencing from ONT used to have very low quality, especially for non-human samples: the raw read accuracy is roughly 85% for the RNA001 kit on an R9.4.1 flowcell. With a 15% error rate the read junctions are also error-prone, which makes junction-based isoform calling and quantification very difficult. To accommodate such low-quality reads, trackcluster was designed to compare the intersection of exon/intron regions to determine the distance between reads, and to correct the junctions only after clustering all similar reads from one isoform.

The regional-intersection method (the original trackcluster) worked well for RNA001/002 data in model organisms such as _C. elegans_ and _Arabidopsis thaliana_, because the read count per isoform is limited: the total yield of one flowcell is around 1-2 Gb, the overall yield of one experiment is usually below 10 Gb, and even highly expressed genes generally have fewer than 50,000 reads. However, the method is not fast enough, because the intersection has to be calculated for every pair of overlapping reads. The time complexity is O(n^c), which is not acceptable for genes with more than 50,000 reads (the computation may take 24 h).

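To make the regional-intersection idea concrete, the sketch below computes an intersection-over-union similarity between the exon regions of two reads. It is an illustration of the concept only, not trackcluster's implementation, and the interval lists are made up:

```python
def exon_overlap_similarity(exons1, exons2):
    """Intersection-over-union of two exon interval lists.

    Each argument is a list of (start, end) genomic intervals; this is a toy
    per-base implementation of the idea, not trackcluster's own code.
    """
    def to_bases(exons):
        bases = set()
        for start, end in exons:
            bases.update(range(start, end))
        return bases

    b1, b2 = to_bases(exons1), to_bases(exons2)
    union = b1 | b2
    return len(b1 & b2) / len(union) if union else 0.0

# Two reads from the same gene: nearly identical exon chains score close to 1.
read_a = [(100, 200), (300, 400), (500, 650)]
read_b = [(105, 200), (300, 400), (500, 640)]
print(exon_overlap_similarity(read_a, read_b))  # ~0.96
```

Computing such a similarity for every overlapping read pair is what makes the original method slow for highly expressed genes.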

With the newer RNA002 kit and basecalling models, the read accuracy has improved to about 92% for most samples, and an 8% (instead of >15%) error rate allows the junctions to be self-corrected before clustering. We therefore added junction self-correction and a junction-based clustering method (the trackclusterj method). The time complexity of trackclusterj is roughly O(n log n) in most cases, which is acceptable for highly expressed genes.
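
Once the junctions have been corrected, reads from the same isoform share the same junction chain, so clustering can largely be reduced to grouping reads by that chain. The sketch below is a deliberately simplified illustration of this idea (the real method also has to handle truncated reads whose chain is a sub-chain of another's); the read names and coordinates are made up:

```python
from collections import defaultdict

def cluster_by_junction_chain(read_junctions):
    """Group reads whose corrected junction chains are identical.

    read_junctions: dict mapping read name -> tuple of junction coordinates.
    Grouping by a hashable key avoids comparing every pair of reads, which is
    why the junction-based approach scales roughly as O(n log n) or better.
    """
    clusters = defaultdict(list)
    for name, chain in read_junctions.items():
        clusters[chain].append(name)
    return clusters

reads = {
    "read1": (200, 300, 400, 500),
    "read2": (200, 300, 400, 500),  # same corrected chain as read1
    "read3": (200, 300, 450, 500),  # alternative junction
}
print(cluster_by_junction_chain(reads))
```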
58 changes: 58 additions & 0 deletions script/bigg2seq.py
@@ -0,0 +1,58 @@
#!/usr/bin/env python
#-*- coding: utf-8 -*-
# @Time : 19/2/2024 2:14pm
# @Author : Runsheng
# @File : bigg2seq.py

"""
This script writes sequences derived from the exon information in a bigg bed file.
It can output the transcript sequence, the CDS sequence, or an Ensembl-like
EXON(uppercase)+intron(lowercase or N) sequence.
"""
import argparse
import os, sys, inspect

# make the repository root importable before importing trackcluster modules
currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(os.path.dirname(currentdir))
sys.path.insert(0, parentdir)

from trackcluster.utils import fasta2dic
from trackcluster.tracklist import read_bigg, write_bigg

parser=argparse.ArgumentParser()
parser.add_argument("-b", "--biggfile",
help="the bigg bed file")
parser.add_argument("-r", "--reference",
help="the genome reference")
parser.add_argument("-o", "--out", default="bigg.fasta",
help="the output file name, default is bigg.fasta")
parser.add_argument("-m", "--mode", default="exon",
                    help="the format of the output: 'exon', 'cds', "
                         "or 'ensembl' (Ensembl-like EXON(uppercase)+intron(lowercase) sequence); "
                         "default is 'exon'")

args = parser.parse_args(args=None if sys.argv[1:] else ['--help'])

# make a file using the functions
outfile=args.out
refdic=fasta2dic(args.reference)

bigg_l=read_bigg(args.biggfile)
with open(outfile, "w") as fw:
    if args.mode=="exon":
        for bigg in bigg_l:
            bigg.get_exon()
            bigg.bind_chroseq(refdic, gap=0, intron=False)
            # write one fasta record per track
            fw.write(">"+bigg.name+"\n"+bigg.seq_chro+"\n")
    elif args.mode=="cds":
        for bigg in bigg_l:
            bigg.get_exon()
            bigg.get_cds()
            fw.write(">"+bigg.name+"\n"+bigg.seq_cds+"\n")
    else:
        # the 'ensembl' mode (EXON uppercase + intron lowercase) described in the
        # docstring is not implemented here yet
        sys.exit("mode '%s' is not implemented" % args.mode)
8 changes: 6 additions & 2 deletions script/trackrun.py
@@ -174,6 +174,8 @@ def cluster(self):
if args.prefix is None:
args.prefix=get_file_prefix(args.sample,sep=".")

find_novelgene=False if args.find_novelgene=="no" else True

#args = parser.parse_args(args=None if sys.argv[2:] else ['--help'])

flow_cluster_all_gene_novel(wkdir=args.folder,
@@ -188,7 +190,7 @@ def cluster(self):
intronweight=args.intronweight,
cutoff1=args.cutoff1,
cutoff2=args.cutoff2,
find_novelgene=args.find_novelgene)
find_novelgene=find_novelgene)


def clusterj(self):
@@ -224,6 +226,8 @@ def clusterj(self):
if args.prefix is None:
args.prefix=get_file_prefix(args.sample,sep=".")

find_novelgene=False if args.find_novelgene=="no" else True

flow_clusterj_all_gene_novel(wkdir=args.folder,
prefix=args.prefix,
nano_bed=args.sample,
@@ -233,7 +237,7 @@
f2=args.intersect2,
count_cutoff=args.count,
batchsize=args.batchsize,
find_novelgene=args.find_novelgene)
find_novelgene=find_novelgene)

def count(self):
parser = argparse.ArgumentParser(
2 changes: 1 addition & 1 deletion test/clustercj_test.py
@@ -80,7 +80,7 @@ def test_get_read_junction_dic(self):
print((df["1f940a05-24a1-4455-a506-b1aa04caf81a"]-df["087b20ed-78ba-48f9-a038-f23a9c4b75c6"]))
print((df["087b20ed-78ba-48f9-a038-f23a9c4b75c6"]-df["1f940a05-24a1-4455-a506-b1aa04caf81a"]))

def test_get_arrry_freq_is_junction_inside(self):
def __test_get_arrry_freq_is_junction_inside(self):
self.site_dic=get_junction_dic(self.bigg)
df=get_read_junction_dic(self.bigg, self.site_dic)
#j1=df["087b20ed-78ba-48f9-a038-f23a9c4b75c6"]
11 changes: 7 additions & 4 deletions test/clusterj_test.py
@@ -3,6 +3,7 @@
# @Time : 8/9/2018 10:53 AM
# @Author : Runsheng
# @File : cluster_test.py
import os

from trackcluster.clusterj import *
from trackcluster.tracklist import *
@@ -32,9 +33,10 @@ def setUp(self):
self.bigg_nano=bigg_nano
self.bigg_gff=bigg_gff
self.bigg=bigg_nano+bigg_gff
self.gene=gene


def test_get_corrected_junction(self):
def __test_get_corrected_junction(self):
import random
from copy import deepcopy
#bigg=self.bigg[0]
@@ -78,11 +80,12 @@ def test_is_bigg1_inside_bigg2_junction(self):

def test_junction_simple_merge(self):

print(len(self.bigg))
print(self.gene, "track size is ", len(self.bigg))
bigg_n=junction_simple_merge(self.bigg)

print(len(bigg_n))
write_bigg(bigg_n, "./test/genes/AT2G43410/tt.bed")
print(os.getcwd())
write_bigg(bigg_n, "./genes/{gene}/tt.bed".format(gene=self.gene))

def test_is_single_exon_in(self):
# get single and note single
@@ -110,7 +113,7 @@ def test_flow_junction_cluster(self):
#write_bigg(bigg_subread, "./test/genes/AT2G43410/AT2G43410_subread.bed")

def test_bigg_get_namedic(self):
bigg_list=read_bigg("genes/AT2G43410/AT2G43410_subread.bed")
bigg_list=read_bigg("./genes/AT2G43410/AT2G43410_nano.bed")
name_dic=bigg_get_namedic(bigg_list)
print(name_dic)

10 changes: 6 additions & 4 deletions test/track_test.py
@@ -148,14 +148,16 @@ def test_orfs(self):
can only run with ce10 ref
"""
# ref_dic is a large external file
ref_dict=fasta2dic("/data/reference/ce10_ucsc.fa")
ref_dict=fasta2dic("/data/reference/ce11.fa")
bigg_one=self.bigg[30]
print(bigg_one)
bigg_one.bind_chroseq(ref_dict, gap=0, intron=False)
print((bigg_one.seq_chro))
ans=bigg_one.find_orfs_with_trans()
bigg_one.orf_find(refdic=ref_dict)
print((bigg_one.seq_mrna))

print(ans)
print(bigg_one)
print()
print(bigg_one.thickStart, bigg_one.thickEnd)

def test_orf(self):
pass
4 changes: 4 additions & 0 deletions test/utils_test.py
@@ -50,6 +50,10 @@ def test_detail_file(self):
logger.debug("This is just a test, print number 100 as float" + "," + str(float(1) * 100))
logger.debug("In total %d line, %s" % (2, "just a test number"))

def test_find_longest_orf(self):
seq="AAATGAACTTGACGAGTAATGAG"
print(find_longest_orf(seq))

def tearDown(self):
self.fafile=None

2 changes: 1 addition & 1 deletion trackcluster/batch.py
@@ -51,7 +51,7 @@ def process_one_junction_corrected_try(key, full=False, batchsize=500):
if bigg_nano_raw is None:
return 0

### add the bactsize part
### add the batch size part
n_count = 100
n = 0
bigg_nano = junction_pre(bigg_nano_raw, bigg_gff)
75 changes: 0 additions & 75 deletions trackcluster/clustercj.py
@@ -147,81 +147,6 @@ def split_site(junction_array):
return groups


def _is_junction_inside(j1, j2):
"""
test if j1 is in j2, change into comp all junctions except single exon
still very slow, unused
:param j1: junction numpy array, with all aligned part
:param j2:
:return: 0 for contain, 3 for differ, 1 for j1 in j2, -1 for j2 in j1, "e" for error
"""
# do not consider the single exon track/read
# first to simplify the two unions from the large union

if numpy.array_equal(j1,j2):
return 0

# select the union of the junction list
pos_simple=[pos for pos, value in enumerate(j1+j2) if value>=1]
j1_s=j1[pos_simple]
j2_s=j2[pos_simple]

#print(j1_s)
#print(j2_s)

if numpy.sum(j1_s)==0 or numpy.sum(j2_s)==0:
return "is_junction_inside error: single exon"

j12_del=j1_s-j2_s
#freq_dic = get_array_freq(j12_del)

# filter the easy differ ones

#if freq_dic[-1] > 0 and freq_dic[1] > 0: # has more and less junction, must be 3
# return 3

# consider 3 types using group_l, this will be slower
group_l=split_site(j12_del)
#print(group_l)

# did not consider the len=0 and len=1
# single exon and equal junction should have been excluded
if len(group_l)>=3:
return 3
elif len(group_l)==2: # should be all 0, 1 or all 0 ,-1
# need to consider 5' or 3' missing
# 5' is belong, 3' is differ
a,b=group_l
if -1<a[0]+b[0]<1: # should not happen for [-1,-1], [1,1], but no overalp
return 3
if a[0]+ b[0]>=1: #j1_s > j2_s
if b[0]==0: # 5' missing
return -1
else: # 3' missing
return 3

elif a[0]+ b[0]<=-1: #j1_s < j2_s
if b[0]==0: # 5' missing
return 1
else: # 3' missing
return 3


def __compare_junction(j1, j2):
"""
Two junctions in j1, j2, generated by get_read_junction_dic
:param j1: numpy array, contian 0 or 1 only, the position is well aligned for all tracks,
:param j2:
:return: use 0:"equal", 1:"in", 2:"contain", 3:"diff"
does not conserider single exon tracks
"""
# do not consider the single exon track/read

# single junction first
pass




def get_corrected_dic(site_cov_dic, cov_cutoff=2, pos_cutoff=10):
"""