
Commit

Merge pull request #15 from Runsheng/dev
Dev pull
Runsheng authored Oct 15, 2024
2 parents 62fccba + 45fcb93 commit fcb2de4
Showing 16 changed files with 256 additions and 257 deletions.
5 changes: 4 additions & 1 deletion Readme.md
@@ -11,7 +11,10 @@ Trackcluster is an isoform calling and quantification pipeline for long RNA/cDNA
- [Walkthrough](#walkthrough)

####
Hint: the ongoing development can be found in the ["dev"](https://github.com/Runsheng/trackcluster/tree/dev) branch.
The ongoing development can be found in the ["dev"](https://github.com/Runsheng/trackcluster/tree/dev) branch.
#### TODO:
1. Fix the "fusion" classification.
2. Improve the speed of clusterj.
3. Splicing leader/5' indicator finding using pyssw.
4. Add a function to get the CDS start (maybe the ATG) and the CDS end.

## <a name="overview"></a>Overview
A pipeline for reference-based isoform identification and quantification using long reads. This pipeline was designed to use **only** long and noisy reads to build a valid transcriptome. An indicator of an intact 5' end, e.g. the splicing leader in nematode mRNAs, is very helpful to the pipeline.
27 changes: 27 additions & 0 deletions doc/index.md
@@ -0,0 +1,27 @@
# TrackCluster
![PyPI](https://img.shields.io/pypi/v/trackcluster?color=green)

Trackcluster is an isoform calling and quantification pipeline for long RNA/cDNA reads.


A walkthrough for using trackcluster for isoform calling on worm/mouse/human/virus datasets.

## Preprocessing
1. Long read QC

Some basic characteristics of the reads should be checked before further analysis: the read length and mapped read length, the estimated read quality, the mapping quality, and the read mapping rate (the mapping rate and mapped length can also be checked with the pysam sketch after the mapping step below).

We recommend the following tool for long-read QC:

Giraffe: https://github.com/lrslab/Giraffe_View


2. Mapping Nanopore long reads to the genome

Mapping is the first step of isoform calling. We recommend minimap2 for mapping long reads to the genome.
- Mapping reads to a eukaryotic genome
```bash
minimap2 -ax splice -uf -k14 --secondary=no --MD -t 8 ref.fa read.fq.gz | samtools view -bS - | samtools sort -o read.bam -
```
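
As a quick check of the mapping rate and mapped read length mentioned in step 1, the sorted BAM from the command above can be summarised with pysam. This is only a minimal sketch; pysam, the `read.bam` file name, and the exact statistics printed are assumptions for illustration, not part of trackcluster:

```python
import pysam  # assumed to be installed; not a trackcluster module

def mapping_stats(bam_path="read.bam"):
    """Report the read count, mapping rate, and mean mapped read length."""
    total, mapped, mapped_bases = 0, 0, 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(until_eof=True):
            # count each read once: skip secondary/supplementary records
            if read.is_secondary or read.is_supplementary:
                continue
            total += 1
            if not read.is_unmapped:
                mapped += 1
                mapped_bases += read.query_alignment_length
    print("reads:", total)
    print("mapping rate: %.3f" % (mapped / total if total else 0))
    print("mean mapped read length: %.1f" % (mapped_bases / mapped if mapped else 0))

if __name__ == "__main__":
    mapping_stats("read.bam")
```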
33 changes: 33 additions & 0 deletions doc/source/QC.md
@@ -0,0 +1,33 @@
# Long read QC for RNA reads
The read length and the ratio of read length to the real transcript length (the full-length ratio) are essential for downstream analysis. For instance, we do not suggest isoform calling when the full-length ratio is lower than 10%; read counting, however, can still be done.

### Basic statistics

Some basic characteristics of the reads should be checked before further analysis: the read length and mapped read length, the estimated read quality, the mapping quality, and the read mapping rate.

We recommend the following tool for long-read QC:

Giraffe: https://github.com/lrslab/Giraffe_View

### Full length read ratio estimation

1. Common case, with no 5' indicator.

Sequencing of most long reads starts from the 3' end (the polyA site), so a 5' indicator, such as a splicing leader or an artificial sequence added after de-capping, can tell whether a read is likely to be full length. For most sequencing data no such indicator is available; in that case we call a read full length when its length exceeds 95% of the length of its reference transcript, and use the fraction of such reads as the full-length ratio (see the first sketch after this list).

2. Special case, with a 5' indicator.
- Splicing leaders: some species, such as C. elegans, use both cis- and trans-splicing. About 80% of transcripts are trans-spliced and carry a splicing leader at the 5' end of the mRNA read; the retained splicing leader is a ~22 nt sequence hanging at the 5' end of the read. In this case the splicing leader can be used to estimate the full-length ratio (see the second sketch after this list).
- Artificial sequences: some library preparations add an artificial sequence after de-capping, and this sequence can mark a full-length read. For example, Cappable-seq reads are generated by removing the 5' G cap and ligating a 5' artificial sequence. Older approaches such as 5' RACE also give the user a 5' sequence that marks the start of a transcript.
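
A minimal sketch of the >95% cutoff from case 1, assuming the reads have been mapped to a reference transcriptome (so that each BAM reference sequence is one transcript) and that pysam is available; the file name `reads_vs_transcripts.bam` and the helper function are hypothetical:

```python
import pysam  # assumed dependency

def full_length_ratio(bam_path, cutoff=0.95):
    """Fraction of primary alignments covering more than `cutoff` of their reference transcript."""
    total, full = 0, 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(until_eof=True):
            if read.is_unmapped or read.is_secondary or read.is_supplementary:
                continue
            total += 1
            transcript_len = bam.get_reference_length(read.reference_name)
            if read.query_alignment_length / float(transcript_len) > cutoff:
                full += 1
    return full / total if total else 0.0

print(full_length_ratio("reads_vs_transcripts.bam"))  # hypothetical file name
```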


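For case 2, a toy sketch of splicing-leader detection by suffix matching near the 5' end of oriented reads; the SL1 sequence, the window and match-length cutoffs, and the example reads are assumptions for illustration (the TODO in the Readme mentions pyssw for a proper Smith-Waterman search):

```python
# Commonly cited C. elegans SL1 sequence (~22 nt); treated as an assumption here.
SL1 = "GGTTTAATTACCCAAGTTTGAG"

def has_splice_leader(read_seq, leader=SL1, window=60, min_match=15):
    """True if a 3' suffix of the leader (>= min_match nt) occurs near the read 5' end.

    Assumes the read is already oriented 5'->3'; truncated leaders keep only
    the 3' part of the SL, hence the suffix matching.
    """
    head = read_seq[:window].upper()
    for i in range(len(leader) - min_match + 1):
        if leader[i:] in head:
            return True
    return False

# Hypothetical reads, for illustration only
reads = {
    "read1": "GGTTTAATTACCCAAGTTTGAGATGTCTAAAGGTTACGGA",  # leader present
    "read2": "ATGTCTAAAGGTTACGGACCGTTGAAACTTGGTGGATTCA",  # no leader
}
n_sl = sum(has_splice_leader(seq) for seq in reads.values())
print("reads with a splicing leader at the 5' end:", n_sl, "/", len(reads))
```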

19 changes: 19 additions & 0 deletions doc/source/design.md
@@ -0,0 +1,19 @@
# Trackcluster design
## History of ONT read accuracy and trackcluster
Direct-RNA sequencing from ONT used to have very low quality, especially for non-human samples: the raw read accuracy is roughly 85% for the RNA001 kit on an R9.4.1 flowcell. With a 15% error rate the read junctions are also error-prone, which makes junction-based isoform calling and quantification very difficult. To accommodate such low-quality reads, trackcluster was designed to compare the intersection of exon/intron regions to determine the distance between reads, and to correct the junctions only after clustering all similar reads from one isoform.

The regional-intersection method (the original trackcluster) worked well for RNA001/002 data in model organisms such as _C. elegans_ and _Arabidopsis thaliana_, because the read count per isoform is limited: the total yield of one flowcell is around 1-2 Gb, the overall yield of one experiment is usually below 10 Gb, and even highly expressed genes generally have fewer than 50,000 reads. However, the method is not fast enough, because the intersection has to be calculated for every pair of overlapping reads. The time complexity is O(n^c), which is not acceptable for genes with more than 50,000 reads (the computation may take 24 h).

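To make the regional-intersection idea concrete, the sketch below computes an intersection-over-union similarity between the exon regions of two reads. It is an illustration of the concept only, not trackcluster's implementation, and the interval lists are made up:

```python
def exon_overlap_similarity(exons1, exons2):
    """Intersection-over-union of two exon interval lists.

    Each argument is a list of (start, end) genomic intervals; this is a toy
    per-base implementation of the idea, not trackcluster's own code.
    """
    def to_bases(exons):
        bases = set()
        for start, end in exons:
            bases.update(range(start, end))
        return bases

    b1, b2 = to_bases(exons1), to_bases(exons2)
    union = b1 | b2
    return len(b1 & b2) / len(union) if union else 0.0

# Two reads from the same gene: nearly identical exon chains score close to 1.
read_a = [(100, 200), (300, 400), (500, 650)]
read_b = [(105, 200), (300, 400), (500, 640)]
print(exon_overlap_similarity(read_a, read_b))  # ~0.96
```

Computing such a similarity for every overlapping read pair is what makes the original method slow for highly expressed genes.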

With the newer RNA002 kit and basecalling models, the read accuracy has improved to about 92% for most samples, and an 8% (instead of >15%) error rate allows the junctions to be self-corrected before clustering. We therefore added junction self-correction and a junction-based clustering method (the trackclusterj method). The time complexity of trackclusterj is roughly O(n log n) in most cases, which is acceptable for highly expressed genes.
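
Once the junctions have been corrected, reads from the same isoform share the same junction chain, so clustering can largely be reduced to grouping reads by that chain. The sketch below is a deliberately simplified illustration of this idea (the real method also has to handle truncated reads whose chain is a sub-chain of another's); the read names and coordinates are made up:

```python
from collections import defaultdict

def cluster_by_junction_chain(read_junctions):
    """Group reads whose corrected junction chains are identical.

    read_junctions: dict mapping read name -> tuple of junction coordinates.
    Grouping by a hashable key avoids comparing every pair of reads, which is
    why the junction-based approach scales roughly as O(n log n) or better.
    """
    clusters = defaultdict(list)
    for name, chain in read_junctions.items():
        clusters[chain].append(name)
    return clusters

reads = {
    "read1": (200, 300, 400, 500),
    "read2": (200, 300, 400, 500),  # same corrected chain as read1
    "read3": (200, 300, 450, 500),  # alternative junction
}
print(cluster_by_junction_chain(reads))
```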
58 changes: 58 additions & 0 deletions script/bigg2seq.py
@@ -0,0 +1,58 @@
#!/usr/bin/env python
#-*- coding: utf-8 -*-
# @Time : 19/2/2024 2:14pm
# @Author : Runsheng
# @File : bigg2seq.py

"""
This script writes sequences derived from the exon information in a bigg bed file.
It can output the transcript sequence, the CDS sequence, or an Ensembl-like
EXON(uppercase)+intron(lowercase or N) sequence.
"""
import argparse
import os, sys, inspect

# make the repository root importable before importing trackcluster modules
currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(os.path.dirname(currentdir))
sys.path.insert(0, parentdir)

from trackcluster.utils import fasta2dic
from trackcluster.tracklist import read_bigg, write_bigg

parser=argparse.ArgumentParser()
parser.add_argument("-b", "--biggfile",
help="the bigg bed file")
parser.add_argument("-r", "--reference",
help="the genome reference")
parser.add_argument("-o", "--out", default="bigg.fasta",
help="the output file name, default is bigg.fasta")
parser.add_argument("-m", "--mode", default="exon",
                    help="the format of the output: 'exon', 'cds', "
                         "or 'ensembl' (Ensembl-like EXON(uppercase)+intron(lowercase) sequence); "
                         "default is 'exon'")

args = parser.parse_args(args=None if sys.argv[1:] else ['--help'])

# make a file using the functions
outfile=args.out
refdic=fasta2dic(args.reference)

bigg_l=read_bigg(args.biggfile)
with open(outfile, "w") as fw:
    if args.mode=="exon":
        for bigg in bigg_l:
            bigg.get_exon()
            bigg.bind_chroseq(refdic, gap=0, intron=False)
            # write one fasta record per track
            fw.write(">"+bigg.name+"\n"+bigg.seq_chro+"\n")
    elif args.mode=="cds":
        for bigg in bigg_l:
            bigg.get_exon()
            bigg.get_cds()
            fw.write(">"+bigg.name+"\n"+bigg.seq_cds+"\n")
    else:
        # the 'ensembl' mode (EXON uppercase + intron lowercase) described in the
        # docstring is not implemented here yet
        sys.exit("mode '%s' is not implemented" % args.mode)
8 changes: 6 additions & 2 deletions script/trackrun.py
@@ -174,6 +174,8 @@ def cluster(self):
if args.prefix is None:
args.prefix=get_file_prefix(args.sample,sep=".")

find_novelgene=False if args.find_novelgene=="no" else True

#args = parser.parse_args(args=None if sys.argv[2:] else ['--help'])

flow_cluster_all_gene_novel(wkdir=args.folder,
@@ -188,7 +190,7 @@ def cluster(self):
intronweight=args.intronweight,
cutoff1=args.cutoff1,
cutoff2=args.cutoff2,
find_novelgene=args.find_novelgene)
find_novelgene=find_novelgene)


def clusterj(self):
@@ -224,6 +226,8 @@ def clusterj(self):
if args.prefix is None:
args.prefix=get_file_prefix(args.sample,sep=".")

find_novelgene=False if args.find_novelgene=="no" else True

flow_clusterj_all_gene_novel(wkdir=args.folder,
prefix=args.prefix,
nano_bed=args.sample,
@@ -233,7 +237,7 @@
f2=args.intersect2,
count_cutoff=args.count,
batchsize=args.batchsize,
find_novelgene=args.find_novelgene)
find_novelgene=find_novelgene)

def count(self):
parser = argparse.ArgumentParser(
2 changes: 1 addition & 1 deletion test/clustercj_test.py
@@ -80,7 +80,7 @@ def test_get_read_junction_dic(self):
print((df["1f940a05-24a1-4455-a506-b1aa04caf81a"]-df["087b20ed-78ba-48f9-a038-f23a9c4b75c6"]))
print((df["087b20ed-78ba-48f9-a038-f23a9c4b75c6"]-df["1f940a05-24a1-4455-a506-b1aa04caf81a"]))

def test_get_arrry_freq_is_junction_inside(self):
def __test_get_arrry_freq_is_junction_inside(self):
self.site_dic=get_junction_dic(self.bigg)
df=get_read_junction_dic(self.bigg, self.site_dic)
#j1=df["087b20ed-78ba-48f9-a038-f23a9c4b75c6"]
11 changes: 7 additions & 4 deletions test/clusterj_test.py
@@ -3,6 +3,7 @@
# @Time : 8/9/2018 10:53 AM
# @Author : Runsheng
# @File : cluster_test.py
import os

from trackcluster.clusterj import *
from trackcluster.tracklist import *
@@ -32,9 +33,10 @@ def setUp(self):
self.bigg_nano=bigg_nano
self.bigg_gff=bigg_gff
self.bigg=bigg_nano+bigg_gff
self.gene=gene


def test_get_corrected_junction(self):
def __test_get_corrected_junction(self):
import random
from copy import deepcopy
#bigg=self.bigg[0]
@@ -78,11 +80,12 @@ def test_is_bigg1_inside_bigg2_junction(self):

def test_junction_simple_merge(self):

print(len(self.bigg))
print(self.gene, "track size is ", len(self.bigg))
bigg_n=junction_simple_merge(self.bigg)

print(len(bigg_n))
write_bigg(bigg_n, "./test/genes/AT2G43410/tt.bed")
print(os.getcwd())
write_bigg(bigg_n, "./genes/{gene}/tt.bed".format(gene=self.gene))

def test_is_single_exon_in(self):
# get single and note single
@@ -110,7 +113,7 @@ def test_flow_junction_cluster(self):
#write_bigg(bigg_subread, "./test/genes/AT2G43410/AT2G43410_subread.bed")

def test_bigg_get_namedic(self):
bigg_list=read_bigg("genes/AT2G43410/AT2G43410_subread.bed")
bigg_list=read_bigg("./genes/AT2G43410/AT2G43410_nano.bed")
name_dic=bigg_get_namedic(bigg_list)
print(name_dic)

10 changes: 6 additions & 4 deletions test/track_test.py
@@ -148,14 +148,16 @@ def test_orfs(self):
can only run with ce10 ref
"""
# ref_dic is a large external file
ref_dict=fasta2dic("/data/reference/ce10_ucsc.fa")
ref_dict=fasta2dic("/data/reference/ce11.fa")
bigg_one=self.bigg[30]
print(bigg_one)
bigg_one.bind_chroseq(ref_dict, gap=0, intron=False)
print((bigg_one.seq_chro))
ans=bigg_one.find_orfs_with_trans()
bigg_one.orf_find(refdic=ref_dict)
print((bigg_one.seq_mrna))

print(ans)
print(bigg_one)
print()
print(bigg_one.thickStart, bigg_one.thickEnd)

def test_orf(self):
pass
4 changes: 4 additions & 0 deletions test/utils_test.py
@@ -50,6 +50,10 @@ def test_detail_file(self):
logger.debug("This is just a test, print number 100 as float" + "," + str(float(1) * 100))
logger.debug("In total %d line, %s" % (2, "just a test number"))

def test_find_longest_orf(self):
seq="AAATGAACTTGACGAGTAATGAG"
print(find_longest_orf(seq))

def tearDown(self):
self.fafile=None

2 changes: 1 addition & 1 deletion trackcluster/batch.py
@@ -51,7 +51,7 @@ def process_one_junction_corrected_try(key, full=False, batchsize=500):
if bigg_nano_raw is None:
return 0

### add the bactsize part
### add the batch size part
n_count = 100
n = 0
bigg_nano = junction_pre(bigg_nano_raw, bigg_gff)
75 changes: 0 additions & 75 deletions trackcluster/clustercj.py
@@ -147,81 +147,6 @@ def split_site(junction_array):
return groups


def _is_junction_inside(j1, j2):
"""
test if j1 is in j2, change into comp all junctions except single exon
still very slow, unused
:param j1: junction numpy array, with all aligned part
:param j2:
:return: 0 for contain, 3 for differ, 1 for j1 in j2, -1 for j2 in j1, "e" for error
"""
# do not consider the single exon track/read
# first to simplify the two unions from the large union

if numpy.array_equal(j1,j2):
return 0

# select the union of the junction list
pos_simple=[pos for pos, value in enumerate(j1+j2) if value>=1]
j1_s=j1[pos_simple]
j2_s=j2[pos_simple]

#print(j1_s)
#print(j2_s)

if numpy.sum(j1_s)==0 or numpy.sum(j2_s)==0:
return "is_junction_inside error: single exon"

j12_del=j1_s-j2_s
#freq_dic = get_array_freq(j12_del)

# filter the easy differ ones

#if freq_dic[-1] > 0 and freq_dic[1] > 0: # has more and less junction, must be 3
# return 3

# consider 3 types using group_l, this will be slower
group_l=split_site(j12_del)
#print(group_l)

# did not consider the len=0 and len=1
# single exon and equal junction should have been excluded
if len(group_l)>=3:
return 3
elif len(group_l)==2: # should be all 0, 1 or all 0 ,-1
# need to consider 5' or 3' missing
# 5' is belong, 3' is differ
a,b=group_l
if -1<a[0]+b[0]<1: # should not happen for [-1,-1], [1,1], but no overalp
return 3
if a[0]+ b[0]>=1: #j1_s > j2_s
if b[0]==0: # 5' missing
return -1
else: # 3' missing
return 3

elif a[0]+ b[0]<=-1: #j1_s < j2_s
if b[0]==0: # 5' missing
return 1
else: # 3' missing
return 3


def __compare_junction(j1, j2):
"""
Two junctions in j1, j2, generated by get_read_junction_dic
:param j1: numpy array, contian 0 or 1 only, the position is well aligned for all tracks,
:param j2:
:return: use 0:"equal", 1:"in", 2:"contain", 3:"diff"
does not conserider single exon tracks
"""
# do not consider the single exon track/read

# single junction first
pass




def get_corrected_dic(site_cov_dic, cov_cutoff=2, pos_cutoff=10):
"""