Learning objectives:
* Learn what transcriptome assembly is
* Learn to differentiate different types of assemblies
* Discuss how do assemblers work
* Learn to check the quality of a transcriptome assembly
In variant calling, we mapped reads to a reference and looked systematically for differences.
But what if you don't have a reference? How do you construct one?
The answer is de novo assembly, and the basic idea is you feed in your reads and you get out a bunch of contigs, that represent stretches of RNA present in the reads that don't have any long repeats or much significant polymorphism. Like everything else, the basic idea is that you run a program, feed in the reads, and get out a pile of assembled RNA.
Trinity, developed at the Broad Institute and the [Hebrew University of Jerusalem](http://www.cs.huji.ac.il/, represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data. We will be using the eel-pond protocol, which uses trinity, as our guide to doing RNA-seq assembly.
Boot an m1.medium Jetstream instance and log in.
Create and environment and install the Trinity assembler through conda
:
conda create -y -n trinity-env trinity khmer
where:
-y
says to install the pkgs without double checking with me-n
names the environment "trinity-env"
Now, let's activate that environment:
conda activate trinity-env
We will be using the same data as before (Schurch et al, 2016), so the following commands will create a new folder assembly
and link the trimmed data we prepared earlier in the newly created folder:
cd ~
ls -l quality/*qc* .
You should see the following:
-rw-rw-r-- 1 dibada dibada 59465832 Jul 11 15:45 quality/ERR458493.qc.fq.gz
-rw-rw-r-- 1 dibada dibada 58503534 Jul 11 15:46 quality/ERR458494.qc.fq.gz
-rw-rw-r-- 1 dibada dibada 58054460 Jul 11 15:46 quality/ERR458495.qc.fq.gz
-rw-rw-r-- 1 dibada dibada 102164315 Jul 11 15:44 quality/ERR458500.qc.fq.gz
-rw-rw-r-- 1 dibada dibada 101187090 Jul 11 15:44 quality/ERR458501.qc.fq.gz
-rw-rw-r-- 1 dibada dibada 100550095 Jul 11 15:45 quality/ERR458502.qc.fq.gz
If you don't see the data, you can download it using the following commands:
cd ~
mkdir -p quality
cd quality
curl -L https://osf.io/wfz34/download -o ERR458493.qc.fq.gz
curl -L https://osf.io/jxh4d/download -o ERR458494.qc.fq.gz
curl -L https://osf.io/zx7n3/download -o ERR458495.qc.fq.gz
curl -L https://osf.io/96mrj/download -o ERR458500.qc.fq.gz
curl -L https://osf.io/wc8yn/download -o ERR458501.qc.fq.gz
curl -L https://osf.io/sdtz3/download -o ERR458502.qc.fq.gz
cd ~/
mkdir -p assembly
cd assembly
ln -fs ~/quality/*.qc.fq.gz .
ls
In this section, we’ll apply digital normalization and variable-coverage k-mer abundance trimming to the reads prior to assembly. This has the effect of reducing the computational cost of assembly without negatively affecting the quality of the assembly. Although the appropriate approach would be to use all 6 samples, for time consideration we will be using just the first one, i.e. ERR458493.qc.fq.gz
normalize-by-median.py --ksize 20 --cutoff 20 -M 10G --savegraph normC20k20.ct --force_single ERR458493.qc.fq.gz
This tools works mainly for paired-end reads, combined with a one file containing single-end reads. Given that all our samples are single end, we'll use the --force_single
flag to force all reads to be considered as single-end. The parameter --cutoff
indicates that when the median k-mer coverage level is above this number the read is not kept. Also note the -M
parameter. This specifies how much memory diginorm should use, and should be less than the total memory on the computer you’re using. (See choosing hash sizes for khmer for more information.
(This step should take about 2-3 minutes to complete)
Now, run through all the reads and trim off low-abundance parts of high-coverage reads
filter-abund.py --threads 6 --variable-coverage --normalize-to 18 normC20k20.ct *.keep
The parameter --variable-coverage
requests that only trim low-abundance k-mers from sequences that have high coverage. The parameter --normalize-to
bases the variable-coverage cutoff on this median k-mer abundance.
(This step should take about 2-3 minutes to complete)
Trinity works both with paired-end reads as well as single-end reads (including simultaneously both types at the same time). In the general case, the paired-end files are defined as --left left.fq
and --right right.fq
respectively. If assembling from single-end reads (as in this case), we use the flag --single
.
First of all though, we need to make sure that there are no whitespaces in the header of the input fastq file. This is done using the following command:
cat ERR458493.qc.fq.gz.keep.abundfilt | tr -d ' ' > ERR458493.qc.fq.gz.keep.abundfilt.clean
So let's run the assembler as follows:
time Trinity --seqType fq --max_memory 10G --CPU 4 --single ERR458493.qc.fq.gz.keep.abundfilt.clean --output yeast_trinity
(This will take about 20 minutes)
You should see something like:
All commands completed successfully. :-)
** Harvesting all assembled transcripts into a single multi-fasta file...
Thursday, July 11, 2019: 17:15:47 CMD: find /home/dibada/assembly/yeast_trinity/read_partitions/ -name '*inity.fasta' | /opt/miniconda/envs/trinity-env/opt/trinity-2.8.5/util/support_scripts/partitioned_trinity_aggregator.pl --token_prefix TRINITY_DN --output_prefix /home/dibada/assembly/yeast_trinity/Trinity.tmp
-relocating Trinity.tmp.fasta to /home/dibada/assembly/yeast_trinity/Trinity.fasta
Thursday, July 11, 2019: 17:15:47 CMD: mv Trinity.tmp.fasta /home/dibada/assembly/yeast_trinity/Trinity.fasta
###################################################################
Trinity assemblies are written to /home/dibada/assembly/yeast_trinity/Trinity.fasta
###################################################################
Thursday, July 11, 2019: 17:15:47 CMD: /opt/miniconda/envs/trinity-env/opt/trinity-2.8.5/util/support_scripts/get_Trinity_gene_to_trans_map.pl /home/dibada/assembly/yeast_trinity/Trinity.fasta > /home/dibada/assembly/yeast_trinity/Trinity.fasta.gene_trans_map
at the end.
First, save the assembly:
cp yeast_trinity/Trinity.fasta yeast-transcriptome-assembly.fa
Now, look at the beginning:
head yeast-transcriptome-assembly.fa
It's RNA! Yay!
Let's capture also some statistics of the Trinity assembly. Trinity provides a handy tool to do exactly that:
TrinityStats.pl yeast-transcriptome-assembly.fa
The output should look something like the following:
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 3296
Total trinity transcripts: 3320
Percent GC: 42.02
########################################
Stats based on ALL transcript contigs:
########################################
Contig N10: 1420
Contig N20: 1069
Contig N30: 802
Contig N40: 634
Contig N50: 514
Median contig length: 321
Average contig: 448.98
Total assembled bases: 1490618
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################
Contig N10: 1385
Contig N20: 1021
Contig N30: 787
Contig N40: 618
Contig N50: 502
Median contig length: 319
Average contig: 442.56
Total assembled bases: 1458676
This is a set of summary stats about your assembly. Are they good? Bad? How would you know?