-
Notifications
You must be signed in to change notification settings - Fork 51
Running SQANTI3 Quality Control
-
Selecting the metric used to aggregate TSS ratio across samples
-
IsoAnnotLite and SQANTI3: using an existing tappAS GFF3 annotation file
Before running SQANTI3, you will need to:
Activate the SQANTI3 conda environment:
(base)-bash-4.1$ conda activate SQANTI3.env
(SQANTI3.env)-bash-4.1$
The SQANTI3 quality control script accepts the following arguments:
usage: sqanti3_qc.py [-h] [--min_ref_len MIN_REF_LEN] [--force_id_ignore]
[--aligner_choice {minimap2,deSALT,gmap,uLTRA}]
[--CAGE_peak CAGE_PEAK] [--polyA_motif_list POLYA_MOTIF_LIST]
[--polyA_peak POLYA_PEAK] [--phyloP_bed PHYLOP_BED]
[--skipORF] [--is_fusion] [--orf_input ORF_INPUT]
[--fasta] [-e EXPRESSION] [-x GMAP_INDEX]
[-t CPUS] [-n CHUNKS] [-o OUTPUT] [-d DIR]
[-c COVERAGE] [-s SITES] [-w WINDOW] [--genename]
[-fl FL_COUNT] [-v] [--saturation]
[--report {html,pdf,both,skip}]
[--isoAnnotLite] [--gff3 GFF3]
[--short_reads SHORT_READS] [--SR_bam SR_BAM]
[--isoform_hits]
[--ratio_TSS_metric {max,mean,median,3quartile}]
isoforms annotation genome
Structural and Quality Annotation of Novel Transcript Isoforms
positional arguments:
isoforms Isoforms (FASTA/FASTQ) or GTF format. It is recommended to
provide them in GTF format, but if it is needed to map the sequences
to the genome use a FASTA/FASTQ file with the
--fasta option.
annotation Reference annotation file (GTF format)
genome Reference genome (Fasta format)
options:
-h, --help show this help message and exit
--min_ref_len MIN_REF_LEN
Minimum reference transcript length (default: 0 bp)
If supplied, all isoforms matching reference < min_ref_len
are recorded to *_omitted_due_to_min_ref_len.txt file
--force_id_ignore Allow the usage of transcript IDs non related with
PacBio's nomenclature (PB.X.Y)
--aligner_choice {minimap2,deSALT,gmap,uLTRA}
--CAGE_peak CAGE_PEAK
FANTOM5 Cage Peak (BED format, optional)
--polyA_motif_list POLYA_MOTIF_LIST
Ranked list of polyA motifs (text, optional)
--polyA_peak POLYA_PEAK
PolyA Peak (BED format, optional)
--phyloP_bed PHYLOP_BED
PhyloP BED for conservation score (BED, optional)
--skipORF Skip ORF prediction (to save time)
--is_fusion Input are fusion isoforms, must supply GTF as input
--orf_input ORF_INPUT
Input fasta to run ORF on. By default, ORF is run on genome-corrected
fasta - this overrides it. If input is fusion (--is_fusion), this must be provided for ORF prediction.
--fasta Use when running SQANTI by using as input a FASTA/FASTQ with the sequences of isoforms
-e EXPRESSION, --expression EXPRESSION
Expression matrix (supported: Kallisto tsv)
-x GMAP_INDEX, --gmap_index GMAP_INDEX
Path and prefix of the reference index created by gmap_build. Mandatory
if using GMAP unless -g option is specified.
-t CPUS, --cpus CPUS Number of threads used during alignment by aligners. (default: 10)
-n CHUNKS, --chunks CHUNKS
Number of chunks to split SQANTI3 analysis in for speed up (default: 1).
-o OUTPUT, --output OUTPUT
Prefix for output files.
-d DIR, --dir DIR Directory for output files. Default: Directory where the script was run.
-c COVERAGE, --coverage COVERAGE
Junction coverage files (provide a single file, comma-delmited filenames,
or a file pattern, ex: "mydir/*.junctions").
-s SITES, --sites SITES
Set of splice sites to be considered as canonical (comma-separated list of splice
sites). Default: GTAG,GCAG,ATAC.
-w WINDOW, --window WINDOW
Size of the window in the genomic DNA screened for Adenine content downstream of TTS
--genename Use gene_name tag from GTF to define genes. Default: gene_id used to define genes
-fl FL_COUNT, --fl_count FL_COUNT
Full-length PacBio abundance file
-v, --version Display program version number.
--saturation Include saturation curves into report
--report {html,pdf,both,skip}
select report format --html --pdf --both --skip
--isoAnnotLite Run isoAnnot Lite to output a tappAS-compatible gff3 file
--gff3 GFF3 Precomputed tappAS species specific GFF3 file. It will serve as reference
to transfer functional attributes
--short_reads SHORT_READS
File Of File Names (fofn, space separated) with paths to FASTA or FASTQ from
Short-Read RNA-Seq. If expression or
coverage files are not provided, Kallisto (just for pair-end
data) and STAR, respectively, will be run to calculate them.
--SR_bam SR_BAM Directory or fofn file with the sorted bam files of Short Reads RNA-Seq mapped
against the genome
--isoform_hits Report all FSM/ISM isoform hits in a separate file
--ratio_TSS_metric {max,mean,median,3quartile}
Define which statistic metric should be reported in the ratio_TSS column
This are the minimal files that you will need to run SQANTI3 QC:
-
Long read-defined transcriptome. It can be obtained after processing data from any of the available Third Generation Sequencing techonologies like Iso-Seq (PacBio) or Oxford Nanopore (ONT). SQANTI3 is compatible with the output of any long read-based transcriptome building pipeline, such as IsoSeq3, TALON, or FLAIR. SQANTI3 accepts it in several formats such as FASTA, FASTQ and GTF:
- GTF (default): by default, SQANTI3 expects the transcriptome to be provided as a GTF file, and we recommend to stick to this format if your transcriptome construction pipeline allows it.
-
FASTA/FASTQ: if you provide your transcript sequences in FASTA or FASTQ format, you will also need to supply the
--fasta
option. In this case, a mapping step will be initially performed. SQANTI3 currently supports minimap2 (default), uLTRA, deSALT and GMAP for mapping. To select a mapper, just supply the name to the--aligner_choice
argument.
-
Reference transcriptome annotation in GTF format. This file will be used as reference to describe the amount of novelty across long read transcripts. Some examples of reference transcriptomes for different species can be found in GENCODE or CHESS.
-
Reference genome in FASTA format, for example, hg38. Before running SQANTI3, make sure of the following:
- The transcriptome annotation GTF must match the provided reference genome.
- The chromosome/scaffolds names must be same in the reference annotation and the reference genome.
WARNINGS:
-
Before running SQANTI3, it is strongly recommended to collapse redundant transcript sequences using cDNA_Cupcake or TAMA. To learn more about this, we suggest taking a look at our recommended long read processing workflow.
-
Note that, by default, SQANTI3 expects PacBio-formatted IDs (i.e. PB.XX.XX). Users whose data was not processed using IsoSeq3 should add the
--force_id_ignore
flag to override this behavior.
-
Short read data: SQANTI3 QC uses short-read data to validate several aspects of long read defined transcripts, i.e. to compute gene/isoform expression, junction coverage and TSS ratio. To learn more about the different ways in which short reads can be supplied to SQANTI3 QC, see Provding short reads section.
-
CAGE Peak data: In SQANTI2, a CAGE peak database for the hg38 genome was provided, originally retrieved from FANTOM5. Now, we recommend to use data from refTSS. To ease usage, some versions of this annotation for several species have been uploaded to the
data/ref_TSS_annotation
folder, and we will try to keep them updated for new genome releases. -
polyA motif list: A ranked list of polyA motifs to find upstream of the 3' end site. See human polyA list for an example. We have included a list of motifs that can be used for mouse and human data at the
data/polyA_motifs/
folder. Please, if you know which are the most likely polyA motifs for other type of organism/clade, we would highly appreciate if you let us know. -
FL count information: See FL count section to include Iso-Seq FL count information for each isoform.
-
tappAS-annotation file: when the
--isoAnnotLite
flag is activated, a GFF3 file containing isoform-level functional annotations for a reference transcriptome (e.g. Ensembl) can also be supplied via the--gff3
flag. -
Intropolis junction BED file: In previous versions of SQANTI, it was provided a version of Intropolis for hg38 genome, modified into STAR junction format which is still valid.
If you don't feel like running the ORF prediction part, use --skipORF
. Just know that all your transcripts will be annotated as non-coding.
If you have short read data, you can run STAR to get the junction file (usually called SJ.out.tab
, see STAR manual) and supply it to SQANTI3 as is. If you have several samples, it is recommended to provide them as separated *SJ.out.tab
files.
By way of example, the following command was used to run SQANTI3 with the example
data. The input files are a subdivision of the Universal Human Reference (UHR) IsoSeq data that corresponds just to those polished and collapsed sequences located at chromosome 22.
python sqanti3_qc.py example/UHR_chr22.gtf example/gencode.v38.basic_chr22.gtf example/GRCh38.p13_chr22.fasta \
--CAGE_peak data/ref_TSS_annotation/human.refTSS_v3.1.hg38.bed \
--polyA_motif_list data/polyA_motifs/mouse_and_human.polyA_motif.txt \
-o UHR_chr22 -d example/SQANTI3_output -fl example/UHR_abundance.tsv \
--short_reads example/UHR_chr22_short_reads.fofn --cpus 4 --report both
It is highly recommended to run SQANTI3 using a GTF file with the Long-Read defined isoforms instead of input FASTA/FASTQ file with their sequences. If you ran Cupcake collapse, the obtained *collapsed.gff
file is actually in GFF2 format (equivalent to GTF), so you can use it as is. If you provide the sequences, a mapping step will be needed and the decisions took by the mapping algorithm will affect the final results.
Short read data can be supplied to SQANTI3 in several ways:
- Raw short-read data: for users who may want SQANTI3 to compute all short read-based attributes internally without the need to perform any additional pre-processing, short read FASTA/FASTQ files can be provided using the dedicated argument (see Raw short read data section below).
Alternatively, users may supply pre-processed short read data in one or more of these formats:
-
Short read-based transcript expression: see Short read isoform expression section below for details on how to include pre-computed, short read-based expression files into SQANTI3 QC (RSEM or Kallisto output files or a user-defined expression matrix).
-
Splice junction coverage by short-reads: short reads can be mapped to the genome using STAR to obtain junction coverage information, included in the the
SJ.out.tab
file (see Splice junction coverage section for details). -
Short read BAMs: should be obtained by mapping short reads against the reference genome using a splice-aware mapper, such as STAR. SQANTI3 QC will use this information to compute the TSS ratio.
Matching RNA-Seq data (i.e. obtained from the same samples as your long-read data) can be supplied to SQANTI3 in the form of FASTA/FASTQ files via the --short_reads
argument. The expected file format is a File Of File Names (.fofn), which is a text file that contains the paths of the Short-Read RNA-Seq data with one line per replicate and separated by one space in the case of paired-end data. It should look like this:
/path/to/replicate1.R1.fastq /path/to/replicate1.R2.fastq
/path/to/replicate2.R1.fastq /path/to/replicate2.R2.fastq
STAR and kallisto will be run to automatically calculate splice-junction coverage, isoform expression and the ratio_TSS
attribute, a metric that quantifies the differences of mean coverage 100bp upstream and downstream the reported TSS.
Please note that, if you are going to use the --short_reads
option, you should provide enough memory and computational resources for mapping and quantifying.
Warning: Kallisto will only be run if paired-end data is provided. If you wish to run Kallisto on single-end data, you may do it before running SQANTI3 and supply the isoform expression matrix as indicated below.
Users can supply their pre-computed isoform expression via the --expression
argument. Depending on whether this information is provided as one or multiple files, two formats are supported:
- If you input one expression file per sample, you can provide several expression data files as a chain of comma-separated paths or by providing a directory where ONLY expression data is present. Accepted formats include the output of Kallisto and RSEM (see below).
- Alternatively, it is possible to provide a pre-computed expression matrix. The matrix should be a tab-separated file with transcripts as rows and samples/replicates as columns. Transcript identifiers must match the ones included in the input long read-defined transcriptome, and the name for that column must be
ID
. The rest of the columns in the header can be named as desired, and should correspond to the sample/replicate identifier.
Kallisto expression files have the following format:
target_id length eff_length est_counts tpm
PB.1.1 1730 1447.8 0 0
PB.1.3 1958 1675.8 0 0
PB.2.1 213 54.454 0 0
PB.2.2 352 126.515 0 0
PB.3.1 153 40.3918 0 0
PB.4.1 1660 1377.8 0 0
PB.5.1 2767 2484.8 0 0
RSEM expression files have the following format:
transcript_id gene_id length effective_length expected_count TPM FPKM IsoPct posterior_mean_count posterior_standard
_deviation_of_count pme_TPM pme_FPKM IsoPct_from_pme_TPM TPM_ci_lower_bound TPM_ci_upper_bound FPKM_ci_lower_boun
d FPKM_ci_upper_bound
PB.1.1 PB.1 1516 1369.11 8.00 0.05 0.22 100.00 8.00 0.00 0.06 0.25 100.00 0.0233365 0.0984532 0.
0981584 0.413561
PB.10.1 PB.10 1101 954.11 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.04 100.00 4.80946e-08 0.0283585 2.
01809e-07 0.11905
PB.100.1 PB.100 979 832.11 0.00 0.00 0.00 0.00 6.62 6.60 0.08 0.35 4.00 3.51054e-06 0.241555 1.47313e-05 1.01397
PB.100.2 PB.100 226 81.11 20.18 2.26 9.47 100.00 16.84 5.20 1.99 8.36 96.00 0.559201 3.45703 2.
34572 14.5141
If you have short read data, you can run STAR to get the junction file (usually called SJ.out.tab
) and supply it to SQANTI3 via the -c
argument. It is possible to use a different mapper, however, SQANTI3 requires that the format of the junction is similar to that of STAR's SJ.out.tab
file (see below).
As described in the STAR manual:
SJ.out.tab contains high confidence collapsed splice junctions in tab-delimited format. Note that STAR defines the junction start/end as intronic bases, while many other software define them as exonic bases. The columns in
SJ.out.tab
have the following meaning:
- column 1: chromosome
- column 2: first base of the intron (1-based)
- column 3: last base of the intron (1-based)
- column 4: strand (0: undefined, 1: +, 2: -)
- column 5: intron motif: 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5:AT/AC, 6: GT/AT
- column 6: 0: unannotated, 1: annotated in the splice junctions database. Note that in 2-pass mode, junctions detected in the 1st pass are reported as annotated, in addition to annotated junctions from GTF.
- column 7: number of uniquely mapping reads crossing the junction
- column 8: number of multi-mapping reads crossing the junction
- column 9: maximum spliced alignment overhang The filtering for this output file is controlled by the
--outSJfilter
parameters.
If you have several samples, we recommend providing them as separated *SJ.out.tab
files instead of combined into a single file. To provide several files, you can follow different strategies:
- Supply the path of the directory where all the
SJ.out.tab
are situated. - Provide each of their individual paths separated by a comma.
- Use a wildcard to provide all paths at the same time, e.g.
/path/to/*my_suffix.tsv
.
If you used the -c
argument to provide coverage information, but you want to calculate ratio_TSS
values for each isoform, SQANTI3 QC will still require you to supply short read mapping information. Therefore, we recommend users to supply the BAM files obtained as a result of mapping short reads to the genome via the --SR_bam
option. Otherwise, you may supply short reads as raw data to SQANTI3 and have all pre-processing steps done internally, however, keep in mind that this will take longer as well as require more memory/computational resources.
To perform quality control of the Transcription Start Site (TSS), users may provide CAGE peak data. This is particularly relevant given that 5' RNA degradation can generate variability in the TSS that can be mistaken for a novel TSS.
CAGE peak data must be provided as a BED file, where:
- Column 1 contains the chromosome (chr1, chr2...).
- Column 2 contains the zero-based index start coordinate.
- Column 3 contains one-based index end coordinate.
- Column 6 contains strand information.
The BED file needs to be supplied to sqanti3_qc.py
via the --CAGE_peak
argument.
Please, note that public CAGE peak data for human and mouse from the refTSS database is supplied in SQANTI3's data
folder.
SQANTI3 QC will use this data to validate the transcript's TSS. First, it will evaluate whether the TSS of the different long read-defined transcripts fall inside a CAGE peak (within_CAGE_peak
column). Then, it will calculate the distance to the nearest detected CAGE peak (dist_to_CAGE_peak
column in the output *_classification.txt
file). A detailed glossary of classification file columns can be found here.
To perform quality control of the Transcription Termination Site (TTS), users may provide polyA motif or polyA site/peak information. SQANTI3 QC will use this data to create several descriptors regarding the presence of the TTS within polyA sites/motifs, as well as the distance to them, if detected.
The polyA motif list should be supplied in the form of a text file. The most common polyA motifs for human and mouse are supplied in the SQANTI3 data
folder (see here).
The list of polyA motifs with the --polyA_motif_list
parameter in sqanti3_qc.py
.
If a polyA motif is identified at the 3' end of the transcript, the polyA_motif_found
column in the classification file will be TRUE
. In addition, SQANTI3 returns the sequence (polyA_motif
) and the distance (polyA_dist
) to the closest polyA motif detected. A detailed glossary of classification file columns can be found here.
Complementary to polyA motif information, polyA site data can be supplied. It must be supplied as a BED file, where:
- Column 1 contains the chromosome (chr1, chr2...).
- Column 2 contains the zero-based index start coordinate.
- Column 3 contains one-based index end coordinate.
- Column 6 contains strand information.
For human, mouse, and worm, you can download public polyA site data from the PolyASite atlas. Note, however, that the chromosomes in the BED files are missing the "chr" prefix. You will need to modify the downloaded BED file as follows:
sed -i 's/^/chr/' atlas.clusters.2.0.GRCh38.96.bed
The polyA site BED file must be supplied to sqanti3_qc.py
via the --polyA_peak
argument.
Using this information, SQANTI3 will write out the distance to the closest polyA site/peak (dist_to_polyA_site
column) and whether the transcript's TTS was found inside a polyA site/peak (within_polyA_site
column) to the output *_classification.txt
file. A detailed glossary of classification file columns can be found here.
sqanti3_qc.py
supports single or multi-sample FL counts from PacBio Iso-Seq pipeline. There are three acceptable formats.
A single sample FL Count file is automatically produced by the Iso-Seq With Mapping pipeline in SMRTLink/SMRTAnalysis with the following format:
#
# -----------------
# Field explanation
# -----------------
# count_fl: Number of associated FL reads
# norm_fl: count_fl / total number of FL reads, mapped or unmapped
# Total Number of FL reads: 1065
#
pbid count_fl norm_fl
PB.1.1 2 1.8779e-03
PB.1.2 6 5.6338e-03
PB.1.3 3 2.8169e-03
PB.2.1 3 2.8169e-03
PB.3.1 2 1.8779e-03
PB.4.1 8 7.5117e-03
A multi-sample FL Count file produced by the chain_samples.py script in Cupcake will have the following format:
superPBID | sample1 | sample2 |
---|---|---|
PB.1.2 | 3 | NA |
PB.2.1 | 2 | NA |
PB.3.1 | 2 | 2 |
PB.3.2 | 5 | 3 |
PB.3.3 | 5 | 2 |
This is a tab-delimited file.
A multi-sample Demux FL Count file produced by the demux_isoseq_with_genome.py script in Cupcake will have the following format:
id,sample1,sample2
PB.1.1,3,10
PB.1.2,0,11
PB.1.3,4,4
This is a comma-delimited file.
For each sample provided through the --fl_count
option, sqanti3_qc.py
will create a column in the *_classification.txt
output file that is FL.<sample>
. Note that this is the raw FL count provided. The sum of all the FL reads accross the samples associated to one transcript will be recorded in th FL
column of the *_classification.txt
output file.
When plotted, the script SQANTI3_report.R will convert the FL counts to TPM using the formula:
raw FL count for PB.X.Y in sample1
FL_TPM(PB.X.Y,sample1) = ------------------------------------- x 10^6
total FL count in sample1
Two additional columns, FL_TPM.<sample>
and FL_TPM.<sample>_log10
will be added and output to a new classification file with the suffix *_classification_TPM.txt
. Please, do not mix up *_classification_TPM.txt
and *_classification.txt
files. The one used in subsequent scripts (filtering, future isoAnnot, etc.) will be the _classification.txt
one. A detailed glossary of classification file columns can be found here.
As of v5.2, SQANTI3 includes the --ratio_TSS_metric
argument, which can be used to tweak the results in the TSS_ratio
column of the
classification.txt file. Briefly, SQANTI3 calculates the TSS ratio metric for all supplied short-read samples and replicates, and then summarizes the results to provide a single TSS ratio value.
The following options are available:
-
max
(default): uses the maximum value among the short-read samples and replicates supplied. -
mean
: mean value of samples and replicates is used. -
median
: median value of samples and replicates is used. -
3rd quartile
: the 3rd quartile value of samplesl and replicates is used.
When aiming to discover novel and/or rare TSS, this metric will ensure that underrepresented sites will not be disregarded just because they are only captured in one sample. median
and 3rd quartile
, on the other hand, allow the user to enforce different levels of robustness, preventing the metric from being driven by outliers and favoring widely-detected TSS to be preserved after QC. mean
values, on the other hand, provide a balance between both scenarios in situations in which the degree of TSS novelty of the sample is unknown.
When the --isoAnnotLite
flag is supplied to SQANTI3 QC, the tool will run IsoAnnotLite internally. IsoAnnotLite is a Python script designed to use the SQANTI3 QC output to generate the input for tappAS, i.e. tappAS-compatible GFF3 file (see "Annotation features file format" section in the tappAS overview).
If you want IsoAnnotLite to perform positional transfer of functional features, you will also need to provide a pre-annotated tappAS GFF3 file via the --gff3
argument. By doing this, functional feature information will be added to your long read-defined transcriptome, unlocking tappAS functional and feature-level analyses. You can find all functional annotation files available for tappAS here.
Note that, if no pre-annotated tappAS GFF3 file is provided (e.g. because there is no file available for your species), IsoAnnotLite will only perform structural information transfer, but a tappAS-like GFF3 file containing this structural information will still be generated. Therefore, in this case, users will still be able to load their data into tappAS for isoform-level expression analyses.
Warning: we are aware that some of the tappAS GFF3 files correspond to old transcriptome releases of reference databases (ENSEMBL, RefSeq, etc.). We are currently working on updating these annotations to later transcriptome versions.
There are two options related to parallelization:
- The
-t
(--cpus
) parameter designates the number of CPUs used by the aligners for long and short reads. If your have supplied your transcriptome a GTF file and you do not provide short-read FASTA/FASTQ files, the-t
option has no effect. - The
-n
(--chunks
) parameter divides the input transcriptome (GTF or FASTA/FASTQ) into chunks and runs SQANTI3 in parallel for each fragment, then combining the results into one classification and junctions file.
Note that both of this parameters can be combined, for instance, if you have set -t 30 -n 10
, then each chunk will get (30/10=3) CPUs.
Wiki index
- Introduction to SQANTI3
- Dependencies and installation
- Version history
- Isoform classification: categories and subcategories
- Running SQANTI3 quality control
- Understanding the output of SQANTI3 QC
- IsoAnnotLite
- Running SQANTI3 filter
- Running SQANTI3 rescue
- Tutorial: running SQANTI3 on an example dataset
- Running SQANTI-reads
- Memory requirements to use parallelization