Skip to content

Running SQANTI3 rescue

Ángeles Arzalluz-Luque edited this page Jul 5, 2022 · 33 revisions

⚠️UNDER CONSTRUCTION⚠️

Table of contents:


Introduction

As of SQANTI3 v5.1, a new module has been added to the SQANTI3 workflow for transcriptome characterization and quality control: SQANTI3 rescue.

The SQANTI3 rescue algorithm is designed to be run after transcriptome filtering and uses the long read-based evidence provided by discarded isoforms (i.e. artifacts) to recover transcripts in the associated reference transcriptome. The idea behind this strategy is to avoid losing transcripts/genes that are detected as expressed by long read sequencing, but whose start/end/junctions could not be confidently validated using orthogonal data, resulting in the removal of those genes/transcripts from the transriptome. More details about this can be found in the Motivation section below.

In particular, during the rescue, SQANTI3 will try to confidently assign each discarded artifact to the best matching reference transcript. As a result, SQANTI3 rescue will generate an expanded transcriptome GTF including a set of reference transcripts as well as the long read-defined isoforms that passed the filter.

Motivation

To be completed.

SQANTI3 rescue strategy

The SQANTI3 rescue algorithm consists in the following steps:

1. Automatic rescue

As explained above, the rescue strategy in SQ3 was conceived to recover transcriptome diversity lost during filtering. This, among other things, means verifying that mone of the reference-supported junction chains that were initially detected by long-reads are lost due to stringent artifact removal.

To achieve this during automatic rescue, all reference transcripts that were represented by at least one FSM in the original post-QC transcriptome are first retrieved -note that this information is available in the associated_transcript column of the *_classification.txt file. Then, those reference transcripts for which all FSM representatives were removed by the filter are rescued.

The previous analytic decision is justified because, in practice, any case where all FSMs with the same associated_transcript are removed can be interpreted as follows: 1) the TSS and/or TTS of the long read-defined transcript is different from that of the matching reference transcript, however, it could not be validated by SQ3 QC-supplied orthogonal data; and 2) the junctions are identical to those found in the reference, which can be interpreted as evidence that this isoform is real. As a result, SQ3 will not rescue any of the discarded FSMs, but the associated_transcript from the reference.

2. Selection of rescue candidate and target transcripts

In spite of its potential to achieve the goals that we set for the rescue, the previous strategy does not consider ISM, NIC or NNC artifacts. These will be included in the rescue candidate group, i.e. transcripts classified as artifacts for which SQANTI3 will try to find a matching reference transcript to include in the final, curated transcriptome.

  • For ISM artifact transcripts, there are two possible situations:

    • There will be cases where the same discarded reference transcript is supported by FSM and ISM. ISM artifacts with an FSM artifact counterpart will therefore be "collapsed" into the rescued reference transcript during the automatic rescue step.

    • Conversely, there will be associated_transcript references that are only supported by one or more ISM. Those ISM artifacts that constituted evidence of a non-FSM supported reference will therefore be included in the rescue candidate list.

  • For novel transcripts from the NIC and NNC categories, since there is no associated transcript information, all transcripts classified as artifacts will be included in the rescue candidate list.

As a result, we consider all reference or long read-defined transcripts from genes that have at least one rescue candidate to be rescue targets.

3. Mapping of candidates to targets

SQ3 rescue next tries to find matches between each rescue target and its same-gene candidates based on sequence similarity. To achieve this, we perform an internal mapping step using minimap2. In it, rescue candidates are considered to be "reads" and rescue targets are used as a "reference genome" in which each transcript sequence constitutes a different "chromosome".

To map candidates, we use the map-hifi option in minimap2 and the -a -x parameters:

minimap2 --secondary=yes -ax map-hifi rescue_targets.fasta rescue_candidates.fasta > mapped_rescue.sam

Finally, all candidate-target pairs obtained during mapping -referred to as mapping hits- are obtained from the output SAM file regardless of whether they are primary or secondary alignments.

4. Application of SQ3 filter to the reference transcriptome

Validation of transcripts using orthogonal sources of data is an important part of the SQANTI3 philosophy. In consequence, the rescue strategy in SQ3 includes a validation step for all reference rescue targets before considering them for inclusion in the transcriptome, since no QC information is available for them (in contrast to long read-defined targets).

This requires users to run SQANTI3 quality control on the reference transcriptome and supply the output *_classification.txt file to SQANTI3 rescue using the --refClassif (-k) flag. This must be done using the same orthogonal data files that were used when running SQANTI3 QC for the long read-transcriptome, since the rescue is based on the assumption that the same evidence is required to validate all rescue targets.

⚠️ Disclaimer: note that this is a temporary solution -we are currently working on implementing an option to run this step internally as a part of the rescue-, however, we are aware that the same reference may be used for multiple transcriptomes, therefore, this option will remain active to save runtime and resources. ⚠️

Using the supplied reference classification file, SQ3 rescue will next apply SQ3 filter to the reference transcriptome. The filter to be applied will be specified by the rules or ml flags used when running the rescue. This means that, if you run SQ3 machine learning-based filter, you should also run the rescue using the ml option (and the same is true for the rules filter).

5. Rescue-by-mapping

In the final rescue step, hereby referred to as rescue-by-mapping, mapping hits (i.e. candidate-target pairs obtained during mapping) and reference transcriptome filter results are combined to generate a final list of reference transcripts to be rescued.

This part of the rescue can be divided into the following tasks/criteria:

  • SQ3 filtering of targets: first, candidate-target pairs are removed if the rescue target did not pass the filter (ML or rules). If the user is working with the rules filter, this will mean that the reference (or long-read) target transcript did not pass the rules defined in the JSON file. If the user is working with the ML filter, this will mean that the target transcript did not pass the specified ML probability threshold. In the case of the ML filter, however, only the target transcript(s) with maximum ML probability are selected to continue in the rescue.

  • Select reference targets: as explained above, both long read-defined and reference target transcripts are considering during mapping. As a result, rescue candidates (i.e. artifacts) can match to either of these transcripts. From this step onwards, however, only targets selected from the reference transcriptome are considered. This is done to avoid introducing redundancy in the final transcriptome: if a rescue candidate's perfect match was another long read-defined counterpart from the same gene, there is no need to include another transcript as a rescued representative, since the best matching transcript is already part of the transcriptome.

  • Remove redundant reference transcripts: during rescue, there is a chance that some of the reference transcripts that we wish to include in our transcriptome is already represented by an FSM from the same gene (or that it has been already retrieved during automatic rescue). In this situation, target reference transcripts that are already represented by a long read-defined isoform are not to be included in the transcriptome in order to avoid increasing redundancy. In other words -and similarly to what was previously stated-, these are cases in which the artifact's (i.e. rescue candidate) best matching transcript is already represented by an isoform in the transcriptome.

After performing this last filter of the rescue target list, SQ3 rescue outputs a list of rescued reference transcripts, which are then added to the long-read transcriptome GTF.

Running SQ3 rescue

Similarly to the SQANTI3 filter, the SQANTI3 rescue is designed as a dual implementation, depending on whether the rules or the machine learning filter was previously run. Therefore, the sqanti3_rescue.py script requires a flag to be provided to activate either the ml or rules specific rescue.

usage: sqanti3_rescue.py [-h] {ml,rules} ...

Rescue artifacts discarded by the SQANTI3 filter, i.e. find closest match for
the artifacts in the reference transcriptome and add them to the
transcriptome.

positional arguments:
  {ml,rules}

optional arguments:
  -h, --help  show this help message and exit

Input files and arguments

Regardless of the rescue mode that is selected, SQ3 has the following common arguments, all of which are mandatory:

  • Long read-defined isoforms: should be the FASTA file generated by SQANTI3 QC (*_corrected.fasta). The isoform sequences will be used to extract rescue candidates and targets for mapping, and must be supplied via the --isoforms argument.

  • Long read-defined transcriptome annotation (filtered): should be the GTF file output by SQANTI3 filter (*.filtered.gtf). Rescued transcripts will be appended to this GTF file to generate the final curated transcriptome. This file must be supplied via the --gtf argument.

  • Reference transcriptome annotation: reference GTF used to run SQANTI3 QC. This file will be used to extract rescue targets for mapping, and must be supplied via the --refGTF (or -g) argument.

  • Reference transcriptome classification file generated after running SQANTI3 QC on the reference transcriptome, which must be done previously to running the rescue and using the same orthogonal data as for long read-defined transcriptome QC. This file must be supplied via the --refClassif (or -k) argument and will be used to evaluate reference rescue target support (see details above).

Additionally, the following parameters can be set to modify the behavior of the rescue algorithm:

  • Define rescue for mono-exon transcripts: the -e flag can be used to determine whether (and how) to perform the rescue on mono-exonic transcripts. The following options can be supplied:

    • all (default): all mono-exon transcripts classified as artifacts will be considered as rescue candidates, regardless of the category that they are classified into.
    • fsm: mono-exonic transcripts will only be considered for the rescue if they belong to the full-splice match (FSM) category. In this case, mono-exons will be rescued via the automatic rescue strategy.
    • none: do not consider mono-exonic transcripts for the rescue (all mono-exons classified as artifacts will be discarded and no reference representatives included in the final transcriptome).
  • Output directory and outfile prefix: the output directory can be set via the -d flag (default: current directory). Output files will be named using the prefix provided using the -o argument (default: SQANTI3).

  • Skip report creation: when the --skip_report is supplied, no summary rescue report will be generated.

Rules rescue arguments

In addition to the common arguments, the rules rescue requires the following specific files:

  • Rules JSON file: using the -j flag, the user must provide the JSON file used for running SQANTI3 rules filter on the long read-defined transcriptome. This same set of rules will be applied to the reference transcriptome to evaluate the reliability of reference rescue targets. Note that, to achieve this, SQANTI3 rescue requires the classification file output after running SQANTI3 QC on the reference to be provided via the --refClassif argument.

All in all, these are the arguments accepted by sqanti3_rescue.py rules:

usage: sqanti3_rescue.py rules [-h] [--isoforms ISOFORMS] [--gtf GTF] [-g REFGTF] 
                               [-f REFGENOME] [-k REFCLASSIF]
                               [-e {all,fsm,none}] [-o OUTPUT] [-d DIR] 
                               [--skip_report] [-v] [-j JSON]
                               sqanti_filter_classif

Rescue for rules-filtered transcriptomes.

positional arguments:
  sqanti_filter_classif
                        SQANTI filter (ML or rules) output classification file.

optional arguments:
  -h, --help            show this help message and exit
  --isoforms ISOFORMS   FASTA file output by SQANTI3 QC (*_corrected.fasta), 
                        i.e. the full long read transcriptome.
  --gtf GTF             GTF file output by SQANTI3 filter (*.filtered.gtf).
  -g REFGTF, --refGTF REFGTF
                        Full path to reference transcriptome GTF used when 
                        running SQANTI3 QC.
  -f REFGENOME, --refGenome REFGENOME
                        Full path to reference genome FASTA used when 
                        running SQANTI3 QC.
  -k REFCLASSIF, --refClassif REFCLASSIF
                        Full path to the classification file obtained when 
                        running SQANTI3 QC on the reference transcriptome.
  -e {all,fsm,none}, --rescue_mono_exonic {all,fsm,none}
                        Whether or not to include mono-exonic artifacts in 
                        the rescue. Options include: none, fsm and all (default).
  -o OUTPUT, --output OUTPUT
                        Prefix for output files.
  -d DIR, --dir DIR     Directory for output files. Default: Directory where 
                        the script was run.
  --skip_report         Skip creation of a report about the filtering
  -v, --version         Display program version number.
  -j JSON, --json JSON  Full path to the JSON file including the rules used when 
                        running the SQANTI3 rules filter.

Machine learning rescue arguments

In addition to the common arguments, the machine learning rescue requires the following specific files:

  • Pre-trained random forest classifier: using the -r flag, users must provide the randomforest.RData object that was obtained when running the machine learning filter on the long read-defined transcriptome. This classifier will be used to run the ML filter on reference transcriptome isoforms, i.e. obtain an ML probability value for reference rescue targets and evaluate their likelihood to be a true isoform based on orthogonal data supplied to SQANTI3 QC. Note that, to achieve this, SQANTI3 rescue requires the classification file output after running SQANTI3 QC on the reference to be provided via the --refClassif argument.

  • ML probability threshold: minimum probability value that is required in order to consider a transcript to be a true isoform. It should be set using the -j flag (default: 0.7). We recommend setting the same threshold that was used when running SQANTI3 filter.

All in all, these are the arguments accepted by sqanti3_rescue.py ml:

usage: sqanti3_rescue.py ml [-h] [--isoforms ISOFORMS] [--gtf GTF] [-g REFGTF] 
                            [-f REFGENOME] [-k REFCLASSIF]
                            [-e {all,fsm,none}] [-o OUTPUT] [-d DIR] 
                            [--skip_report] [-v] [-r RANDOMFOREST] [-j THRESHOLD]
                            sqanti_filter_classif

Rescue for ML-filtered transcriptomes.

positional arguments:
  sqanti_filter_classif
                        SQANTI filter (ML or rules) output classification file.

optional arguments:
  -h, --help            show this help message and exit
  --isoforms ISOFORMS   FASTA file output by SQANTI3 QC (*_corrected.fasta), 
                        i.e. the full long read transcriptome.
  --gtf GTF             GTF file output by SQANTI3 filter (*.filtered.gtf).
  -g REFGTF, --refGTF REFGTF
                        Full path to reference transcriptome GTF used when 
                        running SQANTI3 QC.
  -f REFGENOME, --refGenome REFGENOME
                        Full path to reference genome FASTA used when 
                        running SQANTI3 QC.
  -k REFCLASSIF, --refClassif REFCLASSIF
                        Full path to the classification file obtained when 
                        running SQANTI3 QC on the reference transcriptome.
  -e {all,fsm,none}, --rescue_mono_exonic {all,fsm,none}
                        Whether or not to include mono-exonic artifacts in 
                        the rescue. Options include: none, fsm and all (default).
  -o OUTPUT, --output OUTPUT
                        Prefix for output files.
  -d DIR, --dir DIR     Directory for output files. Default: Directory where 
                        the script was run.
  --skip_report         Skip creation of a report about the filtering
  -v, --version         Display program version number.
  -r RANDOMFOREST, --randomforest RANDOMFOREST
                        Full path to the randomforest.RData object obtained when 
                        running the SQANTI3 ML filter.
  -j THRESHOLD, --threshold THRESHOLD
                        Default: 0.7. Machine learning probability threshold to 
                        filter elegible rescue targets (mapping hits).

Rescue output files

The final result of SQANTI3 rescue is a transcriptome GTF file - named prefix_rescued.gtf - including all the transcripts that were classified as isoforms by either the rules or the ML filter (already included in the supplied prefix.filtered.gtf file) and the rescued transcript isoforms that were selected from the reference transcriptome (see rescue strategy section above).

Other additional output files that are generated in the rescue process will be explained below.

Common rescue output files

  • Automatic rescue result: a single-column file named prefix_automatic_rescued_list.tsv containing the IDs of reference transcripts that were selected during automatic rescue.

  • Rescue candidate files. Briefly, rescue candidates are a group artifact transcripts from the ISM, NIC and NNC categories (which did not undergo automatic rescue) for which a reference transcript match needs to be found. Two rescue candidate files are generated:

    • A rescue candidate ID list, named prefix_rescue_candidates.tsv.
    • A rescue candidate FASTA file named prefix_rescue_candidates.fasta containing the sequences of rescue candidates for mapping.
  • Rescue target files. As explained above, rescue targets are those reference and long read-defined transcripts for which there is a same-gene rescue candidate counterpart. During the rescue, candidates are mapped to targets using minimap2. Similarly to rescue candidates, two rescue target files are generated:

    • ID list: prefix_rescue_targets.tsv.
    • FASTA file: prefix_rescue_targets.fasta.
  • Mapping results: as a result of mapping with minimap2, two files are generated by SQ3 rescue:

    • Minimap2 SAM file containing the mapping results, named prefix_mapped_rescue.sam.
    • A three-column table in which the mapping results are summarized, named prefix_rescue_mapping_hits.tsv. This table has the following format:
PB.1.3	0	PB.7707.5
PB.1.3	256	ENST00000557932.5
PB.1.3	256	ENST00000558784.5
PB.1.3	256	ENST00000442898.5
PB.1.3	256	ENST00000692602.1
PB.1.3	256	ENST00000686495.1

Here, the first column contains the rescue candidate ID, the second column includes the SAM flag indicating the type of alignment and the last column indicates the ID of the rescue target that the candidate was successfully mapped to.

  • Rescue summary table: the prefix_rescue_table.tsv file contains a table in which rescue outcome for all the target-candidate pairs included in the mapping hit table is detailed.

    • Column 1 (rescue_candidate): rescue candidate ID.

    • Column 2 (sam_flag): SAM flag indicating the type of alignment.

    • Column 3 (mapping_hit): ID of the rescue target that the candidate was successfully mapped to.

    • Column 4 shows the filter result for the mapping hit (or rescue candidate), and may be named hit_POS_MLprob if the ml mode was used, or hit_filter_result if the rules mode was used.

      • For ML, the hit_POS_MLprob column will contain the probability that the hit is a true isoform provided by the pre-trained random forest classifier.
      • For rules, the hit_filter_result column will read Isoform or Artifact depending on whether or not the hit passed the rules defined in the JSON file.
    • Column 5 (candidate_structural_category): SQANTI3 category that the rescue candidate (i.e. the artifact that is being considered for rescue) belongs to.

    • Column 6 (rescue_result): final rescue decision. Will be rescued or not_rescued depending on whether or not the isoform was finally included in the rescued transcriptome GTF

    • Column 7 (exclusion_reason): if rescue_result == not_rescued, this column will include a flag explaining why the rescue target was not included in the rescue. On the contrary, if rescue_result == rescued, the column will contain NA. Possible exclusion reasons are:

      • long_read_transcript if the mapping hit is a long read-defined isoform, and is therefore already present in the transcriptome.
      • reference_already_present if the mapping hit is a reference transcript that is already represented by an isoform, be it an FSM that passed the filter or a transcript that was obtained during automatic rescue.
      • For ML rescue, ML_probability if the mapping hit did not pass the supplied ML probability threshold. Alternatively, for rules rescue, the flag will read artifact_by_rules, meaning that the mapping hit did not pass the rules specified in the JSON file.

An example of this table (in this case for the ml rescue mode) can be viewed here:

rescue_candidate	sam_flag	mapping_hit	hit_POS_MLprob	candidate_structural_category	rescue_result	exclusion_reason
PB.1.3	0	PB.7707.5	0.754	novel_not_in_catalog	not_rescued	long_read_transcript
PB.1.3	256	ENST00000557932.5	0.93	novel_not_in_catalog	not_rescued	reference_already_present
PB.1.3	256	ENST00000558784.5	0.098	novel_not_in_catalog	not_rescued	ML_probability
PB.1.3	256	ENST00000442898.5	0.18	novel_not_in_catalog	not_rescued	ML_probability
PB.1.3	256	ENST00000692602.1	0.998	novel_not_in_catalog	not_rescued	reference_already_present
PB.1.3	256	ENST00000686495.1	0.066	novel_not_in_catalog	not_rescued	ML_probability

  • Rescue inclusion list: a single-column file containing the IDs of all rescued transcripts, named prefix_rescue_inclusion-list.tsv. In this list, both the IDs of rescued transcripts obtained via automatic rescue and via rescue-by-mapping are combined into a single inclusion list. This list is used to generate the final rescued transcriptome GTF, i.e. prefix_rescued.gtf.

Rules-specific output

Machine learning-specific output

In addition to the files listed above, the ML rescue will also output a table including the result obtained after running the pre-trained ML classifier on the reference transcriptome. This file, named prefix_reference_isoform_predict.tsv, will contain reference transcript IDs in the first column and random forest classifier-generated probability values on the second. This probability can be interpreted as the extent to which the transcript is supported by SQANTI3 QC orthogonal data (short-reads, CAGE/polyA peaks, etc.). An example of this output is represented below:

isoform	POS_MLprob
DQ459430	0.25
DQ516784	0.262
DQ516752	0.27
DQ668364	0.262
DQ883670	0.804
EF011062	0.512
DQ875385	0.758