Skip to content

Running SQANTI3 filter

Ángeles Arzalluz-Luque edited this page Apr 26, 2022 · 22 revisions

Rules filter: removing artifacts using the SQANTI3 output and user-defined rules

I've made a lightweight filtering script based on SQANTI3 output that filters for two things: (a) intra-priming and (b) short read junction support.

The script usage is:

usage: sqanti3_RulesFilter.py [-h] [--sam SAM] [--faa FAA] [-a INTRAPRIMING]
                              [-r RUNALENGTH] [-m MAX_DIST_TO_KNOWN_END]
                              [-c MIN_COV] [--filter_mono_exonic] [--skipGTF]
                              [--skipFaFq] [--skipJunction] [-v]
                              sqanti_class isoforms gtf_file

sqanti3_RulesFilter.py: error: the following arguments are required: sqanti_class, isoforms, gtf_file

python sqanti3_RulesFilter.py [classification] [fasta] [sam] [gtf]
         [-a INTRAPRIMING] [-c MIN_COV] [-m MAX_DIST_TO_KNOWN_END]

where -a determines the fraction of genomic 'A's above which the isoform will be filtered. The default is -a 0.6. -r is another option for looking at genomic 'A's that looks at the immediate run-A length. The default is -r 6.

-m sets the maximum distance to an annotated 3' end (the diff_to_gene_TTS field in classification output) to offset the intrapriming rule.

-c is the filter for the minimum short read junction support (looking at the min_cov field in _classification.txt), and can only be used if you have short read data.

For example:

python sqanti3_RulesFilter.py test_classification.txt \
                         test.renamed_corrected.fasta \
                         test.gtf

The current filtering rules are as follow:

  • If a transcript is FSM, then it is kept unless the 3' end is unreliable (intrapriming).
  • If a transcript is not FSM, then it is kept only if all of below are true:
    • (1) 3' end is reliable.
    • (2) does not have a junction that is labeled as RTSwitching.
    • (3) all junctions are either canonical or has short read coverage above -c threshold.

Machine learning filter: removing artifacts using a random forest classifier

SQANTI3 incorporates an automated filter based on a random forest classifier, which is designed to discriminate potential artifacts from true isoforms without the need for user-defined rules or manually-set thresholds.

Briefly, this classifier learns high and low quality attributes from a True Positive (TP) and True Negative (TN) transcript set, building a model that discriminates artifacts and isoforms based on TN and TP properties. The filter returns a modified version of the *_classification.txt file (named *_MLresult_classification.txt). The new table will include a filter_result column with the label Isoform or Artifact, which will be the result of the random forest classifier.

Disclaimer: we hereby provide instructions to run the filter and a comprehensive description of its parameters. While we have included recommendations on how to best tune them to obtain optimal results, users are encouraged to try several configurations of the filter to find the best way to run it for their particular dataset.

Running the machine learning filter

Similarly to the rules filter, the machine learning filter (ML filter) can be run using the sqanti3_filter.py script by providing the ML argument and the path to the SQANTI *_classification.txt file (output by sqanti3_qc.py):

python sqanti3_filter.py ML path/to/classification.txt 

The ML filter also accepts the following parameters:

usage sqanti3_filter.py ML [-h] [--isoforms ISOFORMS] [--gtf GTF] [--sam SAM] [--faa FAA] [-o OUTPUT] [-d DIR]
                  [-v] [--report] [-t PERCENT_TRAINING] [-p TP] [-n TN] [-j THRESHOLD] [-f FORCE_FSM_IN]
                  [--intermediate_files INTERMEDIATE_FILES] [-r REMOVE_COLUMNS] [-z MAX_CLASS_SIZE]
                  [-i INTRAPRIMING] [-e]
                  sqanti_class

positional arguments:
  sqanti_class          SQANTI3 QC classification file.

optional arguments:
  -h, --help            show this help message and exit
  --isoforms ISOFORMS   fasta/fastq isoform file to be filtered
  --gtf GTF             GTF file to be filtered
  --sam SAM             SAM alignment of the input fasta/fastq
  --faa FAA             ORF prediction faa file to be filtered by SQANTI3
  -o OUTPUT, --output OUTPUT
                        Prefix for output files.
  -d DIR, --dir DIR     Directory for output files. Default: Directory where the script was run.
  -v, --version         Display program version number.
  --report              Create a report about the filtering
  -t PERCENT_TRAINING, --percent_training PERCENT_TRAINING
                        Proportion of the data that goes to training (parameter p of the function
                        createDataPartition). Default: 0.8
  -p TP, --TP TP        Path to file containing the list of the TP transcripts, one ID by line, no header
                        (optional). If not supplied, it will be generated from input data.
  -n TN, --TN TN        Path to file containing the list of the TN transcripts, one ID by line, no header
                        (optional). If not supplied, it will be generated from input data.
  -j THRESHOLD, --threshold THRESHOLD
                        Machine Learning probability threshold to classify transcripts as positive isoforms.
  -f FORCE_FSM_IN, --force_fsm_in FORCE_FSM_IN
                        If activated, forces retaining only multi-exon transcripts, all mono-exon isoforms will be
                        automatically removed.
  --intermediate_files INTERMEDIATE_FILES
                        If activated, outputs ML filter intermediate files.
  -r REMOVE_COLUMNS, --remove_columns REMOVE_COLUMNS
                        Path to single-column file (no header) containing the names of the columns in SQ3's
                        classification.txt file that are to be excluded during random forest training (optional).
  -z MAX_CLASS_SIZE, --max_class_size MAX_CLASS_SIZE
                        Maximum number of isoforms to include in True Positive and True Negative sets. TP and TN
                        sets will be downsized to this value if they are larger.
  -i INTRAPRIMING, --intrapriming INTRAPRIMING
                        Adenine percentage at genomic 3' end to flag an isoform as intra-priming (default: 60 )
  -e, --filter_mono_exonic
                        Filter out all mono-exonic transcripts (default: OFF)

We next provide a detailed description of each of the ML filter parameters in order to help users understand how to best apply it to their own data.

True Positive (TP) and True Negative (TN) sets

As stated above, the random forest classifier model needs to be trained on a subset of isoforms from the transcriptome. This will include a set of True Positive isoforms -that is, isoforms that are reliable enough for the classifier to learn the properties of a real isoform- and a set of True Negative isoforms -that is, isoforms that can be considered low-quality or false, from which the classifier will learn what constitutes an artifact or false-positive isoform-.

Although the ML filter includes built-in selection of training data from specific SQANTI categories, it also accepts user-defined TN and TP sets (--TN [-n] and --TP [-p] parameters, respectively). These should be provided as single-column files (with no header) containing the IDs of the selected isoforms, and will be used to train and test the random forest classifier.

If not provided, the SQANTI ML filter will use the following as built-in TP and TN sets by default:

  • TP: all Reference Match (RM) isoforms in the transcriptome (multi-exon only). RM are an FSM subcategory in which the TSS and TTS of the isoform are within +/-50bp of the TSS/TTS of the reference associated transcript (see category description page for details). If there are less than 40 RM isoforms, all FSMs will be taken as TP.
  • TN: all Novel Not in Catalog (NNC) isoforms that have at least one non-canonical junction (multi-exo only). If there are less than 40 isoforms in this group (normally because there are not enough non-canonical junctions in the transcriptome), all NNC isoforms will be taken as TN.

Note that both user-defined and built-in training data will be subset if the number of isoforms in the TP or TN isoform sets is larger than the other. By default, SQANTI3 will downsize the larger set to match the smaller of the two sizes. Downsizing is performed by random sampling. When using the built-in option, and given this random sampling step, TP and TN lists will be written out as part of the SQANTI ML filter output.

Users may change the --max_class_size [-z] parameter to set a maximum number of isoforms to include in the TP and TN sets. By default, TP and TN lists will be downsized to 3000 transcripts if they are larger that this, no matter if they were user-defined or generated internally.

Partitioning of the training data (model training/testing)

TP and TN isoform sets need to be split into training and test data to correctly generate the random forest model. Using the --percent_training or -t parameter, users can specify the proportion of the data that is used for training. By default, 80% of the isoforms will be used for training (which is equivalent to supplying -t 0.8). The remaining 20% will be used to test the model.

The ML filter uses functions from the caret R package to build and test the random forest classifier. Briefly, the model training/test includes the following internal steps:

  1. Data partitioning via the createDataPartition() function.
  2. Cross-validation (10x) using the trainControl() function.
  3. Model training using the train() function.
  4. Testing of the model using the predict() function from the stats package in base R.
  5. Output model statistics using several other functions from the caret package (see the ML filter output section for details).

Warning: after training, the model will be stored as an .RData object in the specified output folder. If the filter is re-run on the same directory, SQANTI will detect that a model already exists, skipping the training step and applying the extant model to perform the Isoform/Artifact classification on the input data. This is useful on occassions where a user wants to apply an already-generated model to a different transcriptome; however, if you wish to train a new model, you will need to delete/move/rename the previous model object.

Probability threshold to define Isoforms/Artifacts

Ultimately, flagging transcripts as Isoforms or Artifacts is based on the random forest probability to correctly classify a transcript into either group. These probabilities are included in the POS_MLprob and NEG_MLprob columns of the output _MLresult_classification.txt file. Intuitively:

  • POS_MLprob is the probability to classify the transcript as an Isoform.
  • NEG_MLprob is the probability to classify the transcript as an Artifact, and is equivalent to 1 - POS_MLprob.

The SQANTI ML filter will assign the Artifact and Isoform labels based on a probability threshold parameter supplied via the --threshold or -j argument. By default, the filter is more stringent on the Isoform condition than it is on the Artifact condition, that is: by default, transcripts will be considered isoforms when POS_MLprob is larger than 0.7.

This results in some artifacts having POS_MLprob > 0.5 (which can be counter-intuitive). Of course, in this scenario the filter will exclude more transcripts than when using the usual POS_MLprob > NEG_MLprob approach. However, users may reproduce this leninent filter by lowering the threshold to 0.5.

Classification file columns excluded from the ML filter

Due to their lack of importance for artifact definition (or to prevent them from having unwanted effects on the filtering), the following columns are removed from the classification table prior to running the SQANTI ML filter:

  • Chromosome and strand info (chrom, strand)
  • Reference gene/transcript info (associated_gene, associated_transcript)
  • Structural information about the reference or the transcript (ref_length, ref_exons, ORF_length, CDS_length, CDS_start, CDS_end, CDS_genomic_start, CDS_genomic_end).
  • Redundant junction information (all_canonical).
  • Sequence information (seq_A_downstream_TTS, ORF_seq).
  • SQANTI category (structural_category, subcategory).

Excluding additional data columns from the filter

The SQANTI ML filter incorporates the option to exclude columns from the _classification.txt file from model training to prevent filtering based on them. These should be provided as a single-column file including the column names exactly as they are named in the classification file. A path to this file should then be supplied via the --remove_columns or -r argument.

This is particularly useful to prevent overfitting during model training. In particular, there may be situations in which users wish to define TP and TN sets based on the values of specific SQ3 columns. We recommend that these are, in turn, excluded from the ML filter to prevent them from gaining too much importance during Isoform/Artifact classification.

Warning: we always encourage testing several ML filter configurations and TP/TN sets before making a final decision on which are the artifacts in your transcriptome.

Intra-priming filter

The ML filter script in SQANTI3 includes a simple threshold-based filter to prevent potential intra-priming artifacts from remaining in the transcriptome, which is not specifically considered by the ML filter. Still, note that the perc_A_downstream_TTS column, which is used to flag transcripts as potentially resulting from intra-priming, is considered during random forest training.

To apply it, users may supply the maximum percentage of adenines ("A" nucleotides) that is to be allowed in the genome sequence region immediately following the end of the transcript (i.e. the TTS). This can be provided via the --intrapriming or -i argument, resulting in perc_A_downstream_TTS > i isoforms being flagged as intra-priming artifacts. By default, transcripts containing more than 60% A's at the genomic 3' end will be flagged.

Forcing the removal/retention of specific isoform groups

  1. Full-splice match (FSM) transcripts can be excluded from the ML filter and therefore retained by providing the --force_fsm_in or -f argument. By default, FSM will be run through the filter along with the isoforms from the rest of the structural categories.

  2. All mono-exonic transcripts can be automatically removed by providing the --filter_mono_exonic or -e parameter. Note that mono-exon transcripts will not be evaluated by the ML filter (only by the intra-priming filter).

Exceptions to running the SQANTI3 ML filter

The random forest classifier cannot/will not be run in any of the following scenarios:

  • All transcripts in the long read-defined transcriptome are classified as mono-exon. The ML filter can only be applied to multi-exon transcripts.
  • One (or both) of the user-defined TP and TN sets have less than 40 isoforms.
  • One of the default TP and TN categories have less than 40 isoforms. This applies when no user-defined TP and TN sets are provided.

However, even in scenarios where no ML-based filtering can be applied, the intra-primming filter will still be applied to all input transcripts.

SQANTI ML filter output

The SQANTI ML filter output files are written to the path specified using the --dir or -d argument, also appending the prefix provided via the --output or -o argument.

The following output files are generated after running the filter:

Input-related files

  • params.txt: two-column file including the name of the argument and the value that was used to run the filter.
  • TP_list.txt and TN_list.txt: a single-column text file including the IDs of the isoforms used as TP and TN (generated only when no user-defined sets were provided).

The supplied prefix will be appended to the filenames above.

Model-related files

  • randomforest.RData: R object containing the trained random forest model.
  • classifier_variagble-importance_table.txt: two-column text file including the names of the variables that were used for classification and a numeric value indicating their importance in the random forest classifier.
  • intermediate_*_MLinput_table.txt: training-ready classification table. SQANTI ML filter performs a series of modifications of the classification table to enable running of the training/test steps. These include handling of NAs, value formatting, column removal, etc. This file will only be output if the --intermediate_files argument is supplied. Note that this file is NOT intented for downstream anaylsis.

Test set files

All include the testSet_ prefix:

  • stats.txt: all statistics for the model testing.
  • ROC_curve.pdf.
  • confusionMatrix.txt: confusion matrix obtained after running the trained model on the test set.
  • summary.txt: summary statistics of the model testing, namely sensitivity, specificity and AUROC.

Classification file

The MLresult_classification.txt constitutes a modified classification file including the results of the ML and intra-priming filters. The following columns are added:

  • POS_MLprob and NEG_MLprob, described above (see probability threshold section).
  • ML_classifier column, in which ML_classifier == Positive corresponds to transcripts in which POS_MLprob > t (therefore flagged as Isoforms by the ML filter).
  • intra_priming column, in which intrapriming == TRUE corresponds to transcripts flagged as intra-priming atrifacts.
  • filter_result column, including the combined result for the ML and intra-priming filters. An isoform will be filter_result == Isoform if it passed both the ML and intra-priming filters.

In the specific case of mono-exonic transcripts, passing the intra-priming filter will be enough to be considered an isoform, as long as they are not removed using the -e argument. Similarly, all FSM transcripts will be flagged as true isoforms if the -f option is provided.

Filter inclusion list

Finally, inclusion-list.txt is single-column file including the IDs of the transcripts classified as true isoforms (i.e. flagged as Isoform in the filter_result column). This list will be used to provide filtered versions of all available SQANTI QC output files (GTF, GFF3, fasta, faa, etc.).