-
Notifications
You must be signed in to change notification settings - Fork 51
Running SQANTI3 filter
I've made a lightweight filtering script based on SQANTI3 output that filters for two things: (a) intra-priming and (b) short read junction support.
The script usage is:
usage: sqanti3_RulesFilter.py [-h] [--sam SAM] [--faa FAA] [-a INTRAPRIMING]
[-r RUNALENGTH] [-m MAX_DIST_TO_KNOWN_END]
[-c MIN_COV] [--filter_mono_exonic] [--skipGTF]
[--skipFaFq] [--skipJunction] [-v]
sqanti_class isoforms gtf_file
sqanti3_RulesFilter.py: error: the following arguments are required: sqanti_class, isoforms, gtf_file
python sqanti3_RulesFilter.py [classification] [fasta] [sam] [gtf]
[-a INTRAPRIMING] [-c MIN_COV] [-m MAX_DIST_TO_KNOWN_END]
where -a
determines the fraction of genomic 'A's above which the isoform will be filtered. The default is -a 0.6
.
-r
is another option for looking at genomic 'A's that looks at the immediate run-A length. The default is -r 6
.
-m
sets the maximum distance to an annotated 3' end (the diff_to_gene_TTS
field in classification output) to offset the intrapriming rule.
-c
is the filter for the minimum short read junction support (looking at the min_cov
field in _classification.txt
), and can only be used if you have short read data.
For example:
python sqanti3_RulesFilter.py test_classification.txt \
test.renamed_corrected.fasta \
test.gtf
The current filtering rules are as follow:
- If a transcript is FSM, then it is kept unless the 3' end is unreliable (intrapriming).
- If a transcript is not FSM, then it is kept only if all of below are true:
- (1) 3' end is reliable.
- (2) does not have a junction that is labeled as RTSwitching.
- (3) all junctions are either canonical or has short read coverage above
-c
threshold.
SQANTI3 incorporates an automated filter based on a random forest classifier, which is designed to discriminate potential artifacts from true isoforms without the need for user-defined rules or manually-set thresholds.
Briefly, this classifier learns high and low quality attributes from a True
Positive (TP) and True Negative (TN) transcript set, building a model that
discriminates artifacts and isoforms based on TN and TP properties.
The filter returns a modified version of the *_classification.txt
file (named *_MLresult_classification.txt
). The new table will include a filter_result
column with the label Isoform
or Artifact
, which will be the result of the
random forest classifier.
Disclaimer: we hereby provide instructions to run the filter and a comprehensive description of its parameters. While we have included recommendations on how to best tune them to obtain optimal results, users are encouraged to try several configurations of the filter to find the best way to run it for their particular dataset.
Similarly to the rules filter, the machine learning filter (ML filter) can be
run using the sqanti3_filter.py
script by providing the ML
argument and
the path to the SQANTI *_classification.txt
file (output by sqanti3_qc.py
):
python sqanti3_filter.py ML path/to/classification.txt
The ML filter also accepts the following parameters:
usage sqanti3_filter.py ML [-h] [--isoforms ISOFORMS] [--gtf GTF] [--sam SAM] [--faa FAA] [-o OUTPUT] [-d DIR]
[-v] [--report] [-t PERCENT_TRAINING] [-p TP] [-n TN] [-j THRESHOLD] [-f FORCE_FSM_IN]
[--intermediate_files INTERMEDIATE_FILES] [-r REMOVE_COLUMNS] [-z MAX_CLASS_SIZE]
[-i INTRAPRIMING] [-e]
sqanti_class
positional arguments:
sqanti_class SQANTI3 QC classification file.
optional arguments:
-h, --help show this help message and exit
--isoforms ISOFORMS fasta/fastq isoform file to be filtered
--gtf GTF GTF file to be filtered
--sam SAM SAM alignment of the input fasta/fastq
--faa FAA ORF prediction faa file to be filtered by SQANTI3
-o, --output OUTPUT
Prefix for output files.
-d, --dir DIR Directory for output files. Default: Directory where the script was run.
-v, --version Display program version number.
--report Create filter report
-t, --percent_training PERCENT_TRAINING
Proportion of the data that goes to training (parameter p of the function
createDataPartition). Default: 0.8
-p, --TP TP Path to file containing the list of the TP transcripts, one ID by line, no header
(optional). If not supplied, it will be generated from input data.
-n, --TN TN Path to file containing the list of the TN transcripts, one ID by line, no header
(optional). If not supplied, it will be generated from input data.
-j, --threshold THRESHOLD
Machine Learning probability threshold to classify transcripts as positive isoforms.
-f, --force_fsm_in FORCE_FSM_IN
If activated, forces retaining only multi-exon transcripts, all mono-exon isoforms will be
automatically removed.
--intermediate_files INTERMEDIATE_FILES
If activated, outputs ML filter intermediate files.
-r, --remove_columns REMOVE_COLUMNS
Path to single-column file (no header) containing the names of the columns in SQ3's
classification.txt file that are to be excluded during random forest training (optional).
-z, --max_class_size MAX_CLASS_SIZE
Maximum number of isoforms to include in True Positive and True Negative sets. TP and TN
sets will be downsized to this value if they are larger.
-i, --intrapriming INTRAPRIMING
Adenine percentage at genomic 3' end to flag an isoform as intra-priming (default: 60)
-e, --filter_mono_exonic
Filter out all mono-exonic transcripts (default: OFF)
We next provide a detailed description of each of the ML filter parameters in order to help users understand how to best apply it to their own data.
As stated above, the random forest classifier model needs to be trained on a subset of isoforms from the transcriptome. This will include a set of True Positive isoforms -that is, isoforms that are reliable enough for the classifier to learn the properties of a real isoform- and a set of True Negative isoforms -that is, isoforms that can be considered low-quality or false, from which the classifier will learn what constitutes an artifact or false-positive isoform-.
Although the ML filter includes built-in selection of training data from specific
SQANTI categories, it also accepts user-defined TN and TP sets (--TN [-n]
and --TP [-p]
parameters, respectively). These should be provided as
single-column files (with no header) containing the IDs of the selected isoforms,
and will be used to train and test the random forest classifier.
If not provided, the SQANTI ML filter will use the following as built-in TP and TN sets by default:
- TP: all Reference Match (RM) isoforms in the transcriptome (multi-exon only). RM are an FSM subcategory in which the TSS and TTS of the isoform are within +/-50bp of the TSS/TTS of the reference associated transcript (see category description page for details). If there are less than 40 RM isoforms, all FSMs will be taken as TP.
- TN: all Novel Not in Catalog (NNC) isoforms that have at least one non-canonical junction (multi-exo only). If there are less than 40 isoforms in this group (normally because there are not enough non-canonical junctions in the transcriptome), all NNC isoforms will be taken as TN.
Note that both user-defined and built-in training data will be subset if the number of isoforms in the TP or TN isoform sets is larger than the other. By default, SQANTI3 will downsize the larger set to match the smaller of the two sizes. Downsizing is performed by random sampling. When using the built-in option, and given this random sampling step, TP and TN lists will be written out as part of the SQANTI ML filter output.
Users may change the --max_class_size [-z]
parameter to set a maximum
number of isoforms to include in the TP and TN sets. By default, TP and TN lists
will be downsized to 3000 transcripts if they are larger that this,
no matter if they were user-defined or generated internally.
TP and TN isoform sets need to be split into training and test data to correctly
generate the random forest model. Using the --percent_training
or -t
parameter,
users can specify the proportion of the data that is used for training. By default,
80% of the isoforms will be used for training (which is equivalent to supplying
-t 0.8
). The remaining 20% will be used to test the model.
The ML filter uses functions from the caret
R package to build and test the random
forest classifier. Briefly, the model training/test includes the following internal
steps:
- Data partitioning via the
createDataPartition()
function. - Cross-validation (10x) using the
trainControl()
function. - Model training using the
train()
function. - Testing of the model using the
predict()
function from thestats
package in base R. - Output model statistics using several other functions from the
caret
package (see the ML filter output section for details).
Warning: after training, the model will be stored as an .RData
object in
the specified output folder. If the filter is re-run on the same directory, SQANTI
will detect that a model already exists, skipping the training step and
applying the extant model to perform the Isoform/Artifact classification on the input data.
This is useful on occassions where a user wants to apply an already-generated
model to a different transcriptome; however, if you wish to train a new model, you will need to delete/move/rename the previous model object.
Ultimately, flagging transcripts as Isoforms or Artifacts is based on the
random forest probability to correctly classify a transcript into either
group. These probabilities are included in the POS_MLprob
and NEG_MLprob
columns of the output _MLresult_classification.txt
file. Intuitively:
-
POS_MLprob
is the probability to classify the transcript as an Isoform. -
NEG_MLprob
is the probability to classify the transcript as an Artifact, and is equivalent to1 - POS_MLprob
.
The SQANTI ML filter will assign the Artifact and Isoform labels based on a
probability threshold parameter supplied via the --threshold
or -j
argument. By default, the filter is more stringent on the Isoform condition than
it is on the Artifact condition, that is: by default, transcripts will be considered
isoforms when POS_MLprob
is larger than 0.7.
This results in some artifacts having POS_MLprob > 0.5
(which can be
counter-intuitive). Of course, in this scenario the filter will exclude more
transcripts than when using the usual POS_MLprob > NEG_MLprob
approach.
However, users may reproduce this leninent filter by lowering the threshold to 0.5.
Due to their lack of importance for artifact definition (or to prevent them from having unwanted effects on the filtering), the following columns are removed from the classification table prior to running the SQANTI ML filter:
- Chromosome and strand info (
chrom
,strand
) - Reference gene/transcript info (
associated_gene
,associated_transcript
) - Structural information about the reference or the transcript (
ref_length
,ref_exons
,ORF_length
,CDS_length
,CDS_start
,CDS_end
,CDS_genomic_start
,CDS_genomic_end
). - Redundant junction information (
all_canonical
). - Sequence information (
seq_A_downstream_TTS
,ORF_seq
). - SQANTI category (
structural_category
,subcategory
).
The SQANTI ML filter incorporates the option to exclude columns from the
_classification.txt file from model training to prevent filtering based on
them. These should be provided as a single-column file including the column names
exactly as they are named in the classification file. A path to this file should
then be supplied via the --remove_columns
or -r
argument.
This is particularly useful to prevent overfitting during model training. In particular, there may be situations in which users wish to define TP and TN sets based on the values of specific SQ3 columns. We recommend that these are, in turn, excluded from the ML filter to prevent them from gaining too much importance during Isoform/Artifact classification.
Warning: we always encourage testing several ML filter configurations and TP/TN sets before making a final decision on which are the artifacts in your transcriptome.
The ML filter script in SQANTI3 includes a simple threshold-based filter to prevent
potential intra-priming artifacts from remaining in the transcriptome, which is
not specifically considered by the ML filter. Still, note that the perc_A_downstream_TTS
column, which is used to flag transcripts as potentially resulting from intra-priming,
is considered during random forest training.
To apply it, users may supply the maximum percentage of adenines ("A" nucleotides)
that is to be allowed in the genome sequence region immediately following the end of the
transcript (i.e. the TTS). This can be provided via the --intrapriming
or -i
argument, resulting in perc_A_downstream_TTS > i
isoforms being flagged as
intra-priming artifacts. By default, transcripts containing more than 60% A's at the genomic
3' end will be flagged.
-
Full-splice match (FSM) transcripts can be excluded from the ML filter and therefore retained by providing the
--force_fsm_in
or-f
argument. By default, FSM will be run through the filter along with the isoforms from the rest of the structural categories. -
All mono-exonic transcripts can be automatically removed by providing the
--filter_mono_exonic
or-e
parameter. Note that mono-exon transcripts will not be evaluated by the ML filter (only by the intra-priming filter).
The random forest classifier cannot/will not be run in any of the following scenarios:
- All transcripts in the long read-defined transcriptome are classified as mono-exon. The ML filter can only be applied to multi-exon transcripts.
- One (or both) of the user-defined TP and TN sets have less than 40 isoforms.
- One of the default TP and TN categories have less than 40 isoforms. This applies when no user-defined TP and TN sets are provided.
However, even in scenarios where no ML-based filtering can be applied, the intra-primming filter will still be applied to all input transcripts.
The SQANTI ML filter output files are written to the path specified using the
--dir
or -d
argument, also appending the prefix provided via the --output
or -o
argument.
The following output files are generated after running the filter:
-
params.txt
: two-column file including the name of the argument and the value that was used to run the filter. -
TP_list.txt
andTN_list.txt
: a single-column text file including the IDs of the isoforms used as TP and TN (generated only when no user-defined sets were provided).
The supplied prefix will be appended to the filenames above.
-
randomforest.RData
: R object containing the trained random forest model. -
classifier_variagble-importance_table.txt
: two-column text file including the names of the variables that were used for classification and a numeric value indicating their importance in the random forest classifier. -
intermediate_*_MLinput_table.txt
: training-ready classification table. SQANTI ML filter performs a series of modifications of the classification table to enable running of the training/test steps. These include handling ofNAs
, value formatting, column removal, etc. This file will only be output if the--intermediate_files
argument is supplied. Note that this file is NOT intented for downstream anaylsis.
All include the testSet_
prefix:
-
stats.txt
: all statistics for the model testing. -
ROC_curve.pdf
. -
confusionMatrix.txt
: confusion matrix obtained after running the trained model on the test set. -
summary.txt
: summary statistics of the model testing, namely sensitivity, specificity and AUROC.
The MLresult_classification.txt
constitutes a modified classification file
including the results of the ML and intra-priming filters. The following
columns are added:
-
POS_MLprob
andNEG_MLprob
, described above (see probability threshold section). -
ML_classifier
column, in whichML_classifier == Positive
corresponds to transcripts in whichPOS_MLprob > t
(therefore flagged as Isoforms by the ML filter). -
intra_priming
column, in whichintrapriming == TRUE
corresponds to transcripts flagged as intra-priming atrifacts. -
filter_result
column, including the combined result for the ML and intra-priming filters. An isoform will befilter_result == Isoform
if it passed both the ML and intra-priming filters.
In the specific case of mono-exonic transcripts, passing the intra-priming filter
will be enough to be considered an isoform, as long as they are not removed using
the -e
argument. Similarly, all FSM transcripts will be flagged as true isoforms if
the -f
option is provided.
Finally, inclusion-list.txt
is single-column file including the IDs of the transcripts
classified as true isoforms (i.e. flagged as Isoform
in the filter_result
column).
This list will be used to provide filtered versions of supplied SQANTI QC output files.
Use the --isoforms
, --gtf
, --faa
and --sam
arguments to provide each of the files
you wish to obtain filtered.
Wiki index
- Introduction to SQANTI3
- Dependencies and installation
- Version history
- Isoform classification: categories and subcategories
- Running SQANTI3 quality control
- Understanding the output of SQANTI3 QC
- IsoAnnotLite
- Running SQANTI3 filter
- Running SQANTI3 rescue
- Tutorial: running SQANTI3 on an example dataset
- Running SQANTI-reads
- Memory requirements to use parallelization