Damiano Piovesan edited this page Mar 7, 2024 · 32 revisions

CAFA-evaluator is a Python program designed to evaluate the performance of prediction methods on targets with hierarchical concept dependencies. It generalizes multi-label evaluation to modern ontologies where the prediction targets are drawn from a directed acyclic graph and achieves high efficiency by leveraging matrix computation and topological sorting.

The code replicates the Critical Assessment of protein Function Annotation (CAFA) benchmarking, which evaluates predictions of the consistent subgraphs in Gene Ontology. The package also contains a Jupyter Notebook to generate precision-recall and remaining uncertainty–misinformation curves. The CAFA-evaluator implementation was inspired by the original MATLAB code used in the CAFA2 assessment, available at https://github.com/yuxjiang/CAFA2.

The words aspect, namespace and sub-ontology are used interchangeably in the following documentation.

Workflow

workflow

Figure 1. CAFA-evaluator workflow.

Parsing

Ontology file - Only the OBO format is accepted. The following rules are applied:

  • Obsolete terms are always excluded.
  • Only "is_a" and "part_of" relationships are considered. You can modify this behaviour by calling the obo_parser function with the valid_rel argument.
  • Cross-aspect (cross-namespace) relationships are always discarded.
  • Alternative term identifiers are automatically mapped to canonical identifiers both in the prediction and ground truth inputs.
  • When information accretion is provided, terms which are not available in the accretion file are removed from the ontology.

Prediction folder - Prediction files inside the prediction folder are filtered considering only those targets included in the ground truth and only those terms included in the ontology file. If the ground truth contains only annotations from one aspect (e.g. "molecular function"), the evaluation is provided only for that aspect.

Internal representation and memory usage

  • The algorithm stores in memory a Numpy boolean N x M array (N = number of ground truth targets; M = number of ontology terms of a single aspect) for each aspect in the ground truth file.
  • An array of the same size (rows ≤ N), but containing floats (the prediction scores) instead of booleans, is stored for each prediction file. Prediction files are processed one by one and the matrix gets reassigned.
  • When running the code in parallel, note that the ground truth matrix is cloned in every thread. Be careful if memory is limited.
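The layout described above can be sketched as follows (all names and sizes are made up for the example; the real code uses NumPy arrays):

```python
# Illustrative layout of the internal matrices.
targets = ["P1", "P2"]            # N = 2 ground truth targets
terms = ["GO:A", "GO:B", "GO:C"]  # M = 3 terms of a single aspect

# Ground truth: boolean N x M matrix, True where a target has a term
gt = [[True, True, False],
      [True, False, True]]

# Predictions: float scores with the same column order (rows <= N)
pred = [[0.9, 0.7, 0.0],
        [0.8, 0.0, 0.3]]

# Back-of-the-envelope memory estimate: a NumPy boolean array uses
# one byte per element, so roughly N x M bytes per aspect
n, m = 100_000, 30_000
approx_bytes = n * m  # ~3 GB for one ground truth aspect of this size
```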

Propagation and topological sorting

Both the predictions and the ground truth annotations are always propagated up to the ontology root(s). The topologically sorted list of nodes makes it possible to optimize the propagation by scanning the prediction and ground truth matrices only once, following the indices provided by the sorting vector.

Two propagation strategies are available:

  • max - scores are always propagated keeping the maximum.
  • fill - prediction scores are propagated without overwriting the scores already assigned to the parents.

In some cases the propagation strategy can affect the final evaluation, and it is not possible to predict which one yields higher scores without knowing the data.

propagation

Figure 2. Example of different propagation strategies.
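The difference between the two strategies can be sketched on a tiny three-term chain (hypothetical scores; A is the root and C → B → A):

```python
# Toy DAG: child -> parents; the topological order guarantees every
# node is processed before its parents receive its score.
parents = {"C": ["B"], "B": ["A"], "A": []}
topo = ["C", "B", "A"]  # children before parents

def propagate(scores, strategy):
    out = dict(scores)
    for child in topo:
        if child not in out:
            continue
        for parent in parents[child]:
            if strategy == "max":
                # always keep the maximum score seen so far
                out[parent] = max(out.get(parent, 0.0), out[child])
            elif strategy == "fill" and parent not in out:
                # only fill parents that have no score yet
                out[parent] = out[child]
    return out

scores = {"A": 0.2, "C": 0.9}  # B has no direct prediction
# max:  A is overwritten  -> {"A": 0.9, "B": 0.9, "C": 0.9}
# fill: A keeps its score -> {"A": 0.2, "B": 0.9, "C": 0.9}
```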

Evaluation

Confusion matrix

After the ground truth and prediction matrices are propagated, the intersection and the difference between the predicted and the ground truth sub-graphs are calculated for each target. It is therefore possible to build a sort of confusion matrix in which the true positives (TP) correspond to the intersection, the false positives (FP) are ontology nodes that are predicted but absent from the ground truth, and the false negatives (FN) are those in the ground truth but not predicted.

Note that TN are not considered: the ontology graph is usually huge compared to the set of terms associated with a target, so evaluation metrics based on TN are not meaningful in this context.
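A minimal per-target sketch, using sets in place of the boolean matrix rows (term names are hypothetical):

```python
def confusion(pred_scores, gt_terms, tau):
    """TP/FP/FN for one target at threshold tau; TN is ignored."""
    predicted = {t for t, s in pred_scores.items() if s >= tau}
    tp = len(predicted & gt_terms)  # intersection of the two sub-graphs
    fp = len(predicted - gt_terms)  # predicted but not annotated
    fn = len(gt_terms - predicted)  # annotated but not predicted
    return tp, fp, fn

pred = {"GO:A": 0.9, "GO:B": 0.7, "GO:C": 0.4}
gt = {"GO:A", "GO:C"}
# at tau = 0.5: A is a TP, B a FP, C a FN -> (1, 1, 1)
```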

Normalization

The CAFA-evaluator package offers different normalization strategies for evaluation at the dataset level.

In CAFA (Genome Biology, 2016), normalization is performed in a peculiar way: the precision is calculated for each target independently (macro-average) and then normalized over the number of predicted targets at a given score, while the recall is normalized by the number of targets in the ground truth, as described in the formulas below.

pr(τ) = (1/m(τ)) · Σ_i TP_i(τ) / (TP_i(τ) + FP_i(τ)), summed over the m(τ) targets with at least one prediction at score ≥ τ

rc(τ) = (1/n) · Σ_i TP_i(τ) / (TP_i(τ) + FN_i(τ)), summed over the n targets in the ground truth

In the example below, target no. 2 is in the ground truth but it is not predicted. In the cafa and gt normalization modes the recall is normalized by the number of targets in the ground truth, and the method is therefore penalized, while the pred normalization considers only targets for which there is a prediction at a given score.

propagation

Figure 3. Example of a prediction of two targets (proteins).

The current implementation of the software also provides micro-average evaluation. The micro-average is calculated by normalizing the confusion matrices of each target by the number of targets. As for the macro-average, it is possible to normalize by the number of ground truth or predicted targets by setting the norm parameter.
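The effect of the normalization modes can be sketched on a two-target example like the one in Figure 3 (all counts are invented for the illustration):

```python
# Target 1 has predictions at this threshold; target 2 is in the
# ground truth but receives no prediction at all.
per_target = [{"tp": 2, "fp": 1, "fn": 1}]  # only predicted targets
n_gt = 2    # targets in the ground truth
n_pred = 1  # targets with at least one prediction at this score

precision_sum = sum(t["tp"] / (t["tp"] + t["fp"]) for t in per_target)
recall_sum = sum(t["tp"] / (t["tp"] + t["fn"]) for t in per_target)

precision = precision_sum / n_pred  # always over predicted targets
recall_gt = recall_sum / n_gt       # "cafa"/"gt": penalizes target 2
recall_pred = recall_sum / n_pred   # "pred": ignores unpredicted targets
# recall_gt = 1/3, recall_pred = 2/3
```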

Information accretion and weighted evaluation

When the information accretion (IA) file is provided, CAFA-evaluator also computes weighted versions of the precision, recall, misinformation and remaining uncertainty measures. This is achieved by weighting the nodes of the graph by their information accretion before calculating the confusion matrix. For example, for a given target the TP is no longer simply the number of nodes in the intersection of the prediction and ground truth sub-graphs, but the sum of the information accretion values of the intersecting nodes. Misinformation and remaining uncertainty are always normalized (micro-average) over all targets in the ground truth, as in the formulas below.

mi(τ) = (1/n) · Σ_i Σ_{t ∈ P_i(τ) \ T_i} ia(t)

ru(τ) = (1/n) · Σ_i Σ_{t ∈ T_i \ P_i(τ)} ia(t)

where T_i is the ground truth sub-graph of target i, P_i(τ) its predicted sub-graph at threshold τ, and ia(t) the information accretion of term t.
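The weighted counts and the micro-averaged ru/mi measures can be sketched as follows (term names and ia values are invented):

```python
ia = {"GO:A": 1.0, "GO:B": 2.5, "GO:C": 0.5}  # information accretion

def weighted_ru_mi(per_target, n_gt):
    # per_target: (predicted_terms, ground_truth_terms) pairs;
    # both measures are normalized by the ground truth size (micro-average)
    ru = sum(sum(ia[t] for t in gt - pred) for pred, gt in per_target) / n_gt
    mi = sum(sum(ia[t] for t in pred - gt) for pred, gt in per_target) / n_gt
    return ru, mi

data = [({"GO:A", "GO:B"}, {"GO:A", "GO:C"}),  # misses C, over-predicts B
        (set(), {"GO:A"})]                     # no prediction at all
ru, mi = weighted_ru_mi(data, n_gt=2)
# ru = (0.5 + 1.0) / 2 = 0.75, mi = 2.5 / 2 = 1.25
```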

Information accretion can be calculated as described in Clark and Radivojac, Bioinformatics, 2013. A nice explanation is also provided in a Kaggle discussion thread.

You can find an implementation of the Information Accretion calculation used for Gene Ontology prediction in CAFA5 (Kaggle) in the InformationAccretion repo.

Examples

Propagation

The example below shows the effect of the two propagation strategies on the two examples of Figure 2: max propagation on the left, and fill on the right. Notably, in this specific case the fill propagation gives a higher F-max.

propagation_plot

Figure 4. Precision recall curves of two predictions (see Figure 2) expanded with different propagation strategies.

Normalization and roots

The example below shows the effect on the precision recall curves and max F-score of the three different normalization strategies. The prediction and ground truth contain exactly the same graphs as Figure 3. On the right, the same plot shows the effect on the precision recall curves of removing the roots from the ontology (and therefore excluding them from both the ground truth and the predictions).

normalization_plot

Figure 5. Precision recall curves of the same prediction (see Figure 3) normalized with different strategies.

Number of thresholds (tau)

The example below shows the effect of changing the threshold step (th_step) parameter of the evaluation. The evaluation has been performed on predictions of about 1,000 targets from previous CAFA challenges.

threshold_plot

Figure 6. Precision recall curves of the same prediction using different precision (number of thresholds) for the evaluation.
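Assuming th_step simply defines the grid of evaluation thresholds in (0, 1], the trade-off is easy to see: a smaller step gives smoother curves but proportionally more confusion-matrix evaluations. A hypothetical reconstruction of the grid:

```python
# Hypothetical reconstruction of the threshold grid implied by th_step
th_step = 0.01  # the CAFA5 Kaggle evaluation uses 0.001
n_steps = int(round(1 / th_step))
thresholds = [round(i * th_step, 10) for i in range(1, n_steps + 1)]
# 100 thresholds: 0.01, 0.02, ..., 1.0; each threshold requires
# evaluating the confusion matrices once
```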

Monotonic curves

The example below shows the effect of applying the cumulate flag in the plot.ipynb Jupyter Notebook used to generate the precision-recall curves. This option is necessary to generate curves identical to those of the official CAFA evaluation. Data for this example are taken from the evaluation of prediction methods trained to predict the subset of Gene Ontology terms available in the DisProt database.

monotonic_plot

Figure 7. Precision recall curves of the same prediction excluding or including a cumulative operation to obtain monotonic curves.

Critical Assessment of protein Function Annotation (CAFA)

Previous CAFA challenges

| | Ground truth | Ontology | Sequences | Target lists |
| --- | --- | --- | --- | --- |
| CAFA 1 | cafa1_gt.tsv | N/A | cafa1_gt.fasta | N/A |
| CAFA 2 | cafa2_gt.tsv | go_20130615.obo | cafa2_gt.fasta | cafa2_type1.txt, cafa2_type2.txt, cafa2_typex.txt |
| CAFA 3 | cafa3_gt.tsv | go_20160601.obo° | cafa3_gt.fasta | cafa3_type1.txt, cafa3_type2.txt, cafa3_typex.txt |

In order to replicate CAFA results, you can simply adapt the input files listed in the table above.

  • No-knowledge and Partial-knowledge benchmarks can be reproduced by filtering the Ground truth file based on the Target lists files.
  • In order to exclude specific terms from the analyses, you can directly modify the input Ontology file; for example, you can remove the "binding" term and its children.

° The version of the provided ontology is the most likely one, but not necessarily the same as that used in the official evaluation.

CAFA5 / Kaggle

Owing to its reliability and accuracy, the organizers have selected CAFA-evaluator as the official evaluation software in the CAFA5 Kaggle competition. In Kaggle the software is executed with the following command:

cafaeval go-basic.obo prediction_dir test_terms.tsv -ia IA.txt -prop fill -norm cafa -th_step 0.001 -max_terms 500

In the example above, the prediction file of a method must be inside the prediction_dir folder and is evaluated against the test_terms.tsv file (not available to participants), which contains the ground truth.