Random Forest, GenePattern - Mesirov Lab, UCSD

genepattern/RandomForest

Random Forest

Omar Halawa (ohalawa@ucsd.edu) of the GenePattern Team @ Mesirov Lab - UCSD


This repository is a GenePattern module written in Python 3, using the following Docker image.

It performs random forest classification, a machine learning method that builds an ensemble of decision trees, in one of two modes: cross-validation (takes one dataset as input) or test-train prediction (takes two datasets, test and train). Each dataset consists of two input files, one for feature data (.gct) and one for target data (.cls). The module processes these files and performs classification via Scikit-learn's RandomForestClassifier, generating a prediction results file (.pred.odf) that compares the "true" class to the model's prediction and, in the case of test-train prediction, a feature importance file (.feat.odf). The module also supports importing and exporting trained models. It was created for GenePattern module usage and exposes classifier parameters as optional arguments.
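The two modes described above can be illustrated with a minimal Scikit-learn sketch. This is not the module's own code: it loads Scikit-learn's built-in iris data instead of parsing .gct/.cls files, and the 5-fold split, the 70/30 train/test split, and every variable name are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, train_test_split

# Stand-in data: rows are samples, columns are features.
# (The module reads .gct/.cls inputs instead; that parsing is not shown here.)
X, y = load_iris(return_X_y=True)

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Mode 1: cross-validation on a single dataset.
# Every sample gets a prediction from a model that never saw it in training.
# The fold count (5) is an assumption, not the module's documented default.
cv_pred = cross_val_predict(clf, X, y, cv=5)
cv_accuracy = float(np.mean(cv_pred == y))

# Mode 2: test-train prediction with two separate datasets.
# Here one dataset is split in two purely to keep the sketch self-contained.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
clf.fit(X_train, y_train)
test_pred = clf.predict(X_test)

# Pair each "true" class with the model's prediction, similar in spirit
# to what a prediction results file would contain.
comparison = list(zip(y_test.tolist(), test_pred.tolist()))

# Per-feature importances are available once a model has been fitted.
importances = clf.feature_importances_

print(f"cross-validation accuracy: {cv_accuracy:.3f}")
print("feature importances:", np.round(importances, 3))
```

In this sketch the feature importances come from the single fitted model in the test-train mode, whereas cross-validation fits a separate model per fold.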

Documentation on usage and implementation is found here. A detailed step-by-step explanation of how the Random Forest algorithm works is found here. All source files are available for better reproducibility and portability, including cross-validation runs for the all_aml_train (.gct, .cls), BRCA_HUGO (.gct, .cls), and iris (.gct, .cls) datasets, as well as a test-train run with all_aml_test (.gct, .cls) and all_aml_train (.gct, .cls), all with output examples (only "examples," since the classifier uses randomness and each run varies). To see how this randomness can be "reproduced," read this.
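As a rough illustration of how run-to-run randomness can be pinned down, Scikit-learn's RandomForestClassifier accepts a `random_state` seed. The sketch below assumes the built-in iris data and a hypothetical helper `fit_importances`; how the module itself exposes a seed is not shown here.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)


def fit_importances(seed):
    """Fit a forest with a fixed seed and return its feature importances.

    Hypothetical helper for illustration; not part of the module.
    """
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X, y)
    return model.feature_importances_


# Same seed twice: bootstrap samples and split choices repeat exactly.
same_a = fit_importances(42)
same_b = fit_importances(42)

# A different seed generally yields a (slightly) different forest.
different = fit_importances(7)

print("seed 42, run 1:", same_a)
print("seed 42, run 2:", same_b)
print("seed 7:        ", different)
```

Leaving `random_state` unset (its default is `None`) draws fresh randomness on each run, which is why unseeded outputs vary.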

Also see the GPU-backed CuPy-based implementation of this module, RandomForest.GPU, for potentially faster jobs.