- A bacterial genome-wide association approach that uses machine learning practices.
conda create -n BoBaFor_env python==3.11.10
conda activate BoBaFor_env
pip install BoBaFor
git clone https://github.com/PaulDanPhillips/BoBaFor.git
cd BoBaFor/examples
python AddAbsolutePaths.py
Run the example data. Set --cores to the number of cores you have available and want to use:
BoBaFor --config example_config.yaml --cores 4
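If you are unsure how many cores your machine has, a quick check like the one below may help you pick a value for `--cores`. This is only a sketch: `suggest_cores` and the choice to leave one core free are illustrative assumptions, not part of BoBaFor.

```python
import os


def suggest_cores(reserve: int = 1) -> int:
    """Return a core count that leaves `reserve` cores free (never below 1)."""
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, available - reserve)


if __name__ == "__main__":
    # Print the example command from this README with the suggested core count.
    print(f"BoBaFor --config example_config.yaml --cores {suggest_cores()}")
```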
- The data needs to be split into two files: (1) a matrix containing all of the genetic features and (2) the response phenotype vector. Ensure that the indexes of the feature matrix align with those of the response vector.
- Edit the Sim1.yaml file from the examples directory.
- Predictor
- The absolute or relative path to your genetic feature matrix will take the place of """balanced_features/Strongest_Sim1.txt"""
- /path/to/data/features.txt
- Response
- The absolute or relative path to your response phenotype vector will take the place of """balanced_responses/Sim1.txt"""
- /path/to/data/response.txt
- chi2Correct
- Whether the chi2 prefilter step should be corrected via false discovery rate (FDR). Only True or False is accepted.
- True/False
- FDRchi2Thresh
- The FDR chi2 threshold used to remove genetic features based on the chi2 test.
- 0.-1. (float)
- PreFilterFileName
- The name you want for the output files from just the chi2-prefilter step.
- Prefilter_Experiment_Name
- sim
- The experiment name, used as a prefix in file names.
- Simulation1
- GridParamDepth
- The number of options to keep for EVERY hyperparameter after RandomSearch optimization, leading into the much more thorough and computationally demanding GridSearch.
- 1-5 (int)
- ScoreMetric
- The metric used to score the model. Consider the data type of your response variable and how balanced the data is.
- balanced_accuracy
- RandSearch1_Niter
- The number of RandomSearch iterations to perform prior to boruta selection (the number of hyperparameter combinations to test).
- (int)
- GridSearch1_CV
- The number of GridSearch iterations to perform prior to boruta selection (The number of times to create kfold splits and test EVERY possible hyperparameter combination)
- (int)
- boruta_perc
- The percent strength of the boruta shadow features against which the real features are compared.
- 0-100 (int)
- boruta_pval
- The significance level at which corrected p-values are rejected in both correction steps.
- 0.-1. (float)
- borutaSelect_perc
- The percent threshold used to select final features (the number of times a feature was selected divided by the number of iterations boruta was run).
- collect
- A boolean indicating whether to keep all selected features regardless of the borutaSelect_perc setting.
- True/False
- boruta_reps
- The number of iterations to run boruta selection.
- model
- Whether to run a random forest or extreme gradient boosted model
- RF/XGB
- RandSearch2_Niter
- The number of RandomSearch iterations to perform after boruta selection (the number of hyperparameter combinations to test).
- (int)
- RandSearch2_CV
- TO BE REMOVED
- The number of iterations to create kfolds within a single RandSearch2_Niter.
- (int)
- GridSearch2_CV
- The number of GridSearch iterations to perform after boruta selection (The number of times to create kfold splits and test EVERY possible hyperparameter combination)
- (int)
- RankingFeatureIters
- The number of iterations to perform feature ranking via Feature Importance or Permutation Importance
- (int)
- PermImpInteralIter
- The number of internal repetitions for Permutation Importance
- (int)
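The value ranges listed for the parameters above can be checked before launching a run. The sketch below validates an already-parsed config dictionary against those ranges. `validate_config` is a hypothetical helper, not part of BoBaFor; it assumes the YAML file has been loaded into a flat dict keyed by the parameter names above, and the requirement that iteration counts be positive is also an assumption.

```python
def validate_config(cfg: dict) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []

    def check(key, ok, expected):
        # Only validate keys that are present; report values outside the documented range.
        if key in cfg and not ok(cfg[key]):
            problems.append(f"{key}={cfg[key]!r}: expected {expected}")

    def is_int(v):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(v, int) and not isinstance(v, bool)

    def is_unit_float(v):
        return isinstance(v, (int, float)) and not isinstance(v, bool) and 0.0 <= v <= 1.0

    check("chi2Correct", lambda v: isinstance(v, bool), "True or False")
    check("FDRchi2Thresh", is_unit_float, "a float between 0 and 1")
    check("GridParamDepth", lambda v: is_int(v) and 1 <= v <= 5, "an int from 1 to 5")
    check("boruta_perc", lambda v: is_int(v) and 0 <= v <= 100, "an int from 0 to 100")
    check("boruta_pval", is_unit_float, "a float between 0 and 1")
    check("collect", lambda v: isinstance(v, bool), "True or False")
    check("model", lambda v: v in ("RF", "XGB"), "RF or XGB")
    # Assumption: every iteration/CV count should be a positive integer.
    for key in ("RandSearch1_Niter", "GridSearch1_CV", "RandSearch2_Niter",
                "GridSearch2_CV", "RankingFeatureIters", "PermImpInteralIter"):
        check(key, lambda v: is_int(v) and v > 0, "a positive int")
    return problems


example = {"chi2Correct": True, "FDRchi2Thresh": 0.05, "GridParamDepth": 7, "model": "RF"}
for problem in validate_config(example):
    print(problem)  # prints: GridParamDepth=7: expected an int from 1 to 5
```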
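Because the feature matrix and the response vector must share the same index, it can be worth confirming the alignment before a run. The following is a minimal sketch using only the standard library. It assumes both files are tab-delimited with a header row and sample IDs in the first column; that layout is an assumption, so adjust the delimiter and header handling to match your files. `read_index` and `check_alignment` are hypothetical helpers, not part of BoBaFor.

```python
import csv


def read_index(path, delimiter="\t", has_header=True):
    """Return the first-column values (assumed to be sample IDs) in file order."""
    with open(path, newline="") as handle:
        rows = [row for row in csv.reader(handle, delimiter=delimiter) if row]
    if has_header:
        rows = rows[1:]  # drop the header row
    return [row[0] for row in rows]


def check_alignment(feature_ids, response_ids):
    """Raise ValueError if the two ID lists differ in length, content, or order."""
    if len(feature_ids) != len(response_ids):
        raise ValueError(
            f"{len(feature_ids)} feature rows but {len(response_ids)} response rows"
        )
    for position, (f_id, r_id) in enumerate(zip(feature_ids, response_ids)):
        if f_id != r_id:
            raise ValueError(
                f"row {position}: feature ID {f_id!r} != response ID {r_id!r}"
            )
```

A typical call would be `check_alignment(read_index("/path/to/data/features.txt"), read_index("/path/to/data/response.txt"))`, which returns silently when the two indexes match and raises otherwise.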
The most important output is BorutaOUT_1.00.05_RF/FinalFeatures_RF/FeatureImportance_Ranked_*.out, as it contains the genetic variants and their associated feature importance values.
- BorutaOUT_1.00.05_RF/PreFilterOut/PreFilterPredictor_Sim1.txt
- The features that pass the chi2 prefilter set in the config.yaml file
- BorutaOUT_1.00.05_RF/TuningOUT_RF/GridParamSpace1_Sim1.pickle
- BorutaOUT_1.00.05_RF/TuningOUT_RF/GridParamSpace2_Sim1.pickle
- BorutaOUT_1.00.05_RF/TuningOUT_RF/gridSearch1_Sim1.txt
- BorutaOUT_1.00.05_RF/TuningOUT_RF/gridSearch2_Sim1.txt
- BorutaOUT_1.00.05_RF/TuningOUT_RF/GridSearchFIG1_Sim1.png
- BorutaOUT_1.00.05_RF/TuningOUT_RF/GridSearchFIG2_Sim1.png
- BorutaOUT_1.00.05_RF/TuningOUT_RF/Optimal_Tuning1_Sim1.pickle
- BorutaOUT_1.00.05_RF/TuningOUT_RF/Optimal_Tuning2_Sim1.pickle
- The optimized model used to rank the genetic features. If desired, it can be used with the explainable-AI package shap to interpret the model.
- BorutaOUT_1.00.05_RF/TuningOUT_RF/randSearch1_Sim1.txt
- BorutaOUT_1.00.05_RF/TuningOUT_RF/randSearch2_Sim1.txt
- BorutaOUT_1.00.05_RF/BorutaOUT/RF_Boruta_Selected_Sim1.txt
- BorutaOUT_1.00.05_RF/BorutaOUT/RF_Boruta_Tentative_Sim1.txt
- BorutaOUT_1.00.05_RF/BorutaOUT/RF_FinalFeatureMatrix_BorutaSelected_Sim1.txt
- BorutaOUT_1.00.05_RF/FinalFeatures_RF/FeatureImportance_Ranked_Sim1.out
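The pickled optimized model listed above can be reloaded for further inspection. The sketch below loads a pickle with the standard `pickle` module and, optionally, passes the result to SHAP's `TreeExplainer`. It assumes the pickle holds a fitted tree-based model (as the description of the Optimal_Tuning output suggests) and that `shap` is installed separately; `load_model` and `explain` are hypothetical helper names, not part of BoBaFor. Only unpickle files you produced yourself, since `pickle` can execute arbitrary code on load.

```python
import pickle


def load_model(path):
    """Unpickle and return the object stored at `path`."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def explain(model, X):
    """Compute SHAP values for feature matrix X, assuming a tree-based model."""
    import shap  # imported lazily so load_model works without shap installed

    explainer = shap.TreeExplainer(model)
    return explainer.shap_values(X)
```

For example, `model = load_model("BorutaOUT_1.00.05_RF/TuningOUT_RF/Optimal_Tuning2_Sim1.pickle")` followed by `explain(model, X)`, where `X` is the Boruta-selected feature matrix loaded into an array or DataFrame. Loading the pickle requires the same modeling libraries that were used to create it.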