diff --git a/Qlik-Py-Init.bat b/Qlik-Py-Init.bat index 6827818..347226a 100644 --- a/Qlik-Py-Init.bat +++ b/Qlik-Py-Init.bat @@ -12,12 +12,13 @@ call activate cd .. echo. echo Installing required packages... & echo. -python -m pip install --upgrade pip +python -m pip install --upgrade setuptools pip pip install grpcio grpcio-tools numpy scipy pandas cython pip install pystan==2.17 pip install fbprophet pip install -U scikit-learn pip install hdbscan +pip install -U skater echo. echo Creating a new firewall rule for TCP port 50055... & echo. netsh advfirewall firewall add rule name="Qlik PyTools" dir=in action=allow protocol=TCP localport=50055 diff --git a/README.md b/README.md index 55cc09c..083bd15 100644 --- a/README.md +++ b/README.md @@ -17,7 +17,7 @@ This repository provides a server side extension (SSE) for Qlik Sense built usin The current implementation includes: -- **Supervised Machine Learning** : Implemented using [scikit-learn](http://scikit-learn.org/stable/index.html), the go-to machine learning library for Python. This SSE implements the full machine learning flow from data preparation, model training and evaluation, to making predictions in Qlik. +- **Supervised Machine Learning** : Implemented using [scikit-learn](http://scikit-learn.org/stable/index.html), the go-to machine learning library for Python. This SSE implements the full machine learning flow from data preparation, model training and evaluation, to making predictions in Qlik. In addition, models can be interpreted using [Skater](https://datascienceinc.github.io/Skater/overview.html). - **Unupervised Machine Learning** : Also implemented using [scikit-learn](http://scikit-learn.org/stable/index.html). This provides capabilities for dimensionality reduction and clustering. - **Clustering** : Implemented using [HDBSCAN](https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html), a high performance algorithm that is great for exploratory data analysis. 
- **Time series forecasting** : Implemented using [Facebook Prophet](https://research.fb.com/prophet-forecasting-at-scale/), a modern library for easily generating good quality forecasts. @@ -62,7 +62,7 @@ I prefer this approach for two key reasons: 4. Right click `Qlik-Py-Init.bat` and chose 'Run as Administrator'. You can open this file in a text editor to review the commands that will be executed. If everything goes smoothly you will see a Python virtual environment being set up, project files being copied, some packages being installed and TCP Port `50055` being opened for inbound communication. - If you need to change the port you can do so in the file `core\__main__.py` by opening the file with a text editor, changing the value of the `_DEFAULT_PORT` variable, and then saving the file. You will also need to update `Qlik-Py-Init.bat` to use the same port in the `netsh` command. This command will only work if you run the batch file through an elevated command prompt (i.e. with administrator privileges). - - Once the execution completes, do a quick scan of the log to see everything installed correctly. The libraries imported are: `grpcio`, `grpcio-tools`, `numpy`, `scipy`, `pandas`, `cython`, `pystan`, `fbprophet`, `scikit-learn`, `hdbscan`. Also, check that the `core` and `generated` directories have been copied successfully to the newly created `qlik-py-env` directory. + - Once the execution completes, do a quick scan of the log to see everything installed correctly. The libraries imported are: `grpcio`, `grpcio-tools`, `numpy`, `scipy`, `pandas`, `cython`, `pystan`, `fbprophet`, `scikit-learn`, `hdbscan`, `skater` and their dependencies. Also, check that the `core` and `generated` directories have been copied successfully to the newly created `qlik-py-env` directory. - If the initialization fails for any reason, you can simply delete the `qlik-py-env` directory and re-run `Qlik-Py-Init.bat`. 5. 
Now whenever you want to start this Python service you can run `Qlik-Py-Start.bat`. diff --git a/core/__main__.py b/core/__main__.py index 2f484d7..df63eb4 100644 --- a/core/__main__.py +++ b/core/__main__.py @@ -97,7 +97,8 @@ def functions(self): 25: '_sklearn', 26: '_sklearn', 27: '_sklearn', - 28: '_sklearn' + 28: '_sklearn', + 29: '_sklearn' } """ @@ -537,6 +538,11 @@ def _sklearn(request, context): for i in range(response.shape[1]-2): dtypes.append("num") + elif function == 29: + # Explain the feature importances for the model + response = model.explain_importances() + dtypes = ["str", "str", "num"] + # Get the response as SSE.Rows response_rows = utils.get_response_rows(response.values.tolist(), dtypes) diff --git a/core/_sklearn.py b/core/_sklearn.py index 860ad9d..eaa2d16 100644 --- a/core/_sklearn.py +++ b/core/_sklearn.py @@ -46,6 +46,9 @@ from sklearn.cluster import AffinityPropagation, AgglomerativeClustering, Birch, DBSCAN, FeatureAgglomeration, KMeans,\ MiniBatchKMeans, MeanShift, SpectralClustering +from skater.model import InMemoryModel +from skater.core.explanations import Interpretation + import _utils as utils from _machine_learning import Preprocessor, PersistentModel import ServerSideExtension_pb2 as SSE @@ -683,6 +686,23 @@ def calculate_metrics(self, caller="external"): self.model.confusion_matrix.loc[:,"model_name"] = self.model.name self.model.confusion_matrix = self.model.confusion_matrix.loc[:,\ ["model_name", "true_label", "pred_label", "count"]] + + if self.model.calc_feature_importances: + # Calculate model agnostic feature importances using the skater library + interpreter = Interpretation(self.X_test, feature_names=self.model.features_df.index.tolist()) + + try: + # We use the predicted probabilities from the estimator if available + imm = InMemoryModel(self.model.pipe.predict_proba, examples = self.X_test[:10], model_type="classifier") + except AttributeError: + # Otherwise we simply use the predict method + imm = 
InMemoryModel(self.model.pipe.predict, examples = self.X_test[:10], model_type="classifier", \ + unique_values = self.model.pipe.classes_) + + # Add the feature importances to the model as a sorted data frame + self.model.importances = interpreter.feature_importance.feature_importance(imm, progressbar=False, ascending=False) + self.model.importances = pd.DataFrame(self.model.importances).reset_index() + self.model.importances.columns = ["feature_name", "importance"] elif self.model.estimator_type == "regressor": # Get the r2 score @@ -705,6 +725,18 @@ def calculate_metrics(self, caller="external"): metrics_df.loc[:,"model_name"] = self.model.name metrics_df = metrics_df.loc[:,["model_name", "r2_score", "mean_squared_error", "mean_absolute_error",\ "median_absolute_error", "explained_variance_score"]] + + if self.model.calc_feature_importances: + # Calculate model agnostic feature importances using the skater library + interpreter = Interpretation(self.X_test, feature_names=self.model.features_df.index.tolist()) + + # Set up a skater InMemoryModel to calculate feature importances using the predict method + imm = InMemoryModel(self.model.pipe.predict, examples = self.X_test[:10], model_type="regressor") + + # Add the feature importances to the model as a sorted data frame + self.model.importances = interpreter.feature_importance.feature_importance(imm, progressbar=False, ascending=False) + self.model.importances = pd.DataFrame(self.model.importances).reset_index() + self.model.importances.columns = ["feature_name", "importance"] if caller == "external": self.response = metrics_df @@ -844,7 +876,35 @@ def predict(self, load_script=False, variant="predict"): return self.response.loc[:,'result'] - # STAGE 3: Implement feature_importances_ for applicable algorithms + def explain_importances(self): + """ + Explain feature importances for the requested model + """ + + # Get the model from cache or disk based on the model_name in request + self._get_model_by_name() + + # 
Get the feature importances calculated in the calculate_metrics method + try: + self.response = self.model.importances + except AttributeError: + err = "Feature importances are not available. Check that the execution argument calculate_importances " +\ + "is set to True, and that test_size > 0 or the Calculate_Metrics function has been executed." + raise Exception(err) + + # Add the model name to the response and rearrange columns + self.response.loc[:, "model_name"] = self.model.name + self.response = self.response[["model_name", "feature_name", "importance"]] + + # Send the response table description to Qlik + self._send_table_description("importances") + + # Debug information is printed to the terminal and logs if the parameter debug = true + if self.model.debug: + self._print_log(4) + + # Finally send the response + return self.response def get_features_expression(self): """ @@ -944,6 +1004,7 @@ def _set_params(self, estimator_args, scaler_args, execution_args, metric_args=N self.model.scaler = "StandardScaler" self.model.scaler_kwargs = {} self.model.missing = "zeros" + self.model.calc_feature_importances = False # Default metric parameters: if metric_args is None: @@ -977,6 +1038,10 @@ def _set_params(self, estimator_args, scaler_args, execution_args, metric_args=N # Flag to determine if the training and test data should be saved in the model if 'retain_data' in execution_args: self.model.retain_data = 'true' == execution_args['retain_data'].lower() + + # Flag to determine if feature importances should be calculated when the fit method is called + if 'calculate_importances' in execution_args: + self.model.calc_feature_importances = 'true' == execution_args['calculate_importances'].lower() # Set the debug option for generating execution logs # Valid values are: true, false @@ -1250,6 +1315,10 @@ def _send_table_description(self, variant): self.table.fields.add(name="true_label") self.table.fields.add(name="pred_label") self.table.fields.add(name="count", 
dataType=1) + elif variant == "importances": + self.table.fields.add(name="model_name") + self.table.fields.add(name="feature_name") + self.table.fields.add(name="importance", dataType=1) elif variant == "predict": self.table.fields.add(name="model_name") self.table.fields.add(name="key") diff --git a/core/functions.json b/core/functions.json index 3faeb3a..034dff9 100644 --- a/core/functions.json +++ b/core/functions.json @@ -309,6 +309,15 @@ "b_key": 0, "n_features": 0 } + }, + { + "Id": 29, + "Name": "sklearn_Explain_Importances", + "Type": 0, + "ReturnType": 0, + "Params": { + "a_model_name": 0 + } } ] } diff --git a/docs/README.md b/docs/README.md index 2e66890..70840ad 100644 --- a/docs/README.md +++ b/docs/README.md @@ -17,7 +17,7 @@ This repository provides a server side extension (SSE) for Qlik Sense built usin The current implementation includes: -- **Supervised Machine Learning** : Implemented using [scikit-learn](http://scikit-learn.org/stable/index.html), the go-to machine learning library for Python. This SSE implements the full machine learning flow from data preparation, model training and evaluation, to making predictions in Qlik. +- **Supervised Machine Learning** : Implemented using [scikit-learn](http://scikit-learn.org/stable/index.html), the go-to machine learning library for Python. This SSE implements the full machine learning flow from data preparation, model training and evaluation, to making predictions in Qlik. In addition, models can be interpreted using [Skater](https://datascienceinc.github.io/Skater/overview.html). - **Unupervised Machine Learning** : Also implemented using [scikit-learn](http://scikit-learn.org/stable/index.html). This provides capabilities for dimensionality reduction and clustering. - **Clustering** : Implemented using [HDBSCAN](https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html), a high performance algorithm that is great for exploratory data analysis. 
- **Time series forecasting** : Implemented using [Facebook Prophet](https://research.fb.com/prophet-forecasting-at-scale/), a modern library for easily generating good quality forecasts. @@ -62,7 +62,7 @@ I prefer this approach for two key reasons: 4. Right click `Qlik-Py-Init.bat` and chose 'Run as Administrator'. You can open this file in a text editor to review the commands that will be executed. If everything goes smoothly you will see a Python virtual environment being set up, project files being copied, some packages being installed and TCP Port `50055` being opened for inbound communication. - If you need to change the port you can do so in the file `core\__main__.py` by opening the file with a text editor, changing the value of the `_DEFAULT_PORT` variable, and then saving the file. You will also need to update `Qlik-Py-Init.bat` to use the same port in the `netsh` command. This command will only work if you run the batch file through an elevated command prompt (i.e. with administrator privileges). - - Once the execution completes, do a quick scan of the log to see everything installed correctly. The libraries imported are: `grpcio`, `grpcio-tools`, `numpy`, `scipy`, `pandas`, `cython`, `pystan`, `fbprophet`, `scikit-learn`, `hdbscan`. Also, check that the `core` and `generated` directories have been copied successfully to the newly created `qlik-py-env` directory. + - Once the execution completes, do a quick scan of the log to see everything installed correctly. The libraries imported are: `grpcio`, `grpcio-tools`, `numpy`, `scipy`, `pandas`, `cython`, `pystan`, `fbprophet`, `scikit-learn`, `hdbscan`, `skater` and their dependencies. Also, check that the `core` and `generated` directories have been copied successfully to the newly created `qlik-py-env` directory. - If the initialization fails for any reason, you can simply delete the `qlik-py-env` directory and re-run `Qlik-Py-Init.bat`. 5. 
Now whenever you want to start this Python service you can run `Qlik-Py-Start.bat`. diff --git a/docs/Sample-App-scikit-learn-Train-Test.qvf b/docs/Sample-App-scikit-learn-Train-Test.qvf index b2969b9..0ec4e10 100644 Binary files a/docs/Sample-App-scikit-learn-Train-Test.qvf and b/docs/Sample-App-scikit-learn-Train-Test.qvf differ diff --git a/docs/scikit-learn.md b/docs/scikit-learn.md index bec52bb..812c78d 100644 --- a/docs/scikit-learn.md +++ b/docs/scikit-learn.md @@ -8,6 +8,7 @@ - [Preparing feature definitions](#preparing-feature-definitions) - [Setting up the model](#setting-up-the-model) - [Training and testing the model](#training-and-testing-the-model) + - [Interpreting the Model](#interpreting-the-model) - [Making predictions using the model](#making-predictions-using-the-model) - [Unsupervised Machine Learning](#unsupervised-machine-learning) - [Matrix Decomposition](#matrix-decomposition) @@ -31,7 +32,7 @@ Supervised machine learning techniques make use of known samples to train a model, and then use this model to make predictions on new data. One of the best known machine learning libraries is [scikit-learn](http://scikit-learn.org/stable/index.html#), a package that provides efficient versions of a large number of well researched algorithms. A good introduction to machine learning and the scikit-learn API is available in [this excerpt from the Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/05.01-what-is-machine-learning.html). -This SSE provides functions to train, test and evaluate models and then use these models to make predictions. The current implementation scope includes classification and regression algorithms. +This SSE provides functions to train, test and evaluate models and then use these models to make predictions. The implementation includes classification and regression algorithms. In addition this SSE also implements the unsupervised machine learning algorithms available in scikit-learn. 
These include techniques for inferring structure in unlablelled data such as clustering and dimensionality reduction. @@ -56,7 +57,9 @@ At a high-level the steps are: - `PyTools.sklearn_Calculate_Metrics(model_name, n_features)` - `PyTools.sklearn_Get_Metrics(model_name)` - `PyTools.sklearn_Get_Confusion_Matrix(model_name, n_features)` _(Only applicable to classifiers)_ -8. Get predictions from an existing model +8. Optionally, calculate feature importances to gain a better understanding of the model + - `PyTools.sklearn_Explain_Importances(model_name)` +9. Get predictions from an existing model - `PyTools.sklearn_Predict(model_name, n_features)` _(For use in chart expressions)_ - `PyTools.sklearn_Bulk_Predict(model_name, key, n_features)` _(For use in the load script)_ - `PyTools.sklearn_Predict_Proba(model_name, n_features)` _(For use in chart expressions. Only applicable to classifiers)_ @@ -220,6 +223,34 @@ LOAD EXTENSION PyTools.sklearn_Get_Confusion_Matrix(TEMP_SAMPLES{Model_Name}); ``` +### Interpreting the Model + +An understanding of the model can be gained using the `sklearn_Explain_Importances` function. This function makes use of the [Skater](https://www.datascience.com/resources/tools/skater) library to provide a degree of transparency into the model. + +Model interpretability is a developing area in machine learning, and explaining the results from more complex algorithms is a challenging prospect. For an introduction to the concepts in model interpretability refer to the [Skater documentation](https://datascienceinc.github.io/Skater/overview.html). + +We can call the `sklearn_Explain_Importances` function in the load script to understand the importance assigned to each feature by the estimator. This can help guide further exploration of the data with Qlik, for example by analyzing how the target changes with selections made to the most influential features. + +This function is only valid if `calculate_importances=true` is passed in the execution arguments. 
In addition, `test_size` should be greater than zero in the execution arguments or the `sklearn_Calculate_Metrics` function should have been called explicitly with a test dataset. + +``` +// Remember to pass calculate_importances=true in the execution arguments +LET vExecutionArgs = 'overwrite=true,test_size=0.3,calculate_importances=true'; +... + +// Use the LOAD...EXTENSION syntax to call the sklearn_Explain_Importances function +[Result-Importances]: +LOAD + model_name, + feature_name, + importance +EXTENSION PyTools.sklearn_Explain_Importances(TEMP_SAMPLES{Model_Name}); +``` + +This function is valid for all sklearn classifiers and regressors. + +For more information on the [Execution Arguments](#execution-arguments) refer to the section on Input Specifications. + ### Making predictions using the model To make predictions you need to use an existing model. A list of models can be obtained using the `sklearn_List_Models` function. This function is meant to be used in chart expressions, for example as a dimension in a table object. @@ -431,7 +462,8 @@ If you want to use default values you can simply pass an empty string for `Execu | test_size | Set the ratio that will be used to split the samples into training and testing data sets | `0.3` | Defaults to `0.33`. | | random_state | Seed used by the random number generator when generating the training testing split | `42` | Default to `42`.

Must be an integer. | | compress | Compression level between 1-9 used by joblib when saving the model | `1` | Defaults to `3`. | -| retain_data | Flag to determine if the training and test data should be saved in the model | `true`, `false` | Defaults to `false`. | +| retain_data | Flag to determine if the training and test data should be saved in the model | `true`, `false` | Defaults to `false` as this adds to the size of the model on disk. | +| calculate_importances | Flag to determine if feature importances should be calculated during model evaluation | `true`, `false` | Defaults to `false` as this adds to the processing time. | | debug | Flag to output additional information to the terminal and logs | `true`, `false` | Defaults to `false`.

Information will be printed to the terminal as well to a log file: `qlik-py-tools\qlik-py-env\core\logs\SKLearn Log .txt`. | ### Scaler Arguments