Merge pull request #9 from nabeel-oz/model-interpretation
Model interpretation
nabeel-oz committed Oct 2, 2018
2 parents 1c18122 + 868c5db commit 8be723f
Showing 8 changed files with 127 additions and 10 deletions.
3 changes: 2 additions & 1 deletion Qlik-Py-Init.bat
@@ -12,12 +12,13 @@ call activate
cd ..
echo.
echo Installing required packages... & echo.
python -m pip install --upgrade pip
python -m pip install --upgrade setuptools pip
pip install grpcio grpcio-tools numpy scipy pandas cython
pip install pystan==2.17
pip install fbprophet
pip install -U scikit-learn
pip install hdbscan
pip install -U skater
echo.
echo Creating a new firewall rule for TCP port 50055... & echo.
netsh advfirewall firewall add rule name="Qlik PyTools" dir=in action=allow protocol=TCP localport=50055
4 changes: 2 additions & 2 deletions README.md
@@ -17,7 +17,7 @@ This repository provides a server side extension (SSE) for Qlik Sense built usin

The current implementation includes:

- **Supervised Machine Learning** : Implemented using [scikit-learn](http://scikit-learn.org/stable/index.html), the go-to machine learning library for Python. This SSE implements the full machine learning flow from data preparation, model training and evaluation, to making predictions in Qlik.
- **Supervised Machine Learning** : Implemented using [scikit-learn](http://scikit-learn.org/stable/index.html), the go-to machine learning library for Python. This SSE implements the full machine learning flow from data preparation, model training and evaluation, to making predictions in Qlik. In addition, models can be interpreted using [Skater](https://datascienceinc.github.io/Skater/overview.html).
- **Unsupervised Machine Learning** : Also implemented using [scikit-learn](http://scikit-learn.org/stable/index.html). This provides capabilities for dimensionality reduction and clustering.
- **Clustering** : Implemented using [HDBSCAN](https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html), a high performance algorithm that is great for exploratory data analysis.
- **Time series forecasting** : Implemented using [Facebook Prophet](https://research.fb.com/prophet-forecasting-at-scale/), a modern library for easily generating good quality forecasts.
@@ -62,7 +62,7 @@ I prefer this approach for two key reasons:

4. Right click `Qlik-Py-Init.bat` and choose 'Run as Administrator'. You can open this file in a text editor to review the commands that will be executed. If everything goes smoothly you will see a Python virtual environment being set up, project files being copied, some packages being installed and TCP Port `50055` being opened for inbound communication.
- If you need to change the port you can do so in the file `core\__main__.py` by opening the file with a text editor, changing the value of the `_DEFAULT_PORT` variable, and then saving the file. You will also need to update `Qlik-Py-Init.bat` to use the same port in the `netsh` command. This command will only work if you run the batch file through an elevated command prompt (i.e. with administrator privileges).
- Once the execution completes, do a quick scan of the log to see everything installed correctly. The libraries imported are: `grpcio`, `grpcio-tools`, `numpy`, `scipy`, `pandas`, `cython`, `pystan`, `fbprophet`, `scikit-learn`, `hdbscan`. Also, check that the `core` and `generated` directories have been copied successfully to the newly created `qlik-py-env` directory.
- Once the execution completes, do a quick scan of the log to see everything installed correctly. The libraries imported are: `grpcio`, `grpcio-tools`, `numpy`, `scipy`, `pandas`, `cython`, `pystan`, `fbprophet`, `scikit-learn`, `hdbscan`, `skater` and their dependencies. Also, check that the `core` and `generated` directories have been copied successfully to the newly created `qlik-py-env` directory.
- If the initialization fails for any reason, you can simply delete the `qlik-py-env` directory and re-run `Qlik-Py-Init.bat`.

5. Now whenever you want to start this Python service you can run `Qlik-Py-Start.bat`.
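The "quick scan of the log" in step 4 can be complemented with a small script run inside the `qlik-py-env` environment. This is only a sketch: the import names below are assumed to correspond to the pip packages listed above (for example `grpcio` is imported as `grpc`, `grpcio-tools` as `grpc_tools`, `cython` as `Cython` and `scikit-learn` as `sklearn`), and `missing_modules` is a hypothetical helper, not part of this repository.

```python
import importlib.util

# Assumed import names for the packages installed by Qlik-Py-Init.bat
EXPECTED_MODULES = [
    "grpc", "grpc_tools", "numpy", "scipy", "pandas", "Cython",
    "pystan", "fbprophet", "sklearn", "hdbscan", "skater",
]


def missing_modules(names):
    """Return the top-level module names that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(EXPECTED_MODULES)
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All expected packages were found.")
```

`find_spec` only locates a module without importing it, so the check is quick but does not prove that a package imports cleanly.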
8 changes: 7 additions & 1 deletion core/__main__.py
@@ -97,7 +97,8 @@ def functions(self):
25: '_sklearn',
26: '_sklearn',
27: '_sklearn',
28: '_sklearn'
28: '_sklearn',
29: '_sklearn'
}

"""
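The mapping above associates each function id with the name of a handler method, and this commit registers id 29 against `_sklearn`. The sketch below illustrates that style of lookup in isolation. `ExtensionService` and `dispatch` are hypothetical names used only for illustration; how the project actually resolves and calls the handler is not shown in this diff.

```python
class ExtensionService:
    """Hypothetical stand-in for the SSE service class; not the project's code."""

    @property
    def functions(self):
        # Function id (as declared in functions.json) -> handler method name
        return {28: "_sklearn", 29: "_sklearn"}

    @staticmethod
    def _sklearn(function_id):
        return "handled function {} in _sklearn".format(function_id)

    def dispatch(self, function_id):
        """Look up the handler name for an id and call the matching method."""
        try:
            handler_name = self.functions[function_id]
        except KeyError:
            raise ValueError("Unknown function id: {}".format(function_id))
        return getattr(self, handler_name)(function_id)
```

With a table like this, an id that is declared in `functions.json` but missing from the mapping fails at lookup time, which is why the hunk adds the `29` entry alongside the new function definition.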
@@ -537,6 +538,11 @@ def _sklearn(request, context):
                for i in range(response.shape[1]-2):
                    dtypes.append("num")

        elif function == 29:
            # Explain the feature importances for the model
            response = model.explain_importances()
            dtypes = ["str", "str", "num"]

        # Get the response as SSE.Rows
        response_rows = utils.get_response_rows(response.values.tolist(), dtypes)
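The new branch pairs the three response columns with the data types `["str", "str", "num"]` before the rows are converted. The body of `utils.get_response_rows` is not part of this diff, so the following is only a sketch of what such a conversion could look like. `to_rows` is a hypothetical helper, and the `strData` / `numData` keys are assumed to mirror the fields of the SSE `Dual` message.

```python
def to_rows(records, dtypes):
    """Convert row lists into typed cells; a stand-in for utils.get_response_rows."""
    rows = []
    for record in records:
        if len(record) != len(dtypes):
            raise ValueError(
                "Row has {} values but {} dtypes were given".format(len(record), len(dtypes))
            )
        cells = []
        for value, dtype in zip(record, dtypes):
            if dtype == "num":
                cells.append({"numData": float(value)})
            else:
                cells.append({"strData": str(value)})
        rows.append(cells)
    return rows


# One row of the importances response: model_name, feature_name, importance
example = to_rows([["churn_model", "age", 0.25]], ["str", "str", "num"])
```

The length check makes a mismatch between the response columns and `dtypes` fail loudly instead of silently dropping a column.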

71 changes: 70 additions & 1 deletion core/_sklearn.py
@@ -46,6 +46,9 @@
from sklearn.cluster import AffinityPropagation, AgglomerativeClustering, Birch, DBSCAN, FeatureAgglomeration, KMeans,\
MiniBatchKMeans, MeanShift, SpectralClustering

from skater.model import InMemoryModel
from skater.core.explanations import Interpretation

import _utils as utils
from _machine_learning import Preprocessor, PersistentModel
import ServerSideExtension_pb2 as SSE
@@ -683,6 +686,23 @@ def calculate_metrics(self, caller="external"):
self.model.confusion_matrix.loc[:,"model_name"] = self.model.name
self.model.confusion_matrix = self.model.confusion_matrix.loc[:,\
["model_name", "true_label", "pred_label", "count"]]

            if self.model.calc_feature_importances:
                # Calculate model agnostic feature importances using the skater library
                interpreter = Interpretation(self.X_test, feature_names=self.model.features_df.index.tolist())

                try:
                    # We use the predicted probabilities from the estimator if available
                    imm = InMemoryModel(self.model.pipe.predict_proba, examples = self.X_test[:10], model_type="classifier")
                except AttributeError:
                    # Otherwise we simply use the predict method
                    imm = InMemoryModel(self.model.pipe.predict, examples = self.X_test[:10], model_type="classifier", \
                    unique_values = self.model.pipe.classes_)

                # Add the feature importances to the model as a sorted data frame
                self.model.importances = interpreter.feature_importance.feature_importance(imm, progressbar=False, ascending=False)
                self.model.importances = pd.DataFrame(self.model.importances).reset_index()
                self.model.importances.columns = ["feature_name", "importance"]
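The classifier branch prefers `predict_proba` and falls back to `predict` when the attribute lookup raises `AttributeError`, which happens for estimators that expose no probability estimates (scikit-learn's `LinearSVC` is one such estimator). The fallback pattern can be sketched on its own with stub classes; every name below is hypothetical and no Skater or scikit-learn call is involved.

```python
class ProbabilisticClassifier:
    """Stub estimator that exposes both predict and predict_proba."""

    def predict(self, rows):
        return [1 for _ in rows]

    def predict_proba(self, rows):
        return [[0.2, 0.8] for _ in rows]


class HardClassifier:
    """Stub estimator that only exposes predict."""

    def predict(self, rows):
        return [0 for _ in rows]


def choose_prediction_fn(estimator):
    """Return (callable, name), preferring probabilities when they are available."""
    try:
        return estimator.predict_proba, "predict_proba"
    except AttributeError:
        return estimator.predict, "predict"
```

In the diff the fallback also passes `unique_values`, presumably because hard class labels carry no information about the full set of classes, whereas probability output has one column per class.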

elif self.model.estimator_type == "regressor":
# Get the r2 score
@@ -705,6 +725,18 @@ def calculate_metrics(self, caller="external"):
metrics_df.loc[:,"model_name"] = self.model.name
metrics_df = metrics_df.loc[:,["model_name", "r2_score", "mean_squared_error", "mean_absolute_error",\
"median_absolute_error", "explained_variance_score"]]

            if self.model.calc_feature_importances:
                # Calculate model agnostic feature importances using the skater library
                interpreter = Interpretation(self.X_test, feature_names=self.model.features_df.index.tolist())

                # Set up a skater InMemoryModel to calculate feature importances using the predict method
                imm = InMemoryModel(self.model.pipe.predict, examples = self.X_test[:10], model_type="regressor")

                # Add the feature importances to the model as a sorted data frame
                self.model.importances = interpreter.feature_importance.feature_importance(imm, progressbar=False, ascending=False)
                self.model.importances = pd.DataFrame(self.model.importances).reset_index()
                self.model.importances.columns = ["feature_name", "importance"]
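"Model agnostic" here means the importances are derived only from the model's prediction function and the test data, without inspecting the estimator's internals. The sketch below illustrates that idea with a simplified permutation approach in plain Python. It is not Skater's algorithm and makes no claim to reproduce its scores; `permutation_importance` and its parameters are hypothetical.

```python
import random


def permutation_importance(predict_fn, rows, feature_names, n_repeats=5, seed=0):
    """Score each feature by how much shuffling its column changes the predictions."""
    rng = random.Random(seed)
    baseline = predict_fn(rows)
    raw = {}
    for col, name in enumerate(feature_names):
        total_change = 0.0
        for _ in range(n_repeats):
            column = [row[col] for row in rows]
            rng.shuffle(column)
            # Rebuild the rows with only this one column shuffled
            perturbed = [row[:col] + [value] + row[col + 1:] for row, value in zip(rows, column)]
            predictions = predict_fn(perturbed)
            total_change += sum(abs(p - b) for p, b in zip(predictions, baseline)) / len(rows)
        raw[name] = total_change / n_repeats
    # Normalise so the scores sum to 1, then sort in descending order
    total = sum(raw.values())
    scores = {name: (value / total if total else 0.0) for name, value in raw.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# A toy "regressor" that depends on the first feature only
rows = [[float(i), float(i % 3)] for i in range(10)]
ranking = permutation_importance(lambda data: [3.0 * r[0] for r in data], rows, ["x0", "x1"])
```

Because the toy model ignores `x1`, shuffling that column never changes a prediction and its score stays at zero, while `x0` receives all of the importance.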

if caller == "external":
self.response = metrics_df
@@ -844,7 +876,35 @@ def predict(self, load_script=False, variant="predict"):

return self.response.loc[:,'result']

# STAGE 3: Implement feature_importances_ for applicable algorithms
    def explain_importances(self):
        """
        Explain feature importances for the requested model
        """

        # Get the model from cache or disk based on the model_name in request
        self._get_model_by_name()

        # Get the feature importances calculated in the calculate_metrics method
        try:
            self.response = self.model.importances
        except AttributeError:
            err = "Feature importances are not available. Check that the execution argument calculate_importances " +\
            "is set to True, and that test_size > 0 or the Calculate_Metrics function has been executed."
            raise Exception(err)

        # Add the model name to the response and rearrange columns
        self.response.loc[:, "model_name"] = self.model.name
        self.response = self.response[["model_name", "feature_name", "importance"]]

        # Send the response table description to Qlik
        self._send_table_description("importances")

        # Debug information is printed to the terminal and logs if the parameter debug = true
        if self.model.debug:
            self._print_log(4)

        # Finally send the response
        return self.response
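The method relies on the `importances` attribute existing only after `calculate_metrics` has run with the flag enabled, and turns its absence into an explanatory error. A minimal sketch of that control flow, using plain tuples instead of a pandas data frame, might look as follows. `Model` and this standalone `explain_importances` are hypothetical stand-ins, not the project's classes.

```python
class Model:
    """Hypothetical minimal model object; `importances` is only set after metrics run."""

    def __init__(self, name):
        self.name = name


def explain_importances(model):
    """Return (model_name, feature_name, importance) rows or raise a helpful error."""
    try:
        importances = model.importances
    except AttributeError:
        raise RuntimeError(
            "Feature importances are not available for model '{}'.".format(model.name)
        )
    # Build new rows instead of modifying the importances stored on the model
    return [(model.name, feature, score) for feature, score in importances]


model = Model("churn_model")
model.importances = [("age", 0.7), ("income", 0.3)]
rows = explain_importances(model)
```

One design point worth noting: in the diff, `self.response` is bound to the same data frame as `self.model.importances`, so adding the `model_name` column appears to modify the stored frame as well. The result is still correct on repeated calls, but taking a copy first would keep the cached importances unchanged.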

def get_features_expression(self):
"""
@@ -944,6 +1004,7 @@ def _set_params(self, estimator_args, scaler_args, execution_args, metric_args=N
self.model.scaler = "StandardScaler"
self.model.scaler_kwargs = {}
self.model.missing = "zeros"
self.model.calc_feature_importances = False

# Default metric parameters:
if metric_args is None:
@@ -977,6 +1038,10 @@ def _set_params(self, estimator_args, scaler_args, execution_args, metric_args=N
# Flag to determine if the training and test data should be saved in the model
if 'retain_data' in execution_args:
self.model.retain_data = 'true' == execution_args['retain_data'].lower()

# Flag to determine if feature importances should be calculated when the fit method is called
if 'calculate_importances' in execution_args:
self.model.calc_feature_importances = 'true' == execution_args['calculate_importances'].lower()

# Set the debug option for generating execution logs
# Valid values are: true, false
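Both `retain_data` and the new `calculate_importances` argument arrive as strings and are compared against `'true'` after lower-casing, so any other value leaves the flag off. The repeated pattern could be captured in one small helper, sketched below. `parse_bool_flag` is a hypothetical name, and the `strip()` call is an addition that the original comparison does not make.

```python
def parse_bool_flag(execution_args, key, default=False):
    """Read a 'true'/'false' string flag from the execution arguments."""
    if key not in execution_args:
        return default
    # Only the literal string "true" (in any letter case) enables the flag
    return execution_args[key].strip().lower() == "true"


args = {"calculate_importances": "True", "retain_data": "false"}
calc_feature_importances = parse_bool_flag(args, "calculate_importances")
retain_data = parse_bool_flag(args, "retain_data")
```

A missing key falls back to the default, which matches the `calc_feature_importances = False` default set earlier in `_set_params`.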
@@ -1250,6 +1315,10 @@ def _send_table_description(self, variant):
self.table.fields.add(name="true_label")
self.table.fields.add(name="pred_label")
self.table.fields.add(name="count", dataType=1)
elif variant == "importances":
self.table.fields.add(name="model_name")
self.table.fields.add(name="feature_name")
self.table.fields.add(name="importance", dataType=1)
elif variant == "predict":
self.table.fields.add(name="model_name")
self.table.fields.add(name="key")
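Each variant in `_send_table_description` declares the fields Qlik should expect, with `dataType=1` marking the numeric ones. The same information can be viewed as a lookup table, as in the sketch below. The `STRING` and `NUMERIC` constants are assumed to mirror the SSE data type values, and `describe_table` is a hypothetical helper that does not touch the protobuf classes.

```python
STRING, NUMERIC = 0, 1  # assumed to match the SSE data type values used in the diff

# Field layout for the variant added in this commit
TABLE_FIELDS = {
    "importances": [
        ("model_name", STRING),
        ("feature_name", STRING),
        ("importance", NUMERIC),
    ],
}


def describe_table(variant):
    """Return the field descriptions for a response variant."""
    try:
        fields = TABLE_FIELDS[variant]
    except KeyError:
        raise ValueError("Unknown table variant: {}".format(variant))
    return [{"name": name, "dataType": data_type} for name, data_type in fields]
```

The field order matches the `["str", "str", "num"]` data types that `core/__main__.py` assigns to function 29, and the two need to stay in step.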
9 changes: 9 additions & 0 deletions core/functions.json
@@ -309,6 +309,15 @@
"b_key": 0,
"n_features": 0
}
},
{
"Id": 29,
"Name": "sklearn_Explain_Importances",
"Type": 0,
"ReturnType": 0,
"Params": {
"a_model_name": 0
}
}
]
}
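A new entry in `functions.json` only works if its `Id` also appears in the `functions` mapping in `core/__main__.py`, which is why this commit touches both files. A consistency check along those lines can be sketched as follows. `unmapped_ids` is a hypothetical helper, and the entries and handler table are passed in directly because the overall layout of `functions.json` is not visible in this diff.

```python
import json

# The entry added in this commit, copied from the hunk above
NEW_ENTRY = json.loads("""
{
    "Id": 29,
    "Name": "sklearn_Explain_Importances",
    "Type": 0,
    "ReturnType": 0,
    "Params": {
        "a_model_name": 0
    }
}
""")


def unmapped_ids(entries, handlers):
    """Return the ids declared in the function entries that have no handler."""
    return [entry["Id"] for entry in entries if entry["Id"] not in handlers]


handlers = {28: "_sklearn", 29: "_sklearn"}
missing = unmapped_ids([NEW_ENTRY], handlers)
```

Running such a check at start-up would surface a forgotten mapping before a Qlik expression calls the function.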
4 changes: 2 additions & 2 deletions docs/README.md
@@ -17,7 +17,7 @@ This repository provides a server side extension (SSE) for Qlik Sense built usin

The current implementation includes:

- **Supervised Machine Learning** : Implemented using [scikit-learn](http://scikit-learn.org/stable/index.html), the go-to machine learning library for Python. This SSE implements the full machine learning flow from data preparation, model training and evaluation, to making predictions in Qlik.
- **Supervised Machine Learning** : Implemented using [scikit-learn](http://scikit-learn.org/stable/index.html), the go-to machine learning library for Python. This SSE implements the full machine learning flow from data preparation, model training and evaluation, to making predictions in Qlik. In addition, models can be interpreted using [Skater](https://datascienceinc.github.io/Skater/overview.html).
- **Unsupervised Machine Learning** : Also implemented using [scikit-learn](http://scikit-learn.org/stable/index.html). This provides capabilities for dimensionality reduction and clustering.
- **Clustering** : Implemented using [HDBSCAN](https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html), a high performance algorithm that is great for exploratory data analysis.
- **Time series forecasting** : Implemented using [Facebook Prophet](https://research.fb.com/prophet-forecasting-at-scale/), a modern library for easily generating good quality forecasts.
@@ -62,7 +62,7 @@ I prefer this approach for two key reasons:

4. Right click `Qlik-Py-Init.bat` and choose 'Run as Administrator'. You can open this file in a text editor to review the commands that will be executed. If everything goes smoothly you will see a Python virtual environment being set up, project files being copied, some packages being installed and TCP Port `50055` being opened for inbound communication.
- If you need to change the port you can do so in the file `core\__main__.py` by opening the file with a text editor, changing the value of the `_DEFAULT_PORT` variable, and then saving the file. You will also need to update `Qlik-Py-Init.bat` to use the same port in the `netsh` command. This command will only work if you run the batch file through an elevated command prompt (i.e. with administrator privileges).
- Once the execution completes, do a quick scan of the log to see everything installed correctly. The libraries imported are: `grpcio`, `grpcio-tools`, `numpy`, `scipy`, `pandas`, `cython`, `pystan`, `fbprophet`, `scikit-learn`, `hdbscan`. Also, check that the `core` and `generated` directories have been copied successfully to the newly created `qlik-py-env` directory.
- Once the execution completes, do a quick scan of the log to see everything installed correctly. The libraries imported are: `grpcio`, `grpcio-tools`, `numpy`, `scipy`, `pandas`, `cython`, `pystan`, `fbprophet`, `scikit-learn`, `hdbscan`, `skater` and their dependencies. Also, check that the `core` and `generated` directories have been copied successfully to the newly created `qlik-py-env` directory.
- If the initialization fails for any reason, you can simply delete the `qlik-py-env` directory and re-run `Qlik-Py-Init.bat`.

5. Now whenever you want to start this Python service you can run `Qlik-Py-Start.bat`.
Binary file modified docs/Sample-App-scikit-learn-Train-Test.qvf
Binary file not shown.