The Scikit-learn library is the gold standard for machine learning in Python: it lets you prepare data, analyze data, build machine learning models, and evaluate them. As the library has been around since 2007, there are extensive examples and resources to help you get started.
In this lesson, we're going to use the Scikit-learn library for the various phases of the data science cycle mentioned earlier (i.e. preparing data, analyzing data, building machine learning models, and evaluating models).
Firstly, open up a terminal window and install Scikit-learn via pip as follows:
pip install scikit-learn
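To confirm that the installation was successful, you can print the installed version from a Python session:
# Check the installed Scikit-learn version
import sklearn
print(sklearn.__version__)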
We'll start with the basics by examining Scikit-learn's tabular representation of data.
The tabular dataset for a supervised learning problem will contain both X and y variables, while the tabular dataset for an unsupervised learning problem will contain only X variables.
As an overview, X variables are also known as independent variables, and they can describe samples of interest quantitatively or qualitatively, while y variables are known as dependent variables, and they serve as target or response variables used in predictive models.
To illustrate, if we're constructing a model to predict whether individuals will have a disease or not, the disease/non-disease status is the y variable and the clinical test results are the X variables.
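As a minimal, hypothetical illustration (the column names below are made up for this example), such a dataset could be separated into X and y like this:
# Hypothetical disease dataset: clinical measurements (X) and disease status (y)
import pandas as pd

data = pd.DataFrame({
    'blood_pressure': [120, 145, 130],  # X variable (quantitative)
    'cholesterol':    [180, 240, 210],  # X variable (quantitative)
    'disease':        [0, 1, 1]         # y variable (1 = disease, 0 = no disease)
})
X = data.drop('disease', axis=1)
y = data['disease']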
A dataset can be stored as a CSV file and read with the Pandas library via the pd.read_csv() function, which returns the loaded data as a Pandas DataFrame. In the following example, we're loading a CSV file hosted in the cloud, specifically in a GitHub repo.
# Load data
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/dataprofessor/data/master/delaney_solubility_with_descriptors.csv')
df
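Printing the entire DataFrame can be overwhelming for large datasets; a quicker way to inspect it is to check its dimensions and preview the first few rows:
# Dimensions: (1144, 5), i.e. 1,144 molecules with 4 descriptors plus the logS column
df.shape

# Preview the first 5 rows
df.head()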
Next, we'll separate the DataFrame into the X and y variables, which will subsequently be used for model building.
# Separate data as X and y
X = df.drop('logS', axis=1)
y = df['logS']
This gives us the X variable (a DataFrame containing the 4 molecular descriptors: MolLogP, MolWt, NumRotatableBonds and AromaticProportion) and the following y variable:
0 -2.180
1 -2.000
2 -1.740
3 -1.480
4 -3.040
...
1139 1.144
1140 -4.925
1141 -3.893
1142 -3.790
1143 -2.581
Name: logS, Length: 1144, dtype: float64
It is common to use a data splitting function to separate the given X and y variables into training and test sets (X_train, X_test, y_train, y_test).
The code snippet below uses the train_test_split() function to split the data. In particular, the test set size is set to 0.2 (or 20%) and the random seed is set to 42 (so that the code block produces the same data split each time it is run).
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Afterwards, let's take a look at the dimensions of the 4 data subsets using the shape attribute.
- Here are the dimensions of X as determined via shape:
X_train.shape, X_test.shape
((915, 4), (229, 4))
- Here are the dimensions of y as determined via shape:
y_train.shape, y_test.shape
((915,), (229,))
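As a quick sanity check, the row counts of the two subsets add up to the 1,144 rows of the original dataset (915 + 229 = 1,144):
# Sanity check: the split preserves the total number of rows
assert len(X_train) + len(X_test) == len(df)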
Let's now use the generated data subsets to train a machine learning model.
# Model building
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(max_features=4, n_estimators=100)
rf.fit(X_train, y_train)
In the above code snippet, we're building a random forest model via the RandomForestRegressor() estimator, and we've specified that the maximum number of features to consider at each split is 4, which corresponds to all 4 X variables.
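Note that random forests also have their own internal randomness (each tree is trained on a random bootstrap sample of the data). If you want the model itself, and not just the data split, to be reproducible across runs, you can additionally pass a random_state to the estimator, as in this minimal sketch:
# Optional: fix the model's own randomness for reproducible results
rf = RandomForestRegressor(max_features=4, n_estimators=100, random_state=42)
rf.fit(X_train, y_train)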
Next, we'll apply the trained random forest model to predict the y values for both X_train and X_test. Correspondingly, the predicted values are stored in 2 variables: y_train_pred and y_test_pred. In other words, the model takes in X to predict y.
# Apply trained model to make predictions
y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)
Let's take a preview of the returned output from y_train_pred:
array([-4.46691333e+00, -8.09591000e+00, -4.08295400e+00, 5.71000000e-02,
-1.30996667e+00, -1.96770000e+00, -7.56278571e-01, -1.96031667e+00,
-3.30567810e+00, -2.12870000e+00, -4.18783333e+00, -8.19162667e+00,
-3.70431617e+00, -7.78641667e-01, -1.72077000e+00, -4.63966667e-01,
.....
-1.34834000e+00, -6.75356917e+00, -5.36715167e+00, -8.04068993e+00,
-3.82656667e+00, -1.21819667e+00, -7.25481667e-01, -1.83893000e+00,
-1.27629714e+00, -7.06956102e+00, -3.41684400e+00, -1.80697000e+00,
-5.54075000e+00, -4.73305000e+00, -6.44516500e+00])
Now that we have predicted y values for both the train and test subsets, we're going to calculate model performance metrics, namely the coefficient of determination ($R^2$) and the mean squared error (MSE).
Here's how we can calculate the model performance metrics using Scikit-learn:
- $R^2$ can be calculated using the r2_score() function.
- MSE can be calculated using the mean_squared_error() function.
For both of these metrics, we're providing 2 input arguments, namely the actual values (y_train or y_test) and the predicted values (y_train_pred or y_test_pred).
Let's put this in action:
# Evaluate model performance
from sklearn.metrics import mean_squared_error, r2_score
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
- The returned output for $R^2$ by running the following code:
train_r2, test_r2
gives us the following:
(0.9793095988902839, 0.8868388639174744)
- The returned output for MSE by running the following code:
train_mse, test_mse
gives us the following:
(0.09102390986320828, 0.49276498022937004)
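Since MSE is expressed in squared units of logS, it can also be helpful to report the root mean squared error (RMSE), which is on the same scale as the target. RMSE isn't computed in this lesson; here's a supplementary sketch of how to derive it from the MSE values above:
# RMSE is the square root of MSE and is in the same units as logS
import numpy as np
train_rmse = np.sqrt(train_mse)
test_rmse = np.sqrt(test_mse)
train_rmse, test_rmse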
Insights can be extracted from the model to help drive the decision-making process. Here's how we can figure out which features are important, i.e. which ones contribute most to the model's predictions.
Particularly, we can access the feature_importances_ attribute of the fitted model:
# Feature importance
rf.feature_importances_
This returns the following output:
array([0.82758671, 0.12854113, 0.02059104, 0.02328112])
Let's now add some context to the above set of feature importance values, namely the feature names associated with each value.
# Name of features
X_train.columns
This returns the following output:
Index(['MolLogP', 'MolWt', 'NumRotatableBonds', 'AromaticProportion'], dtype='object')
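Putting the two together, a convenient pattern is to wrap the importance values in a Pandas Series indexed by the feature names and sort it, as in this short sketch:
# Pair each feature name with its importance value and sort in descending order
import pandas as pd
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
importances.sort_values(ascending=False)
This makes it immediately clear that MolLogP contributes by far the most to the model's predictions.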
Now that we have all of the prediction results in hand, let's proceed to visualizing them. Here, we're going to create a scatter plot of the actual versus predicted logS values using Matplotlib and NumPy:
import matplotlib.pyplot as plt
import numpy as np

# Creating the scatter plot
plt.figure(figsize=(5, 5))
plt.scatter(x=y_train, y=y_train_pred, c="#7CAE00", alpha=0.3)

# Adding the trend line
z = np.polyfit(y_train, y_train_pred, 1)
p = np.poly1d(z)
plt.plot(y_train, p(y_train), '#F8766D')

# Adding the axis labels
plt.ylabel('Predicted logS')
plt.xlabel('Experimental logS')
plt.show()
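The same plotting pattern can be extended to the test subset to visually check how well the model generalizes. For example, here's a sketch that overlays both subsets in a single figure (the test-set color is an arbitrary choice):
# Overlay train and test predictions in a single scatter plot
plt.figure(figsize=(5, 5))
plt.scatter(x=y_train, y=y_train_pred, c="#7CAE00", alpha=0.3, label='Train')
plt.scatter(x=y_test, y=y_test_pred, c="#00BFC4", alpha=0.3, label='Test')
plt.ylabel('Predicted logS')
plt.xlabel('Experimental logS')
plt.legend()
plt.show()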
In this lesson, we learned the basics of using Scikit-learn to build machine learning models. In particular, we started by loading a CSV dataset, split it into training and test subsets, built a machine learning model, applied the model to predict the y values for both subsets, computed the model performance metrics, examined the feature importance, and finally visualized the prediction results.
🚀 Proceed to Project 5 to build a Streamlit app that uses Scikit-learn to create an ML model.