- Terminology
- General Definitions
- Pre-processing
- Supervised Learning
- Unsupervised Learning
- Feature Tuning
- Hyperparameter Tuning
Symbol | Meaning |
---|---|
model coefficient or parameter |
Definition | Meaning |
---|---|
Reducible Error | error that can be improved upon by a statistical learning technique. |
Irreducible Error | error that cannot be predicted and therefore cannot be reduced. |
Quantitative Variables | variables that take on numerical on continuous values. |
Qualitative Variables | variables that take on values in one of any number of different classes or categories. |
Regression | problems with a quantitative, or numerical response. |
Classification | problems with a qualitative, or classifying response. |
Cross-Validation | method for estimating the test mean squared error (MSE) using the training data. |
Bootstrap Sampling | sampling from the dataset with replacement. |
Hypothesis Testing | statistical method used in making statistical decisions using experimental |
Null Hypothesis | statement claiming that there is no relationship between X and Y(denoted by H0). |
Alternative Hypothesis | statement directly contradicting the Null Hypothesis (denoted by Ha). |
Data Leakage | Calculating statistics from full a dataset instead of just the training set. |
Hold Out Test Set | Test set not a part of cross-validation; utilized for hyperparameter tuning. |
-
Accuracy
- the porportion of correct predictions made by a model over the total # of predictions.# of correct predictions (TP + TN) / total # of predictions
-
Missclassification (Error) Rate
- the porportion of incorrect predictions made by a model over the total # of predictions.# of incorrect predictions (FP + FN) / total # of predictions
-
Recall
- the ability of a model to find all the revelant cases within a dataset.# of true positives (TP) / [# of true positives (TP) + # of false negatives (FN)]
-
Precision
- the porportion of cases the models says was revelant and cases that were actually revelant.# of true positives (TP) / [# of true positives (TP) + # of false positives (FP)]
-
F1 Score
- the optimal blend of precision and recall to take into account both metrics. F1 provides the harmonic mean, which unlike the simple mean, punishes extreme values.F1 = 2 (precision _ recall) / (precision + recall)
Typically caused by model complexity in search for better performance.
Overfitting: when the model is fitting too much noise and variance in the training data. This means it will preform well on training data, but poorly on new unseen data. Overfitting is harder to detect, as good performance on training data could allude to the model performing well.
Underfitting: when the model doesn't capture the underlying trend of the data, and does not fit well enough to the training data. This is often a result of an excessively simple model.
- Low Bias / Low Variance - most accurate model
- Low Bias / High Variance - accurate predictions and more dispersed outliers
- High Bias / Low Variance - biased predictions and minimal outliers
- High Bias / High Variance biased predictions and more disperesed outliers
Split the data into separate train (X_train, y_train)
and test (X_test, y_test)
sets
from sklearn.model_selection import train_test_split
# target values
y = df['Target_Column']
# feature values
X = df.drop('Target_Column', axis=1)
# splitting the data into train/test (70%/30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42, stratify=True)
# test_size = % of test set, stratify = proportionate labels in each set
- Improves the convergence of "steepest descent" algorithms, or machine learning models that rely on distance metrics.
- When features are on different scales, certain weights may update faster than others since the magnitude of feature values are considered in the weight updates.
- Allows for comparing measurements that have different units, and to directly compare model coefficients to each other.
Caveats
Feature scaling must always be done to new unseen data before providing it to the model. Feature scaling makes it easier to compare coefficients to one another, but harder to relate back to the original unscaled feature.
Benefits
While there is no "real" downside to scaling features, it is absolutely necessary for some models.
- Rescales data to have a mean of 0 and a standard deviation of 1.
- Sometimes referred to as "Z-score normalization".
- Rescales data values to be between 0 and 1.
Scale feature data using scale
in sklearn.preprocessing
from sklearn.preprocessing import scale
# feature values
X = df.drop('Target_Column', axis=1)
# scaled feature values
X_scaled = scale(X, axis=0)
# axis=0 standardizes each feature
# axis=1 standardizes each record
Scale data using StandardScaler
transformer in sklearn.preprocessing
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# fitting feature data using scaler
scaler.fit(X)
# each feature to have MEAN=0 and STD=1
StandardScaler(copy=True, with_mean=True, with_std=True)
# transform data for scaled feature values
samples_scaled = scaler.transform(X)
- Method that allows for training and evaluating all the data.
- Create train/test splits leaving 1/K for test data.
- Continue k number of splits to have all data evaluated and tested.
- Average the errors to get the mean error (ie. expected performance).
- K-fold CV with K=10 is common; each test set is 10% of your total data.
Caveats
completing computations K number of times can be computationally expensive for large datasets.
Benefits
cross validation provides a better sense of true performance, across the multiple potental splits.
- scikit-learn uses a "scorer object" to provide metrics cross validation.
- all scorer objects follow the convention that higher return values are better than lower return values.
- because a high RMSE is not performant, a negative RMSE is used as its scorer metric
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
# target values
y = df['Target_Column']
# feature values
X = df.drop('Target_Column', axis=1)
reg = LinearRegression()
# CROSS VALIDATING the model; cv= # of CROSS-VALIDATION folds
cv_results = cross_val_score(reg, X, y, cv=5)
# cv_results = array of CV scores (R-squared for LR)
Replace missing data using Imputer
transformer in sklearn.preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Imputer
# feature values
X = df.drop('Target_Column', axis=1)
# axis=0 indicates columns, axis=1 indicates records
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
# filling missing data using imputer
imp.fit(X)
# transform data for imputed feature values
X = imp.transform(X)
Directly converts categories into integers (ie. 1, 2, 3,...,N)
Caveats
Encoding in this way implies an ordering and relationship for the category values and integers created; the machine learning algorithm may interpret these values as being on a linear scale (ordinal) and not categorical.
Benefits
Integer encoding is easy to understand and does not incease the number of features. Categorical values can be organized prior to encoding to allow the ordinal value scale to be applicable. (ie. mild, medium, hot...1,2,3)
Converts categories into individual features that are either 0 or 1 (Dummy Variables). Can reduce feature column expansion by creating higher level categories.
Caveats
Encoding this way increases the number of features (ie. a new feature created for each unique category value). Adding new category values could require more feature columns to be added
Benefits
One-Hot encoding does not imply an ordering and relationship for the categegory values and integers created;
- Uses
historical
andlabeled
data; the machine learning model predicts a value.- Historical - known results and data from the past.
- Labeled - the desired output is known.
Machine learning models that calculate a categorical value; predicts an "assigned" category.
-
Accuracy
- great with balanced classes, but not great with unbalanced classes (ie. 100 images with 99 dog images, 1 cat image)
-
Precision
- measured by the ability of the model to identify only the relevant data points.
-
Classification algorithm that operates as:
- calculates the distance from "x" to all points in the data
- sorts the points by increasing distance from "x"
- predicts the majority label of the "k" closest points
-
Does not handle categorical data well
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
y = df['Target_Column']
X = df.drop('Target_Column', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
# defining n_neighbors for model
knn = KNeighborsClassifier(n_neighbors=6)
# fit the model to training data
knn.fit(X_train, y_train)
# predict the labels for the test data
pred = knn.predict(X_test)
# scoring the model accuracy with the train/test data
train_score = knn.score(X_train, y_train)
test_score = knn.score( X_test, y_test)
- Logistic (Sigmoid) Function takes in any value and outputs it between 0 and 1.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
y = df['Target_Column']
X = df.drop('Target_Column', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
# instantiating a LOGISTIC_REGRESSION classifier
lr = LogisticRegression()
# fit the model to the training data
lr.fit(X_train, y_train)
# predict the labels for the test data
y_pred = lr.predict(X_test)
# scoring the model accuracy with the train/test data
train_score = lr.score(X_train, y_train)
test_score = lr.score(X_test, y_test)
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
y = df['Target_Column']
X = df.drop('Target_Column', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
svm = SVC()
# fit the model to the training data
svm.fit(X_train, y_train)
# predict the labels for the test data
pred = svm.predict(X_test)
# scoring the model accuracy with the train/test data
train_score = svm.score(X_train, y_train)
test_score = svm.score(X_test, y_test)
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
y = df['Target_Column']
X = df.drop('Target_Column', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
svm = LinearSVC()
# fit the model to the training data
svm.fit(X_train, y_train)
# predict the labels for the test data
pred = svm.predict(X_test)
# scoring the model accuracy with the train/test data
train_score = svm.score(X_train, y_train)
test_score = svm.score(X_test, y_test)
Uses many trees with a random sample of features chosen as the split. Alleviates one very strong feature from creating an ensemble of highly similar trees that use the strong feature at the top split.
- a new random sample of features is chosen for every single tree, at every single split.
- in a classification random forest model,
m
features is typically chosen to be thesqrt(p)
or square root of thep
full set of features.
- Machine learning models that calculates a continuous value; predicts a continuous value like prices, scores, etc.
Definition | Meaning |
---|---|
Mean Absolute Error (MAE) | the average of the absolute value of errors ie. true value - predicted value ; does not punish large errors. |
Mean Squared Error (MSE) | the average of the absolute value of errors squared ie. (true value - predicted_value)^2 ; punishes large errors but is in squared units. |
Root Mean Square Error (RMSE) | the square root of the average of the value of errors squared ie. sqrt(true_value - predicted value)^2 ; punishes large errors and provides the same units as y. |
Ordinary Least Squares (OLS) | minimizing the sum of the squares of differences between the observed dependent variable, and the variable predicted by the linear function. ie. y=mx+b
|
Residual | the difference between the actual observed response and the predicted response. |
RSS (Residual Sum of Squares) | sum of the differences between the observed response and the predicted resonse. |
Population Regression Line | the best linear approximation to the true relationship between X and Y. |
Confidence Interval | computed at a designated "confidence level" (ie. 95th, 99th) provides the range of estimates for an unknown parameter. |
Hypothesis Test | method of statistical inference to decide if the data supports a particular hypothesis (ie. null, alternative) |
Null Hypothesis | the claim that no relationship exists between X and Y. |
Alternative Hypothesis | the claim that a relationship exists between X and Y. |
t-statistic | the ratio of the departure of the estimated value of a parameter ( |
p-value | probability of obtaining test results as extreme as the result actually observed, under the assumption the null hypothesis is correct. Small p-values reject the null hypothesis. |
Standard Error | the average amount that the sample mean differs from the actual population mean. |
RSE (Residual Standard Error) | average amount that the response will deviate from the true regression line; measures the lack of fit of the model. |
Useful tool for predicting a quantitative response, and a good jumping-off point for newer approaches.
- OLS Theory
- allows the solving for slope m and intercept b
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
y = df['Target_Column']
X = df.drop('Target_Column', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
reg = LinearRegression()
# fit the model to the training data
reg.fit(X_train, y_train)
# predict the labels for the test data
y_pred = reg.predict(X_test)
# scoring the model accuracy with the train/test data
train_score = reg.score(X_train, y_train)
test_score = reg.score(X_test, y_test)
This type of regression model considers higher order relationships on the features; it addresses two issues not addressed with linear regression:
- Non-linear feature relationships to labels.
- Interaction terms between features.
Interaction terms applies to features that are only significant when in "sync" with another feature. An interaction term is a feature created by multipling the two exisitng features together; it is then incorporated along with the original features and provided to the regression model.
PolynominalFeatures scikitlearn preprocessing library will automatically create both higher order feature polynominals and the interaction terms between all feature combinations.
A method to reduce model fitting and variance by adding more bias. Solves these model issues by:
- minimizing model complexity
- penalizing the loss function
- reducing model overfitting
- searching for an optimal penalty hyperparameter.
Caveats
introducing an additional hyperparameter means it needs to also be tuned; a multiplier is added to the penalty to decide the "strength" of the penalty.
Resources:
LASSO (Least Absolute Shrinkage and Selection Operator) Regularization adds a penalty greater than or equal to the absolute value of the magnitude of coefficients
When tuning parameter
Benefits
Generally much easier to interpret than Ridge.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
y = df['Target_Column']
X = df.drop('Target_Column', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
# normalize=True normalizes all variables
lasso = Lasso(alpha=0.1, normalize=True)
# fit the model to the training data
lasso.fit(X_train, y_train)
# predicting the labels for the test data
test_predictions = ridge.predict(X_test)
# scoring the model accuracy with the train/test data
train_score = lasso.score(X_train, y_train)
test_score = lasso.score(X_test, y_test)
# calculating the mean absolute and root mean squared errors
MAE = mean_absolute_error(y_test, test_predictions)
RMSE = np.sqrt(mean_squared_error(y_test, test_predictions))
print(f"MAE: {MAE}")
print(f"RMSE: {RMSE}")
Lasso Regression with built-in cross validation.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoCV
lasso_cv_model = LassoCV(
# ratio of alpha_min to alpha_max
eps=0.001,
# number of alphas
n_alphas=100,
# number of iterations for convergence
max_iter=10000,
# number of folds to create
cv=10
)
lasso_cv_model.fit(X_train, y_train)
# identifies the best performing alpha value
print(f"Best Performing ALPHA: {round(lasso_cv_model.alpha_, 3)}\n")
# predicting the labels for the test data
test_predictions = lasso_cv_model.predict(X_test)
# calculating the mean absolute and root mean squared errors
MAE = mean_absolute_error(y_test, test_predictions)
RMSE = np.sqrt(mean_squared_error(y_test, test_predictions))
print(f"MAE: {round(MAE, 4)}")
print(f"RMSE: {round(RMSE, 4)}")
Resources:
Regularization that adds a penalty greater than or equal to the square of the magnitude of coefficients.This helps to limit the size of coefficients, as all coefficients are shrunk by the same factor; however this does not necessarily eliminate coefficients.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
y = df['Target_Column']
X = df.drop('Target_Column', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
# normalize=True normalizes all variables
ridge = Ridge(alpha=0.1, normalize=True)
# fit the model to the training data
ridge.fit(X_train, y_train)
# predicting the labels for the test data
test_predictions = ridge.predict(X_test)
# scoring the model accuracy with the train/test data
train_score = ridge.score(X_train, y_train)
test_score = ridge.score(X_test, y_test)
# calculating the mean absolute and root mean squared errors
MAE = mean_absolute_error(y_test, test_predictions)
RMSE = np.sqrt(mean_squared_error(y_test, test_predictions))
print(f"MAE: {MAE}")
print(f"RMSE: {RMSE}")
Ridge Regression with built-in cross validation.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, SCORERS
from sklearn.model_selection import train_test_split
from sklearn.linear_model import RidgeCV
# instance of ridge model w/cross validation
ridge_cv_model = RidgeCV(
# alpha parameters to test
alphas=(0.1, 1.0, 10.0),
# identifies the number of folds to create; none = leave-one-out cv
cv=None,
# tells which scoring error to use
scoring="neg_mean_absolute_error"
)
ridge_cv_model.fit(X_train, y_train)
# identifies the best performing alpha
print(f"Best Performing ALPHA: {round(ridge_cv_model.alpha_, 3)}\n")
# predicting the labels for the test data
test_predictions = ridge_cv_model.predict(X_test)
# calculating the mean absolute and root mean squared errors
MAE = mean_absolute_error(y_test, test_predictions)
RMSE = np.sqrt(mean_squared_error(y_test, test_predictions))
print(f"MAE: {round(MAE, 4)}")
print(f"RMSE: {round(RMSE, 4)}")
Resources:
Combines L1 and L2 with the addition of an aplha parameter determining the ratio between them.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNetCV
y = df['Target_Column']
X = df.drop('Target_Column', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
elastic_cv_model = ElasticNetCV(
# ratio for L1 (lasso) to L2 (ridge) scaling
l1_ratio=[.1, .5, .9, .95, .99, 1],
# ratio of alpha_min to alpha_max
eps=0.001,
# number of alphas
n_alphas=100,
# number of iterations for convergence
max_iter=1000000,
)
elastic_cv_model.fit(X_train, y_train)
# identifies the best performing L1 to L2 ratio
print(f"Attempted L1 Ratios: {elastic_cv_model.l1_ratio}")
print(f"Best L1 Ratio: {elastic_cv_model.l1_ratio_}\n")
# identifies the best performing alpha value
print(f"Best Performing ALPHA: {round(elastic_cv_model.alpha_, 3)}\n")
# predicting the labels for the test data
test_predictions = elastic_cv_model.predict(X_test)
# calculating the mean absolute and root mean squared errors
MAE = mean_absolute_error(y_test, test_predictions)
RMSE = np.sqrt(mean_squared_error(y_test, test_predictions))
print(f"MAE: {round(MAE, 4)}")
print(f"RMSE: {round(RMSE, 4)}")
- Uses unlabeled data; the machine learning model discovers possible patterns in the data.
- clusters population into separate groups based off their behavior features.
- No historical correct label, hard to evaluate performance of an unsupervised algorithm.
-
Goal is to divide data into similarly distinct groups, defined by number of clusters
k
. -
Process is to randomly assign each point in the data to a cluster.
-
Until clusters stop changing:
- for each cluster, compute the "cluster centroid" by taking the mean vector of points in the cluster.
- assign each data point to the cluster for which the centroid is the closest.
-
With no best way to choose
k
value,elbow method
is found to be sufficient.- compute the SSE for some values of k (ie. 2, 4, 6, 8).
- when plotting
k
against the SSE, the error decreases ask
gets larger because the number of clusters increases (distortion in a cluster is also smaller).- this observation creates an "elbow effect" when SSE decreases abruptly; the
k
value at the elbow is chosen.
- this observation creates an "elbow effect" when SSE decreases abruptly; the
SSE (Sum of Squared Error): the sum of the squared distance between each member of the cluster and its centroid.
from sklearn.cluster import KMeans
X = df
# n_clusters tell model how many clusters to create
kmeans = KMeans(n_clusters=2)
# fit the model to the data
kmeans.fit(X)
# provides the centers of n_clusters
cluster_centers = kmeans.cluster_centers_
# gives the predicted labels
predictions = kmeans.labels_
Fundamentally a Confusion Matrix provides ways to compare predicted values versus true values.
- True Positive (TP) - model predicts positive, outcome is positive
- False Positive (FP) Type I Error - model predicts positive, outcome is negative
- True Negative (TN) - model predicts negative, outcome is negative
- False Negative (FN) Type II Error - model predicts negative, outcome is positive
Using a K-Nearest Neighbors Model:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
y = df['Target_Column']
X = df.drop('Target_Column', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
# defining n_neighbors for the model
knn = KNeighborsClassifier(n_neighbors=6)
# fit the model to the training data
knn.fit(X_train, y_train)
# predict the labels for the test data
y_pred = knn.predict(X_test)
# generate the confusion_matrix on actual/predicted labels
confusion_m = confusion_matrix(y_test, y_pred)
# print confusion_m = [[TP, FN], [FP,TN]]
print(confusion_m)
Using a K-Nearest Neighbors Model:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
y = df['Target_Column']
X = df.drop('Target_Column', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
# defining n_neighbors for the model
knn = KNeighborsClassifier(n_neighbors=6)
# fit the model to the training data
knn.fit(X_train, y_train)
# predict the labels for the test data
y_pred = knn.predict(X_test)
# generate the classification report on actual/predicted labels
classification_r = classification_report(y_test, y_pred)
# SUPPORT column provides # of samples of the TRUE RESPONSE that lie in that class
print(classification_report(y_test, y_pred))
'''
SPAM E-MAIL EXAMPLE:
HIGH PRECISION = not many emails predicted as spam
HIGH RECALL = predicted most spam emails correctly
'''
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
y = df['Target_Column']
X = df.drop('Target_Column', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
lr = LogisticRegression()
# predict_proba gives probabilities w/out threshold; [:,1] are labels for the '1' binary label
y_pred_prob = lr.predict_proba(X_test)[:,1]
# unpack ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
'''
fpr = FALSE POSITIVE RATE
tpr = TRUE POSITIVE RATE
'''
- Large AUC = better model; produces one AUC Score.
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
y = df['Target_Column']
X = df.drop('Target_Column', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
lr = LogisticRegression()
# fit the model to the training data
lr.fit(X_train, y_train)
# predict_proba gives probabilities w/out threshold; [:,1] are labels for the '1' binary label
y_pred_prob = lr.predict_proba(X_test)[:,1]
# compute the ROC_AUC_SCORE
score = roc_auc_score(y_test, y_pred_prob)
Using cross_val_score
from sklearn.model_selection
(Cross Validation); produces multiple AUC Scores
from sklearn.model_selection import cross_val_score
y = df['Target_Column']
X = df.drop('Target_Column', axis=1)
# instantiating a LOGISTIC_REGRESSION classifier
lr = LogisticRegression()
# CROSS_VALIDATING the model; scoring='roc_auc' provides AUC (Area Under ROC Curve) Scores
cv_results = cross_val_score(lr, X, y, cv=5, scoring='roc_auc')
# compute the average ROC_AUC_SCORE
mean_score = np.mean(cv_results)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
y = df['Target_Column']
X = df.drop('Target_Column', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
# setup HYPERPARAMETER_GRID for model hyperparameter(s)
param_grid = {'C': np.arange(0, 10, 1)}
lr = LogisticRegression()
# intantiating the gridsearch object
lr_cv = GridSearchCV(lr, param_grid, cv=5)
# fit the model to the training data
lr_cv.fit(X_train, y_train)
# returns a dict of the best performing features and values
best_p = lr_cv.best_params_
# returns the score of the best performing feature
best_score = lr_cv.best_score_
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
y = df['Target_Column']
X = df.drop('Target_Column', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
# setup HYPERPARAMETER_GRID for model hyperparameter(s)
param_grid = {'n_neighbors': np.arange(1, 50)}
knn = KNeighborsClassifier()
# intantiating the gridsearch object
knn_cv = GridSearchCV(knn, param_grid, cv=5)
# fit the model to the training data
knn_cv.fit(X_train, y_train)
# returns a dict of the best performing features and values
best_p = knn_cv.best_params_
# returns the score of the best performing feature
best_score = knn_cv.best_score_
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import ElasticNet
y = df['Target_Column']
X = df.drop('Target_Column', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
# setup HYPERPARAMETER_GRID for model hyperparameter(s)
param_grid = {'l1_ratio': np.linspace(0, 1, 30)}
elastic_net = ElasticNet()
# instantiating the gridsearch object
gm_cv = GridSearchCV(elastic_net, param_grid, cv=5)
# fit the model to the training data
gm_cv.fit(X_train, y_train)
# predict the labels for the test data
y_pred = gm_cv.predict(X_test)
# scoring the elastic_net model (REGRESSOR)
r2 = gm_cv.score(X_test, y_test)
# computing the mean_squared_error using actual/predicted labels
mse = mean_squared_error(y_test, y_pred)
# returns a dict of the best performing features and values
best_p = gm_cv.best_params_
# returns the score of the best performing feature
best_score = gm_cv.best_score_
print("Tuned ElasticNet R squared: {}".format(r2))
print("Tuned ElasticNet MSE: {}".format(mse))
Not as computationally expensive as GridSearchCV
from sklearn.model_selection import train_test_split
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
y = df['Target_Column']
X = df.drop('Target_Column', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
# setup HYPERPARAMETER_GRID for model hyperparameter(s)
param_dist = {"max_depth": [3, None], "max_features": randint(1, 9),
"min_samples_leaf": randint(1, 9), "criterion": ["gini", "entropy"]}
tree = DecisionTreeClassifier()
# instantiating the gridsearch object
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)
# fit the model to the training data
tree_cv.fit(X_train, y_train)
# returns a dict of the best performing features and values
best_p = tree_cv.best_params_
# returns the score of the best performing feature
best_score = tree_cv.best_score_
Using K-Nearest Neighbors
Model; scaling data using StandardScaler
and comparing scaled/un-scaled scores.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
y = df['Target_Column']
X = df.drop('Target_Column', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
# setup steps with transformer and estimator
steps = [('scaler', StandardScaler()),
('knn', KNeighborsClassifier())]
# instantiating pipeline
pipeline = Pipeline(steps)
# fit the pipeline, computes 'scaler' then 'knn'
knn_scaled = pipeline.fit(X_train, y_train)
# fitting unscaled training data for comparison
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)
# scoring the model accuracy with the scaled/un-scaled test data
scaled_score = knn_scaled.score(X_test, y_test)
unscaled_score = knn_unscaled.score(X_test, y_test)
Using SVC (Support Vector Machine)
Model; producing Classification Report
on predictions after cleaning up missing data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Imputer
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
y = df['Target_Column']
X = df.drop('Target_Column', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
# setup steps with transformer and estimator
steps = [('imputation', imp), ('SVM', SVC())]
# instantiating pipeline
pipeline = Pipeline(steps)
# fit the pipeline, computes 'imputation' then 'SVM'
pipeline.fit(X_train, y_train)
# predict the labels for the test data
y_pred = pipeline.predict(X_test)
# generate the classification report on actual/predicted labels
classification_r = classification_report(y_test, y_pred)
# support column provides # of samples of the TRUE RESPONSE that lie in that class
print(classification_report(y_test, y_pred))