Predictive_R_Squared.py
from numpy import diagonal, linalg
from pandas import Series


def pred_r_sq(y_train, y_test, y_pred, X):
    """
    Calculates Predictive R-Squared.

    Required Arguments: (y_train, y_test, y_pred, X)
    y_train - A pandas series produced by scikit-learn's train_test_split function.
    y_test - A pandas series produced by scikit-learn's train_test_split function.
    y_pred - A pandas series returned by the statsmodels predict function.
    X - A pandas dataframe of independent variables BEFORE they are split using
        train_test_split. This function will subset for the indices
        in y_test and y_pred. This may not be necessary, and will be reviewed
        in future editions.

    This function calculates the Predictive R-Squared value in a number of steps.

    First, it calculates the 'ordinary residuals' by subtracting y_test from y_pred.

    Second, it calculates the diagonal of the hat matrix of X and subsets this to
    the indices of y_test and y_pred. The ordinary residuals and the hat matrix
    diagonal are used to calculate the 'deleted residuals', which saves us from
    having to re-fit the model for each value in the test set **[1]**.

    The deleted residuals enter the PRESS formula **[2]**. Each value in the series
    is squared, and then the sum of the series is taken to return a scalar, PRESS.

    Next up is the calculation of Sum of Squares Total (SST), which is calculated
    as the summation of (y_test - y_train_mean)**2. This is used in the final
    formula for calculating Predictive R-Squared.

    Finally, Predictive R-Squared is calculated as 1-(PRESS/SST). The value is
    left in decimal rather than percentage form and is returned inside a
    labelled string.

    **[1]** - Applied Linear Regression Models - Fourth Ed. - Kutner, Nachtsheim,
              Neter (pages 360-361, 395)
    **[2]** - https://rpubs.com/RatherBit/102428

    Additional Resources:
    - https://blog.minitab.com/en/adventures-in-statistics-2/multiple-regession-analysis-use-adjusted-r-squared-and-predicted-r-squared-to-include-the-correct-number-of-variables
    - https://online.stat.psu.edu/stat501/lesson/10/10.5
    - https://stackoverflow.com/questions/23926496/computing-the-trace-of-a-hat-matrix-from-and-independent-variable-matrix-with-a
    - https://stats.stackexchange.com/questions/208242/hat-matrix-and-leverages-in-classical-multiple-regression
    """
    # Calculate the ordinary residuals. These will feed forward into the
    # deleted residuals. pandas aligns the two series on their index labels.
    ordinary_residuals = y_pred - y_test

    # Calculate the diagonal of the hat matrix H = X (X'X)^-1 X'. This will
    # feed forward into the deleted residuals.
    X_values = X.to_numpy()
    hat = X_values @ linalg.inv(X_values.T @ X_values) @ X_values.T

    # Subset the hat diagonal to the test set. Selecting by index label
    # (rather than by position) keeps this correct when X does not have a
    # default 0..n-1 index.
    hat_diag = Series(diagonal(hat), index=X.index)
    hat_diag = hat_diag.loc[y_pred.index]

    # Calculate the deleted residuals. These will feed forward into the PRESS
    # formula.
    deleted_residuals = ordinary_residuals / (1 - hat_diag)

    # Calculate PRESS. This will feed forward into the Predictive R-Squared
    # formula.
    PRESS = (deleted_residuals ** 2).sum()

    # Calculate SST. This will feed forward into the Predictive R-Squared
    # formula.
    y_train_mean = y_train.mean()
    SST = ((y_test - y_train_mean) ** 2).sum()

    # Calculate Predictive R-Squared
    PRS = 1 - (PRESS / SST)
    return "Predictive R-Squared: " + str(PRS)
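
The docstring's claim that deleted residuals avoid re-fitting the model can be checked numerically. The block below is a minimal sketch, not part of the original module: it assumes plain NumPy arrays, a model fitted by ordinary least squares on all rows of X (the setting in which the identity e_i / (1 - h_ii) is exact), and SST taken around the mean of that same y. The helper names `leverages`, `press_and_predicted_r2` and `press_by_refitting` are hypothetical. It also obtains the hat diagonal from a reduced QR decomposition, which is a swapped-in technique that avoids building the full n-by-n hat matrix.

```python
import numpy as np


def leverages(X):
    """Diagonal of H = X (X'X)^-1 X', computed without forming H."""
    # Reduced QR gives X = Q R with orthonormal columns in Q, so H = Q Q'
    # (assuming X has full column rank) and h_ii is the squared norm of row i.
    Q, _ = np.linalg.qr(X)
    return np.sum(Q ** 2, axis=1)


def press_and_predicted_r2(X, y):
    """PRESS and predicted R-squared from a single least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta            # sign is irrelevant once squared
    deleted = residuals / (1.0 - leverages(X))
    press = float(np.sum(deleted ** 2))
    sst = float(np.sum((y - y.mean()) ** 2))
    return press, 1.0 - press / sst


def press_by_refitting(X, y):
    """PRESS computed the slow way: refit with each row left out in turn."""
    n = len(y)
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        total += float((y[i] - X[i] @ beta) ** 2)
    return total


# Synthetic data, used for illustration only.
rng = np.random.default_rng(0)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

press, pred_r2 = press_and_predicted_r2(X, y)
press_loo = press_by_refitting(X, y)
print(press, press_loo, pred_r2)
```

The two PRESS values should agree to floating-point precision, and the leverages should sum to the number of columns in X, which is a quick sanity check on any hat-diagonal computation.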