Parsnips are kind of like carets, right?
This is a collection of Python tools to help me with machine learning projects. I come from the R (caret) world and sorely miss a lot of the tools I loved over there. This repo is mostly aimed at bridging that gap.
[x] Restricted Cubic Splines - Under construction.
[ ] Dummy Vars
[ ] NZV
[ ] Correlated variables
[ ] Linear Deps
[ ] Centering and Scaling
[ ] Imputation
[ ] Concurrency
For predictors that are not known to act linearly, regression splines are an approach that maximizes power and assumes only a smooth relationship. Flexible parametric approaches such as this allow standard inference techniques (P-values, confidence limits) to be used.
Relationships among variables are seldom linear. It is a common belief among practitioners who do not study bias and efficiency in depth that non-linearity should be dealt with by chopping continuous variables into intervals. Problems caused by this kind of categorization include the following:
- Estimated values will have reduced precision, and associated tests will have reduced power.
- Categorization assumes that the relationship between the predictor and the response is flat within intervals.
- Binning can require fitting more parameters than the smoothing methods implemented here would.
- Prevalence between bins can vary greatly.
- Using the outcome to determine cutpoints risks type I and II errors. Not using the outcome is suboptimal.
- “Optimal” cutpoints do not replicate well over studies.
Polynomials have some undesirable properties. They can create undesirable peaks and valleys, and the fit in one region of X can be greatly affected by data in other regions. They may also not adequately fit many functional forms such as logarithmic functions or “threshold” effects.
A draftsman’s spline is a flexible strip of metal or rubber used to draw curves. Spline functions are piecewise polynomials used in curve fitting. That is, they are polynomials within intervals of X that are connected across different intervals of X.
This package creates splines by constructing a design matrix from a truncated power basis and fitting a linear model to it. B-splines are a more numerically stable way to form a design matrix; however, they are more complex and do not allow for extrapolation beyond the outer knots. Additionally, the truncated power basis seldom presents estimation problems when modern methods such as QR decomposition are used for matrix inversion.
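
To illustrate the point about QR decomposition, here is a minimal sketch of a least-squares fit that factors the design matrix instead of inverting X'X. The design matrix is an unrestricted cubic truncated power basis, where (u)_+ means max(u, 0). The function names `cubic_truncated_power_basis` and `fit_qr` are hypothetical and chosen for this example only; they are not this package's API.

```python
import numpy as np


def cubic_truncated_power_basis(x, knots):
    """Columns: 1, X, X^2, X^3, then (X - t)_+^3 for each knot t."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    # (u)_+^3: cube of the positive part, one column per knot
    truncated = np.maximum(x[:, None] - knots[None, :], 0.0) ** 3
    return np.column_stack([np.ones_like(x), x, x**2, x**3, truncated])


def fit_qr(design, y):
    """Least squares via reduced QR: solve R beta = Q'y."""
    q, r = np.linalg.qr(design)
    return np.linalg.solve(r, q.T @ y)


x = np.linspace(0.0, 1.0, 50)
design = cubic_truncated_power_basis(x, [0.25, 0.5, 0.75])
y = 1.0 + 2.0 * x - 0.5 * x**3  # noise-free cubic, for illustration only
beta = fit_qr(design, y)
fitted = design @ beta
```

Because the toy response is an exact cubic, the fit should reproduce it and the coefficients on the truncated terms should come out near zero.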
The simplest spline function is a linear spline function, a piecewise linear function. Suppose that the x axis is divided into intervals with endpoints at a, b, and c, called knots. The linear spline function is given by:
\begin{equation} f(X) = \beta_0 + \beta_1 X + \beta_2 (X - a)_+ + \beta_3 (X - b)_+ + \beta_4 (X - c)_+ \end{equation}
where
\begin{equation} (u)_+ = \begin{cases} u, & u > 0 \\ 0, & u \le 0 \end{cases} \end{equation}
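
The linear spline above can be sketched in a few lines of NumPy. This is an illustration under assumptions, not this package's implementation: the function name `linear_spline_basis` and the toy data are made up for the example.

```python
import numpy as np


def linear_spline_basis(x, knots):
    """Columns X, (X - a)_+, (X - b)_+, ... (no intercept column)."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    # (u)_+ is u when u > 0 and 0 otherwise
    hinge = np.maximum(x[:, None] - knots[None, :], 0.0)
    return np.column_stack([x, hinge])


x = np.linspace(0.0, 10.0, 101)
basis = linear_spline_basis(x, knots=[2.0, 5.0, 8.0])

# Toy response with a single kink at X = 5: |X - 5| = 5 - X + 2 (X - 5)_+
y = np.abs(x - 5.0)

# Add the intercept column and fit beta_0 ... beta_4 by least squares
design = np.column_stack([np.ones_like(x), basis])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted = design @ beta
```

Since the toy response is itself piecewise linear with a kink at one of the knots, the fit is exact and only the coefficient on the (X - 5)_+ term needs to be non-zero among the knot terms.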
Although the linear spline is simple and can approximate many common relationships, it is not smooth and will not fit highly curved functions well. These problems can be overcome by using piecewise polynomials of order higher than linear. Cubic polynomials have been found to have nice properties, with good ability to fit sharply curving shapes. Cubic splines can be made smooth at the join points (knots) by forcing the first and second derivatives of the function to agree at the knots.
Stone and Koo found that cubic spline functions have a drawback: they can be poorly behaved in the tails, that is, before the first knot and after the last knot. They cite advantages of constraining the function to be linear in the tails. Their restricted cubic spline function (also called a natural spline) has the additional advantage that only k - 1 parameters must be estimated besides the intercept, as opposed to k + 3 with unrestricted cubic splines.
Stone found that the location of knots in a restricted cubic spline model is not crucial in most situations; the fit depends much more on the choice of k, the number of knots. Placing knots at fixed quantiles of a predictor’s marginal distribution is a good approach in most datasets. This ensures that enough points are available in each interval, and also guards against letting outliers overly influence knot placement.
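
As a rough sketch, one common truncated power formulation of the restricted cubic spline builds k - 1 columns from k knots and cancels the cubic and quadratic terms beyond the last knot. The exact formula and the name `rcs_basis` are assumptions made for illustration; the code in this repo may scale or arrange the columns differently.

```python
import numpy as np


def rcs_basis(x, knots):
    """Restricted cubic spline columns (no intercept): k knots -> k - 1 columns."""
    x = np.asarray(x, dtype=float)
    t = np.sort(np.asarray(knots, dtype=float))
    k = t.size
    if k < 3:
        raise ValueError("a restricted cubic spline needs at least 3 knots")

    def cube_plus(u):
        # (u)_+^3: cube of the positive part
        return np.maximum(u, 0.0) ** 3

    gap = t[-1] - t[-2]  # distance between the last two knots
    columns = [x]        # first column is the linear term
    for j in range(k - 2):
        # The two correction terms make the column linear past the last knot
        column = (
            cube_plus(x - t[j])
            - cube_plus(x - t[-2]) * (t[-1] - t[j]) / gap
            + cube_plus(x - t[-1]) * (t[-2] - t[j]) / gap
        )
        columns.append(column)
    return np.column_stack(columns)


knots = [1.0, 3.0, 5.0, 7.0, 9.0]
basis = rcs_basis(np.linspace(0.0, 10.0, 201), knots)

# Check linearity in the upper tail: second differences on an even grid vanish
upper_tail = rcs_basis(np.linspace(10.0, 20.0, 11), knots)
second_diff = np.diff(upper_tail, n=2, axis=0)
```

With five knots the basis has four columns, matching the k - 1 count above. Before the first knot every truncated term is zero, so only the linear column is active there.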
k | Quantiles |
---|---|
3 | [.1, .5, .9] |
4 | [.05, .35, .65, .95] |
5 | [.05, .275, .5, .725, .95] |
6 | [.05, .23, .41, .59, .77, .95] |
7 | [.025, .1833, .3417, .5, .6583, .8167, .975] |
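
The table above translates directly into a lookup plus `numpy.quantile`. This is a minimal sketch; the names `KNOT_QUANTILES` and `default_knots` are hypothetical and not part of this package.

```python
import numpy as np

# Quantiles from the table above, keyed by the number of knots k
KNOT_QUANTILES = {
    3: [0.10, 0.50, 0.90],
    4: [0.05, 0.35, 0.65, 0.95],
    5: [0.05, 0.275, 0.50, 0.725, 0.95],
    6: [0.05, 0.23, 0.41, 0.59, 0.77, 0.95],
    7: [0.025, 0.1833, 0.3417, 0.50, 0.6583, 0.8167, 0.975],
}


def default_knots(x, k):
    """Knot locations at the tabulated quantiles of x's marginal distribution."""
    if k not in KNOT_QUANTILES:
        raise ValueError("k must be between 3 and 7")
    return np.quantile(np.asarray(x, dtype=float), KNOT_QUANTILES[k])


x = np.arange(101.0)        # toy predictor: 0, 1, ..., 100
knots = default_knots(x, 3)  # 10th, 50th and 90th percentiles of x
```

Because the knots come from the predictor's own distribution, a few extreme values shift them very little.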