A Python library of 'old school' machine learning algorithms such as linear regression, logistic regression, naive Bayes, k-nearest neighbors, decision trees, support vector machines, and meta-models like bootstrap aggregating and gradient boosting. OSML uses the pandas and NumPy libraries for data manipulation and basic linear algebra operations.
The primary components of the library are the machine learning models. They can be grouped into two major categories: transformative and predictive models.
The models of the library only operate on numerical data. Explanatory variables may have any of the following data types:
- Continuous (CN)
- Binary Categorical (BC)
- Ordinal Categorical (OC)
- Nominal Categorical (NC)
All columns with `float` data type are assumed to represent continuous features, while all `int` columns correspond to categorical features. Every model supports continuous variables, but not all can handle categorical variables.
Transformative models can be fitted to data sets and then used to transform new data sets using the parameters learnt during fitting. All of these models assume the features to be continuous. The most common use case of such models is pre-processing. Most transformative models use unsupervised learning, meaning that they do not require labelled training data. The transformative models provided by OSML are as follows.
Model | Supervised | Objectives | Supported Feature Types |
---|---|---|---|
Centering | No | zero mean | CN |
Standardization | No | zero mean, unit standard deviation | CN |
PrincipalComponentAnalysis | No | linear independence, dimensionality reduction | CN |
LinearDiscriminantAnalysis | Yes | linear independence, class separation, dimensionality reduction | CN |
BoxCoxTransformation | No | univariate normal distribution | CN |
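For instance, a transformer can be fitted to one data frame and then used to transform new observations with the same learnt parameters. The snippet below is a minimal sketch of this pattern using `Standardization`; it relies only on the `fit_transform` and `transform` methods that also appear in the full example at the end of this document.

```python
import pandas as pd
import osml.transform as ost

# A small data frame of continuous (float) features.
observations = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 6.3, 5.8],
    'petal_width': [0.2, 0.1, 1.8, 1.2]
})
transformer = ost.Standardization()
# Unsupervised transformers take no labels, hence the None argument.
standardized = transformer.fit_transform(observations, None)
# New observations are transformed using the mean and standard deviation learnt above.
new_observations = pd.DataFrame({'sepal_length': [6.0], 'petal_width': [1.5]})
print(transformer.transform(new_observations))
```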
All the predictive models provided by the library are supervised machine learning algorithms. Despite this commonality, the algorithms are quite different in many ways. Predictors can be categorized based on several criteria. One such criterion is whether they are parameterized by a fixed number of parameters or not. Parametric models depend on a finite number of parameters regardless of the number of observations in the training data. On the other hand, non-parametric predictive models do not have a fixed number of parameters. However, in spite of what their name suggests, they do have parameters. In fact, the complexity (number of parameters) of non-parametric models grows with the number of observations in the training data.
Another criterion to group predictive algorithms by is the type of predictions they make. Based on this, there are three main categories of predictive models which are described below.
Regression models predict continuous dependent variables. They can be used, for example, to predict house prices and temperatures. The data type of such predictions is always `float`.
Binary classifiers predict binary dependent variables. The only two values the predictions can take on are 0 and 1 (after rounding to the closest integer). These values usually encode a true or false relationship, with 1 meaning that the observation is predicted to belong to the class in question and 0 meaning it is not. The data type of these models' predictions is always `int`. Use cases of binary classifiers include, for example, the prediction of whether a mushroom is edible or whether a transaction is fraudulent. With some work, binary classifiers can be applied to multinomial classification problems as well. This is done by fitting a binary predictor to each class and, when it comes to inference, running the observation through all the binary predictors and selecting the class whose predictor made the most confident positive prediction.
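The sketch below illustrates this one-vs-rest scheme. It is only a rough outline of the idea rather than the library's actual implementation (see the `MultiBinaryClassification` meta-model later in this document), and it assumes that the binary classifier's `fuzzy_predict` method (described below) returns an unrounded, probability-like score for the positive class.

```python
import copy
import numpy as np

def one_vs_rest(binary_classifier, training_observations, training_labels, test_observations):
    # Fit one copy of the binary classifier per class, relabelling the training data
    # so that the current class maps to 1 and every other class maps to 0.
    classes = sorted(training_labels.unique())
    confidences = []
    for c in classes:
        classifier = copy.deepcopy(binary_classifier)
        classifier.fit(training_observations, (training_labels == c).astype(int))
        # Collect the unrounded, probability-like predictions of the classifier.
        confidences.append(classifier.fuzzy_predict(test_observations))
    # For each test observation, select the class whose classifier is most confident.
    confidences = np.column_stack(confidences)
    return np.array(classes)[confidences.argmax(axis=1)]
```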
Classifiers predict categorical dependent variables. They are used to predict which class an observation belongs to. Examples include the prediction of which animal an image represents or which species of flower a specimen belongs to based on its petal measurements. Classification models make `int` predictions. All classifiers can be used as binary classifiers as well.
It is worth noting that classifiers and binary classifiers can be further categorized into the following types.
- Generative (G)
- Probabilistic Discriminative (PD)
- Non-probabilistic Discriminative (NPD)
Generative classifiers model the joint probability P(x ∩ y) = P(x | y) * P(y), i.e. the product of the marginal probability of the label being y according to a PMF and the conditional probability of an observation taking on the value x given the label y according to a multivariate PDF (or the compound probability of the independent individual features being equal to the corresponding elements of the vector x according to their univariate PDFs or PMFs). This allows for the Bayesian inference of the conditional probability of the label being y given an observation x as in P(y | x) = P(x ∩ y) / P(x). The prediction of the generative model is then argmax over y of f(y) = P(x ∩ y). Note that since P(x) does not depend on y, it can be safely ignored.
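As a toy illustration of this decision rule, the sketch below scores each class by the product of its prior P(y) and per-feature Gaussian likelihoods, i.e. a naive Bayes style estimate of P(x ∩ y) for continuous features. It illustrates the idea only and does not reproduce the library's `NaiveBayes` implementation.

```python
import numpy as np
import pandas as pd

def gaussian_pdf(x, mean, std):
    # Univariate normal density used as the per-feature likelihood P(x_i | y).
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def generative_predict(training_observations, training_labels, test_observations):
    classes = sorted(training_labels.unique())
    predictions = []
    for _, x in test_observations.iterrows():
        scores = []
        for c in classes:
            class_rows = training_observations[training_labels == c]
            prior = len(class_rows) / len(training_observations)  # P(y)
            # Product of the independent per-feature likelihoods: P(x | y).
            likelihood = np.prod([gaussian_pdf(x[column], class_rows[column].mean(), class_rows[column].std())
                                  for column in training_observations.columns])
            scores.append(prior * likelihood)  # P(x ∩ y) = P(x | y) * P(y)
        # The prediction is the class maximizing the joint probability.
        predictions.append(classes[int(np.argmax(scores))])
    return pd.Series(predictions, index=test_observations.index)
```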
Unlike generative classifiers, probabilistic discriminative classifiers model the conditional probability P(y | x) directly. However, not all discriminative models are probabilistic. Non-probabilistic discriminative classifiers do not estimate distributions but learn mapping functions between observations and labels one way or another.
Most predictive algorithms rely on certain assumptions about the data they are fitted to. For example, naive Bayes assumes that the features are independent (their covariance matrix is diagonal) and the continuous features are normally distributed, k-nearest neighbors assumes that the variance of a feature is proportional to its explanatory power, linear discriminant analysis assumes that all features are normally distributed and the covariance matrix of the features is the same across all classes, etc. It is good to know these assumptions and preprocess the training data accordingly using transformative models.
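For instance, decorrelating the features with principal component analysis before fitting a naive Bayes model brings the data closer to the independence assumption. The snippet below sketches this idea; the no-argument constructors of `PrincipalComponentAnalysis` and `NaiveBayes`, and the exact class names in the `predict` module, are assumptions rather than verified API.

```python
import osml.data as osd
import osml.transform as ost
import osml.predict as osp

data_set = osd.IrisDataSet('data/iris/iris.csv')
# Project the features onto their principal components so that they are linearly independent.
pca = ost.PrincipalComponentAnalysis()
naive_bayes = osp.NaiveBayes()
naive_bayes.fit(pca.fit_transform(data_set.get_training_observations(), None),
                data_set.get_training_labels())
predictions = naive_bayes.predict(pca.transform(data_set.get_test_observations()))
print(naive_bayes.evaluate(predictions, data_set.get_test_labels()))
```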
It is also possible to combine multiple models using a meta-model to improve their predictive power or to use the models for different problem types. For example, a meta-model can employ n binary classifiers to function as a predictor for classification problems with n classes. Meta-models can also utilize an ensemble of weak learners to create a more powerful predictor and reduce bias or variance. Examples of such ensemble algorithms include bootstrap aggregating and gradient boosting. In addition to these three generic meta-models, OSML features the following predictive models.
Model | Parametric | Supported Problem Types | Supported Feature Types |
---|---|---|---|
LinearRegression | Yes | Regression | CN, BC, OC |
LinearRidgeRegression | Yes | Regression | CN, BC, OC |
LinearLassoRegression | Yes | Regression | CN, BC, OC |
LogisticRegression | Yes | BinaryClassification (PD) | CN, BC, OC |
LogisticRidgeRegression | Yes | BinaryClassification (PD) | CN, BC, OC |
NaiveBayes | Yes | Classification (G) | CN, BC, OC, NC |
LinearDiscriminantAnalysis | Yes | Classification (G) | CN |
QuadraticDiscriminantAnalysis | Yes | Classification (G) | CN |
KNearestNeighbors | No | Classification (NPD), Regression | CN, BC, OC |
DecisionTree | No | Classification (NPD), Regression | CN, BC, OC, NC |
SupportVectorMachine | No | BinaryClassification (NPD) | CN, BC, OC |
BaggedTrees | No | Classification (NPD), Regression | CN, BC, OC, NC |
RandomForest | No | Classification (NPD), Regression | CN, BC, OC, NC |
BoostedTrees | No | BinaryClassification (PD), Regression | CN, BC, OC, NC |
OSML also provides a number of different data sets the models can be tested on. It also includes functions for splitting observations and labels into training and test data sets, randomly shuffling data, and oversampling. The table below showcases the data sets, including their default problem types and the types of their features.
Data Set | Default Problem Type | Feature Types |
---|---|---|
Boston | Regression | CN, BC |
Exam | Binary Classification | CN |
Iris | Classification | CN |
Titanic | Binary Classification | CN, BC, OC |
Mushroom | Binary Classification | BC |
IMDb | Classification | CN, BC |
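For example, a data set object can be constructed from its CSV file and its training and test observations and labels accessed via getter methods, as in the short snippet below (based on the `IrisDataSet` usage in the full example at the end of this document).

```python
import osml.data as osd

data_set = osd.IrisDataSet('data/iris/iris.csv')
# The four collections of a data set: training/test observations and labels.
training_observations = data_set.get_training_observations()
training_labels = data_set.get_training_labels()
test_observations = data_set.get_test_observations()
test_labels = data_set.get_test_labels()
print(training_observations.shape, test_observations.shape)
```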
The library has an API similar to that of sklearn. Data sets generally contain four collections of data: a training data frame, a series of training labels, a test data frame, and a series of test labels. Each model has a `fit` function that takes a data frame of observations and a series of labels as its arguments. The `predict` function can be used on fitted models to make predictions about the labels of a data frame of observations. All models have an `evaluate` function as well that measures the fit of a series of predictions to a series of labels based on some metric. This metric is the root mean squared error for regression models, accuracy for classification models, and the F1 score for binary classification models. Finally, models that optimize a loss function also have a `test` method that measures the error of the model's predictions according to the loss function.
Binary classification models can also be used to make predictions that are not rounded to the closest integer via the `fuzzy_predict` function. Moreover, they can evaluate the precision, recall, and fall-out of predictions using the `evaluate_precision`, `evaluate_recall`, and `evaluate_fall_out` methods. They can also evaluate the receiver operating characteristic curve and the area under it using `evaluate_receiver_operating_characteristic`.
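The sketch below illustrates these methods on a binary classification problem. It assumes a `MushroomDataSet` class analogous to `IrisDataSet`, a no-argument `LogisticRegression` constructor, and that the evaluation methods take a series of predictions and a series of labels like `evaluate` does; all of these are illustrative assumptions rather than verified API.

```python
import osml.data as osd
import osml.predict as osp

data_set = osd.MushroomDataSet('data/mushroom/mushroom.csv')
classifier = osp.LogisticRegression()
classifier.fit(data_set.get_training_observations(), data_set.get_training_labels())
# Unrounded, probability-like predictions instead of hard 0/1 labels.
fuzzy_predictions = classifier.fuzzy_predict(data_set.get_test_observations())
# Hard 0/1 predictions for precision, recall, and fall-out.
predictions = classifier.predict(data_set.get_test_observations())
test_labels = data_set.get_test_labels()
print('precision:', classifier.evaluate_precision(predictions, test_labels))
print('recall:', classifier.evaluate_recall(predictions, test_labels))
print('fall-out:', classifier.evaluate_fall_out(predictions, test_labels))
# Area under the ROC curve; the argument types are assumed.
print('ROC AUC:', classifier.evaluate_receiver_operating_characteristic(fuzzy_predictions, test_labels))
```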
The `MultiBinaryClassification` meta-model can turn a binary classifier into a multinomial classifier by fitting a copy of the binary classifier instance to each class of the multinomial data set. Other meta-models such as `BootstrapAggregating` and `GradientBoosting` function as model ensembles, using multiple instances of a model to enhance its predictive power.
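As a sketch of how such a meta-model might be used, the snippet below wraps a binary logistic regression classifier in `MultiBinaryClassification` to handle the three classes of the iris data set; the constructor argument (the wrapped binary classifier) and the `LogisticRegression` class name are assumptions rather than verified API.

```python
import osml.data as osd
import osml.predict as osp

data_set = osd.IrisDataSet('data/iris/iris.csv')
# Turn a binary classifier into a multinomial classifier by fitting one copy per class.
# The constructor argument is an assumption.
classifier = osp.MultiBinaryClassification(osp.LogisticRegression())
classifier.fit(data_set.get_training_observations(), data_set.get_training_labels())
predictions = classifier.predict(data_set.get_test_observations())
print('accuracy:', classifier.evaluate(predictions, data_set.get_test_labels()))
```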
The code snippet below demonstrates the usage of the library on a simple example of fitting a random forest to a classification data set and using it to make predictions on new observations.
```python
import osml.data as osd
import osml.transform as ost
import osml.predict as osp

# Load the data set and set up the pre-processing and predictive models.
data_set = osd.IrisDataSet('data/iris/iris.csv')
transformer = ost.Standardization()
predictor = osp.RandomForestClassification(number_of_models=64)
# Standardize the training observations and fit the random forest to them.
predictor.fit(transformer.fit_transform(data_set.get_training_observations(), None), data_set.get_training_labels())
# Apply the same transformation to the test observations and predict their labels.
predictions = predictor.predict(transformer.transform(data_set.get_test_observations()))
# Measure the accuracy of the predictions.
accuracy = predictor.evaluate(predictions, data_set.get_test_labels())
print('accuracy:', accuracy)
```