This is a Python implementation of scikit-learn estimators that use the `partial_fit` method for distributed learning.
Implemented methods:

- `linear_model`: calls `SGDRegressor` or `SGDClassifier`
- `neural_network`: calls `MLPRegressor` or `MLPClassifier`
- `naive_bayes`: calls `MixedNB` (a mix of `GaussianNB` and `MultinomialNB`); only works for classification tasks (a sketch of such a mix follows this list)
- `gradient_boosting`: calls `GradientBoostingRegressor` or `GradientBoostingClassifier`; does not support distributed training
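For intuition, here is a minimal sketch of how a mixed Naive Bayes could combine `GaussianNB` (continuous columns) with `MultinomialNB` (count columns). This is not the project's actual `MixedNB` implementation, only an illustration of the idea; the class name and column split are assumptions.

```python
# Sketch only: combine GaussianNB and MultinomialNB under the naive
# independence assumption. Not the project's actual MixedNB class.
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

class SimpleMixedNB:
    def __init__(self):
        self.gnb = GaussianNB()       # continuous features
        self.mnb = MultinomialNB()    # non-negative count features

    def partial_fit(self, X_cont, X_disc, y, classes=None):
        # Both sub-models support partial_fit, so the mix does too.
        self.gnb.partial_fit(X_cont, y, classes=classes)
        self.mnb.partial_fit(X_disc, y, classes=classes)
        return self

    def predict(self, X_cont, X_disc):
        # Log-likelihoods are additive under independence; the class prior
        # appears in both posteriors, so subtract it once.
        log_prior = np.log(self.gnb.class_prior_)
        scores = (self.gnb.predict_log_proba(X_cont)
                  + self.mnb.predict_log_proba(X_disc)
                  - log_prior)
        return self.gnb.classes_[np.argmax(scores, axis=1)]
```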
It has two modes:

```sh
docker run --rm --env [list of environment variables] hbpmip/python-sgd-regression:VERSION compute --mode intermediate --job-id 12
```

This mode calls `partial_fit` on the scikit-learn estimator and saves intermediate results into the `job_results` table. If `--job-id` is specified, it will first load the estimator and continue its training; if not, it will start from scratch.
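Conceptually, one intermediate step looks like the sketch below. The `pickle` round-trip stands in for whatever serialization the image actually uses for the `job_results` table; `intermediate_step` is a hypothetical helper, not the image's API.

```python
# Minimal sketch of one "intermediate" step, assuming the estimator state is
# stored as a pickled blob; the image's real storage format may differ.
import pickle

from sklearn.linear_model import SGDRegressor

def intermediate_step(X, y, previous_blob=None):
    if previous_blob is not None:
        estimator = pickle.loads(previous_blob)  # --job-id given: resume training
    else:
        estimator = SGDRegressor()               # no --job-id: start from scratch
    estimator.partial_fit(X, y)                  # one pass over this node's data
    return pickle.dumps(estimator)               # intermediate result to persist
```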
```sh
docker run --rm --env [list of environment variables] hbpmip/python-sgd-regression:VERSION compute --mode aggregate --job-id 13
```

This mode additionally converts the estimator into PFA. If you have only one node, calling `naive_bayes` with `compute --mode aggregate` is equivalent to running Naive Bayes in a non-distributed way.
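The single-node equivalence can be checked directly with plain scikit-learn: a single `partial_fit` over the whole dataset produces the same Gaussian Naive Bayes parameters as an ordinary `fit`. The snippet below is an illustration using a public dataset, not part of the image.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# "Distributed" path collapsed to one node: a single partial_fit over all data...
incremental = GaussianNB().partial_fit(X, y, classes=np.unique(y))
# ...matches an ordinary non-distributed fit.
batch = GaussianNB().fit(X, y)

assert np.allclose(incremental.theta_, batch.theta_)  # identical per-class means
```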
Environment variables are:
- NODE: name of the node (machine) used for execution
- JOB_ID: ID of the job.
- IN_JDBC_DRIVER: org.postgresql.Driver
- IN_JDBC_URL: URL to the input database, e.g. jdbc:postgresql://db:5432/features
- IN_JDBC_USER: User for the input database
- IN_JDBC_PASSWORD: Password for the input database
- OUT_JDBC_DRIVER: org.postgresql.Driver
- OUT_JDBC_URL: URL to the output database, e.g. jdbc:postgresql://db:5432/woken
- OUT_JDBC_USER: User for the output database
- OUT_JDBC_PASSWORD: Password for the output database
- PARAM_variables: Name of the target variable (only one target variable is supported)
- PARAM_covariables: List of covariables
- PARAM_query: Query selecting the variables and covariables to feed into the algorithm for training.
- MODEL_PARAM_type: Type of model to use; can be `linear_model`, `neural_network` or `naive_bayes`
- MODEL_PARAM_[sklearn_parameter]: Additional environment variables passing scikit-learn model parameters (e.g. `MODEL_PARAM_alpha` for Naive Bayes or `MODEL_PARAM_learning_rate` for SGDRegressor); see the parsing sketch after this list
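The `MODEL_PARAM_*` convention maps environment variables to estimator keyword arguments. The function below is a hypothetical sketch of such a mapping; the image's actual parsing may differ.

```python
# Hypothetical sketch: turn MODEL_PARAM_* environment variables into
# scikit-learn keyword arguments. Not the image's actual parsing code.
import ast
import os

def model_params_from_env(prefix="MODEL_PARAM_"):
    params = {}
    for key, value in os.environ.items():
        if key.startswith(prefix) and key != prefix + "type":
            name = key[len(prefix):]
            try:
                params[name] = ast.literal_eval(value)  # "0.5" -> 0.5, "True" -> True
            except (ValueError, SyntaxError):
                params[name] = value                    # keep plain strings as-is
    return params

# With MODEL_PARAM_alpha=0.5 in the environment, a naive_bayes model could be
# built as MultinomialNB(**model_params_from_env()).
```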
Notes per method:

- `naive_bayes`: it is enough to go over all data points once (call `--mode intermediate` on all nodes).
- `linear_model` and `neural_network`: trained using Stochastic Gradient Descent; they require several passes over the training data in random order until convergence (see the sketch below).
- `gradient_boosting`: does not support distributed training; calling it once on a single node is enough.
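The multi-pass requirement for the SGD-based methods looks like this in plain scikit-learn; the synthetic data and convergence check are illustrative only.

```python
# Minimal sketch of the SGD training pattern: several shuffled passes with
# partial_fit until the coefficients stop changing. Synthetic data for illustration.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

model = SGDRegressor(random_state=0)
previous = None
for epoch in range(100):                 # several passes over the data
    order = rng.permutation(len(X))      # random order each pass
    model.partial_fit(X[order], y[order])
    if previous is not None and np.allclose(model.coef_, previous, atol=1e-4):
        break                            # coefficients stabilized: converged
    previous = model.coef_.copy()
```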
Build: run `./build.sh`

Test: run `captain test`

Publish: run `./publish.sh`
To test, first rebuild the image:

```sh
./build.sh
```

WARNING: unit tests can fail nondeterministically with `AttributeError: can't set attribute` because of a bug in the Titus port to Python 3.

Run the integration tests:

```sh
cd tests
./test.sh
```