This project was dedicated to exploring distributed computing and the tools commonly used for it. As students of the South Westphalia University of Applied Sciences, we took the DARIMA models as an example, reproducing the code and rewriting it according to our own understanding.
Therefore, we would like to emphasize:
- We do not claim authorship of any part of this repository, since it was inspired by https://github.com/xqnwang/darima.
- We do not claim any rights to the code, since it was written for educational purposes only.
- Apache Spark >= 3.4.1
- R Version >= 4.2.3
- Python Version >= 3.11
- R
install.packages("polynom")
install.packages("forecast")
- Python
pip install -r requirements.txt
On a single local master node:
python ./darima.py
On a running (remote) master node:
Assuming that the $SPARK_HOME environment variable points to your local Spark installation folder, the model can be run from the project's root directory with the following terminal command:
$SPARK_HOME/bin/spark-submit \
--master local[*] \
--py-files packages.zip \
--files configs/darima_config.json \
./darima.py
Briefly, the options supplied serve the following purposes:
- --master local[*] - the address of the Spark cluster to start the model on. If you have a Spark cluster in operation (either in single-executor mode locally, or something larger in the cloud) and want to send the job there, then modify this with the appropriate Spark master URL, e.g. spark://the-clusters-ip-address:7077;
- --files configs/darima_config.json - the (optional) path to any config file that may be required by the DARIMA model;
- --py-files packages.zip - archive containing the Python dependencies (modules) referenced by the model;
- ./darima.py - the Python file containing the DARIMA model to execute.
Full details of all possible options can be found in the official Spark documentation. Note that we have left some options to be defined within the model (which is itself a Spark application): for example, spark.cores.max and spark.executor.memory are set in the Python script, as we feel the job should explicitly request the cluster resources it requires.
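For illustration, such settings can be declared when the SparkSession is created inside the application. This is only a sketch; the app name and the concrete values below are placeholders, not the settings actually used in darima.py.

from pyspark.sql import SparkSession

# Sketch only: request cluster resources from inside the application.
# The values "2" and "2g" are placeholders, not the project's real settings.
spark = (
    SparkSession.builder
    .appName("darima")
    .config("spark.cores.max", "2")
    .config("spark.executor.memory", "2g")
    .getOrCreate()
)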
The basic project structure is as follows:
root/
|-- R/
| |-- auto_arima.R
|-- configs/
| |-- darima_config.json
|-- data/
| |-- *.csv
| |-- *.parquet
|-- docs/
| |-- Makefile
| |-- conf.py
|-- py_spark/
| |-- logging.py
| |-- spark.py
|-- py_handlers/
| |-- converters.py
| |-- utils.py
| darima.py
| build_dependencies.sh
| packages.zip
| requirements.txt
| r_requirements.txt
The main Python module containing the DARIMA model (which will be sent to the Spark cluster) is ./darima.py. Any external configuration parameters required by darima.py are stored in JSON format in configs/darima_config.json. Additional modules that support the model are kept in the py_spark/ and py_handlers/ folders. In the project's root we include build_dependencies.sh, a bash script that bundles these dependencies into a zip file (packages.zip) to be sent to the cluster.
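As a rough sketch of how a file shipped with --files can be picked up inside the application, pyspark.SparkFiles resolves it by name; the helper and the local fallback path below are illustrative assumptions, not the actual code in darima.py.

import json
import os

from pyspark import SparkFiles

def load_config(filename="darima_config.json"):
    # Sketch only: assumes a SparkSession/SparkContext is already running.
    # Prefer the copy distributed via --files, otherwise fall back to the
    # local configs/ directory for purely local runs.
    shipped = SparkFiles.get(filename)
    path = shipped if os.path.isfile(shipped) else os.path.join("configs", filename)
    with open(path) as f:
        return json.load(f)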