This is a demonstration of a data engineering pipeline. Starting with an Oracle database defined in a Docker container, a dataset of grade-school essays is cleaned (using Python scripts) and inserted into the database (using SQL scripts run by a Bash script against the container). The data is then queried and organized into a form useful to an arbitrary machine learning model. To demo the functionality of this repository, follow these steps:
- Install Docker if you have not already.
- Clone the repo, and change the working directory to where the repo was cloned:

  ```sh
  git clone https://github.com/lmr97/db-to-ml-pipeline
  cd db-to-ml-pipeline
  ```

- Install the dependencies. You can use Poetry to install them in an auto-generated virtual environment with `poetry install`, or install the dependencies from the `requirements.txt` in the Python environment of your choice.
- Download the essay data from its source, rename the file to `persuade_2.0_human_scores.csv`, and then move it into the `Clean_and_org` folder (the original repo with project information can be found here). A quick sanity check for this step is sketched after these steps.
- Run `clean_and_prep.sh` to clean the data and format it into the expected tables.
- Run `buildDockerDB.sh` to build a Docker image for an Oracle server, run a container, and load the database. If you are on Linux, start a superuser shell first, since root privileges are needed to run Docker commands there. This might take a while. A hedged connectivity check is also sketched after these steps.
- Now you can use the core of the project, `DataOrganizer.py`, which can be imported into a Python script to output model-ready Pandas dataframes (see the usage sketch below).
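As a quick sanity check after downloading the data, you can load the renamed CSV with pandas and inspect it. This is a minimal sketch; it makes no assumptions about the dataset's columns beyond what pandas itself reports.

```python
# Sanity check: confirm the renamed CSV is readable from the
# Clean_and_org folder. Only generic inspection is done here,
# since the column layout depends on the source dataset.
import pandas as pd

essays = pd.read_csv("Clean_and_org/persuade_2.0_human_scores.csv")
print(essays.shape)    # (rows, columns) of the raw data
print(essays.columns)  # column names as shipped by the source
```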
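To verify that the Oracle container is accepting connections before moving on, a check along these lines should work, assuming the `python-oracledb` driver is installed. The credentials and service name below are placeholders, and the port assumes Oracle's default listener port (1521) is mapped by the container; substitute whatever `buildDockerDB.sh` actually configures.

```python
# Hedged connectivity check. The user, password, and service name
# are placeholders -- substitute the values that buildDockerDB.sh
# configures. Port 1521 is Oracle's default listener port.
import oracledb

conn = oracledb.connect(
    user="...",                # placeholder
    password="...",            # placeholder
    dsn="localhost:1521/...",  # host:port/service_name
)
print(conn.version)  # prints the server version if the database is up
conn.close()
```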
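Finally, here is a sketch of what using `DataOrganizer.py` might look like. Only the module name comes from this repo; the class and method names below are hypothetical, so check `DataOrganizer.py` for the interface it actually exposes.

```python
# Hypothetical usage sketch: the names DataOrganizer and
# get_model_ready_frames are assumptions, not the module's real
# API. Its documented job is to query the database and return
# model-ready pandas DataFrames.
from DataOrganizer import DataOrganizer  # hypothetical class name

organizer = DataOrganizer()
features, labels = organizer.get_model_ready_frames()  # hypothetical method

# The resulting DataFrames can then be handed to any model that
# accepts pandas input, e.g. model.fit(features, labels).
```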
Container cleanup has been automated with `dockerCleanup.sh`.