A demo pipeline for an arbitrary machine learning model.

lmr97/db-to-ml-pipeline

Database to ML Pipeline

This is a demonstration of a data engineering pipeline. It defines an Oracle database in a Docker container, cleans a dataset of grade-school essays (using Python scripts), and inserts the data into the database (using SQL scripts run by a Bash script against the container). The data is then queried and organized into a form usable by an arbitrary machine learning model. To demo the functionality of this repository, follow these steps:

  1. Install Docker if you have not already.

  2. Clone the repo, and change the working directory to where the repo was cloned:

     git clone https://github.com/lmr97/db-to-ml-pipeline
     cd db-to-ml-pipeline

  3. Install the dependencies. You can use Poetry to install them in an auto-generated virtual environment with poetry install, or install the dependencies from the requirements.txt in the Python environment of your choice.

  4. Download the essay data from its source, rename the file to persuade_2.0_human_scores.csv, and then move it into the Clean_and_org folder (the original repo with project information can be found here).

  5. Run clean_and_prep.sh to clean the data and format it into the expected tables.

  6. Run buildDockerDB.sh to build a Docker image for an Oracle server, run a container, and load the database. On Linux, start a superuser shell first, since root privileges are needed to run Docker commands. This might take a while.

  7. Now you can use the core of the project, DataOrganizer.py, which can be imported into a Python script to output model-ready Pandas dataframes.
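
As a rough illustration of what the cleaning and organizing stages amount to, the sketch below cleans a tiny in-memory table and splits it into train and test dataframes. It does not use the database or the actual interface of DataOrganizer.py: the column names (essay_id, full_text, score), the sample rows, and the functions clean_essays and to_model_frames are all hypothetical, chosen only for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded CSV; the real file's columns may differ.
RAW_CSV = """essay_id,full_text,score
A1,"  Schools should start later.  ",4
A2,,3
A1,"  Schools should start later.  ",4
A3,"Recess   matters for focus.",5
"""


def clean_essays(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop rows without text, remove duplicate essays, and normalize whitespace."""
    cleaned = (
        raw.dropna(subset=["full_text"])
        .drop_duplicates(subset=["essay_id"])
        .copy()
    )
    # Trim the ends, then collapse internal runs of whitespace to one space.
    cleaned["full_text"] = (
        cleaned["full_text"].str.strip().str.replace(r"\s+", " ", regex=True)
    )
    return cleaned.reset_index(drop=True)


def to_model_frames(essays: pd.DataFrame, test_fraction: float = 0.5, seed: int = 0):
    """Shuffle the essays and split them into (train, test) dataframes."""
    shuffled = essays.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_test = int(len(shuffled) * test_fraction)
    test = shuffled.iloc[:n_test].reset_index(drop=True)
    train = shuffled.iloc[n_test:].reset_index(drop=True)
    return train, test


raw = pd.read_csv(io.StringIO(RAW_CSV))
essays = clean_essays(raw)
train_df, test_df = to_model_frames(essays)
```

In the actual pipeline the rows would come from queries against the Oracle container rather than from an in-memory string, so treat this only as a picture of the shape of the output: separate dataframes that a model can consume directly.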

The container cleanup has been automated with dockerCleanup.sh.
