This is a demonstration of a data engineering pipeline. Starting with an Oracle database defined in a Docker container, a dataset of grade-school essays is cleaned (using Python scripts) and inserted into the database (using SQL scripts run by a Bash script against the container). The data is then queried and organized into a form useful to an arbitrary machine learning model. To demo the functionality of this repository, follow these steps:
- Install Docker if you have not already.
- Clone the repo, and change the working directory to where the repo was cloned:

  ```sh
  git clone https://github.com/lmr97/db-to-ml-pipeline
  cd db-to-ml-pipeline
  ```

- Install the dependencies. You can use Poetry to install them in an auto-generated virtual environment with `poetry install`, or install the dependencies from the `requirements.txt` in the Python environment of your choice.
- Download the essay data from its source, rename the file to `persuade_2.0_human_scores.csv`, and then move it into the `Clean_and_org` folder (the original repo with project information can be found here). A quick sanity check for this step is sketched after these steps.
- Run `clean_and_prep.sh` to clean the data and format it into the expected tables.
- Run `buildDockerDB.sh` to build a Docker image for an Oracle server, run a container, and load the database. If you are on Linux, start a superuser shell first, since root privileges are needed to run Docker commands there. This might take a while. A hedged connectivity check is also sketched after these steps.
- Now you can use the core of the project, `DataOrganizer.py`, which can be imported into a Python script to output model-ready Pandas dataframes (see the usage sketch below).
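As a quick sanity check after downloading the data, you can load the renamed CSV with pandas and inspect it. This is a minimal sketch; it makes no assumptions about the dataset's columns beyond what pandas itself reports.

```python
# Sanity check: confirm the renamed CSV is readable from the
# Clean_and_org folder. Only generic inspection is done here,
# since the column layout depends on the source dataset.
import pandas as pd

essays = pd.read_csv("Clean_and_org/persuade_2.0_human_scores.csv")
print(essays.shape)    # (rows, columns) of the raw data
print(essays.columns)  # column names as shipped by the source
```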
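To verify that the Oracle container is accepting connections before moving on, a check along these lines should work, assuming the `python-oracledb` driver is installed. The credentials and service name below are placeholders, and the port assumes Oracle's default listener port (1521) is mapped by the container; substitute whatever `buildDockerDB.sh` actually configures.

```python
# Hedged connectivity check. The user, password, and service name
# are placeholders -- substitute the values that buildDockerDB.sh
# configures. Port 1521 is Oracle's default listener port.
import oracledb

conn = oracledb.connect(
    user="...",                # placeholder
    password="...",            # placeholder
    dsn="localhost:1521/...",  # host:port/service_name
)
print(conn.version)  # prints the server version if the database is up
conn.close()
```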
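Finally, here is a sketch of what using `DataOrganizer.py` might look like. Only the module name comes from this repo; the class and method names below are hypothetical, so check `DataOrganizer.py` for the interface it actually exposes.

```python
# Hypothetical usage sketch: the names DataOrganizer and
# get_model_ready_frames are assumptions, not the module's real
# API. Its documented job is to query the database and return
# model-ready pandas DataFrames.
from DataOrganizer import DataOrganizer  # hypothetical class name

organizer = DataOrganizer()
features, labels = organizer.get_model_ready_frames()  # hypothetical method

# The resulting DataFrames can then be handed to any model that
# accepts pandas input, e.g. model.fit(features, labels).
```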
Container cleanup has been automated with `dockerCleanup.sh`.