This repo contains the code and pipelines explained in the book *Data Engineering with Python*.
| Software required | OS used |
|---|---|
| Python 3.x, Spark 3.x, NiFi 1.x, MySQL 8.0.x, Elasticsearch 7.x, Kibana 7.x, Apache Kafka 2.x | Linux (any distro) |
- airflow-dag: the Airflow DAG modules used in this repo
- great_expectations: the components of a local Great Expectations deployment
- kafka-producer-consumer: Python modules that produce to and consume from Kafka topics
- load-database: modules that load and query data in MySQL
- load-nosql: modules that load and query data in Elasticsearch
- nifi-datalake: NiFi pipelines that simulate reading data from a data lake
- nifi-files: files derived from the NiFi template pipelines
- nifi-scanfiles: dictionary files read by the ScanContent processor (e.g. VIP)
- nifi-scripts: shell scripts used with ExecuteStreamCommand in NiFi
- nifi-templates: assorted Apache NiFi pipeline templates
- nifi-versioning: NiFi pipelines under version control (NiFi Registry)
- pyspark: Jupyter notebooks that process data with PySpark
- scooter-data: the scooter dataset and pandas data-wrangling modules
- sql-user: the SQL statement that creates a MySQL user and grants its privileges
- writing-reading-data: modules that create and read fake data
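As an illustration of the kind of module found in writing-reading-data, here is a minimal sketch (the field names and functions below are assumptions for illustration, not the repo's actual code) that generates fake records and round-trips them through a CSV file:

```python
import csv
import random

# Illustrative pool of fake names; the book's code uses a data-faking library.
NAMES = ["Alice", "Bob", "Carol", "Dave"]

def make_fake_records(n, seed=42):
    """Generate n fake (name, age, city) records with a fixed seed."""
    rng = random.Random(seed)
    return [
        {"name": rng.choice(NAMES), "age": rng.randint(18, 80), "city": "Austin"}
        for _ in range(n)
    ]

def write_records(path, records):
    """Write the records to a CSV file with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "age", "city"])
        writer.writeheader()
        writer.writerows(records)

def read_records(path):
    """Read the records back as a list of dicts (all values are strings)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))
```

Seeding the generator keeps the fake data reproducible between runs, which makes the downstream pipelines easier to test.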
To set up the working environment, run:

```shell
$ source start-working-environment
```
To stop/kill the working environment, run:

```shell
$ ./stop-working-environment
```
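The start/stop scripts themselves are not reproduced here, but a script like start-working-environment typically brings up each backing service in turn. A hypothetical dry-run sketch (the service names are assumptions; a real script would invoke `systemctl start` or each vendor's own start script instead of echoing):

```shell
#!/bin/bash
# Hypothetical sketch of start-working-environment: loop over the
# services the pipelines depend on and start each one. This dry-run
# version only echoes what it would do.
SERVICES="mysql elasticsearch kibana nifi kafka"

start_all() {
  for svc in $SERVICES; do
    # Real script: sudo systemctl start "$svc" (or the vendor start script)
    echo "starting $svc"
  done
}

start_all
```

A matching stop script would iterate over the same list in reverse with the corresponding stop commands.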
To create the MySQL user, run the following as the root user:

```shell
$ mysql -u root -p -e "SOURCE sql-user/create-user.sql"
```
This will also grant access to the databases used in this repo.
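The contents of sql-user/create-user.sql might look roughly like the following (the user name, password, and database name below are placeholders, not the repo's actual values):

```sql
-- Hypothetical sketch: create the user, grant it access to the
-- databases used by the pipelines, and reload the grant tables.
CREATE USER 'dataeng'@'localhost' IDENTIFIED BY 'changeme';
GRANT ALL PRIVILEGES ON mydb.* TO 'dataeng'@'localhost';
FLUSH PRIVILEGES;
```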