This project focuses on automating and containerizing the ETL (extract, transform, load) process, importing the results into our chosen database, and building dashboards and analyses on top of the automated data using Superset or Google Data Studio.
- Ubuntu 20.04 LTS (OS) on local and inside containers
- SQLite3 (Storage)
- PostgreSQL on Heroku (Backup)
- Superset & Google Data Studio (Dashboards and Reports)
- JupyterLab (Data Analysis & Engineering)
- GitHub Actions (CI/CD)
- SonarCloud (Code Quality & Code Security)
- Python 3.8+ (Programming language)
- Sublime Text (Code Editor)
You can run the project with the commands below from inside the code/ folder, after reading the requirements. Before running the code, if you want to connect to another database service provider or a local database (such as PostgreSQL/MySQL/MSSQL), set your SQL instance configuration in the following file:
code/configurations/SQL_Config.py
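The actual contents of SQL_Config.py are in the repository; as a rough illustration only, a configuration module of this kind might look like the following (every value below is a placeholder assumption, not the project's real settings):

```python
# Hypothetical sketch of code/configurations/SQL_Config.py.
# Replace the placeholder values with your own database settings.

DB_ENGINE = "postgresql"   # e.g. "postgresql", "mysql", "mssql"
DB_HOST = "localhost"
DB_PORT = 5432
DB_NAME = "etl_data"
DB_USER = "etl_user"
DB_PASSWORD = "change-me"  # read from an environment variable in real deployments


def connection_url():
    """Build a SQLAlchemy-style connection URL from the settings above."""
    return f"{DB_ENGINE}://{DB_USER}:{DB_PASSWORD}@{DB_HOST}:{DB_PORT}/{DB_NAME}"
```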
Then, after adding your configuration (or, if you want to use SQLite3 as your storage database, edit the transfer_data.py file accordingly) and installing the required packages (listed in requirements.txt), execute the script to import the data into the database.
To run the ETL process, execute the following command to download and save the data to the dataset folder; make sure you are in the right path (code/):
python3 etl_data.py
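The real etl_data.py lives in the repository; purely as an illustration of the extract/transform/load shape such a script has, a minimal sketch might look like this (the URL, column handling, and file names are assumptions, not the project's actual source):

```python
# Hypothetical sketch of an etl_data.py-style script.
import csv
import pathlib
import urllib.request

DATASET_URL = "https://example.com/job-seekers-2021.csv"  # placeholder URL
DATASET_DIR = pathlib.Path("dataset")


def extract(url):
    """Download the raw CSV text (requires network access)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def transform(rows):
    """Drop rows with missing values and strip whitespace from keys/values."""
    cleaned = []
    for row in rows:
        if all(v and v.strip() for v in row.values()):
            cleaned.append({k.strip(): v.strip() for k, v in row.items()})
    return cleaned


def load(rows, path):
    """Write the cleaned rows to the dataset folder."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

# Typical pipeline order:
#   raw = extract(DATASET_URL)
#   rows = list(csv.DictReader(raw.splitlines()))
#   load(transform(rows), DATASET_DIR / "job_seekers.csv")
```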
With the dataset in place, we are ready to import the data into our database by running the following command:
python3 transfer_data.py
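The repository's transfer_data.py is the source of truth; a minimal sketch of a CSV-to-SQLite3 import of this kind, using only the standard library (function, table, and file names below are assumptions), could look like:

```python
# Hypothetical sketch of a transfer_data.py-style import into SQLite3.
import csv
import sqlite3


def import_csv(csv_path, db_path="data.db", table="job_seekers"):
    """Create the table from the CSV header and insert every data row."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    conn = sqlite3.connect(db_path)
    cols = ", ".join(f'"{c}"' for c in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({cols})")
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    conn.commit()
    return conn
```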
Finally, you can automate the process above with a shell script that runs both files and also generates a .sql file, so that if you have another database to import into, you do not need to repeat the whole insertion process; you just import the generated .sql file into that database. Make the shell file executable by running: chmod +x esc.sh
Then run the following command from the root folder:
./esc.sh
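The esc.sh script itself is in the repository; the .sql export step it describes can be sketched in Python using sqlite3's built-in `iterdump()` (the database and output file names below are assumptions):

```python
# Hypothetical sketch: export an SQLite3 database to a portable .sql file,
# so the data can be imported into another database without re-running the ETL.
import sqlite3


def dump_to_sql(db_path, out_path):
    """Write CREATE TABLE / INSERT statements for the whole database."""
    conn = sqlite3.connect(db_path)
    with open(out_path, "w") as f:
        for statement in conn.iterdump():
            f.write(statement + "\n")
    conn.close()
```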
We automated the ETL process with a shell script, and it works, but we still need to run it manually in the terminal, which leaves half of the process unautomated. This is where CI/CD comes in: with GitHub Actions we can create custom continuous integration (CI) and continuous deployment (CD) workflows directly inside our GitHub repository.
- Go to repository > Actions > set up Python
- Copy ./github/workflows/etl-proccess.yml into your created file
  - Note: make sure you configure it the way you prefer
- If you want SonarCloud to be your code quality/security tool:
  - Create a sonar-project.properties file and put your configuration in it
  - Copy ./github/workflows/sonar-cloud.yml into your created file
  - Note: make sure you configure it the way you prefer
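The workflow files in the repository are authoritative; a minimal scheduled ETL workflow of this kind could look like the sketch below (the schedule, action versions, paths, and step names are illustrative assumptions, and note that a `*/40` cron fires at :00 and :40 of each hour rather than exactly every 40 minutes):

```yaml
# Hypothetical sketch of a scheduled ETL workflow for GitHub Actions.
name: etl-process

on:
  schedule:
    - cron: "*/40 * * * *"   # roughly every 40 minutes (cron times are UTC)
  workflow_dispatch:          # also allow manual runs

jobs:
  etl:
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: "3.8"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run ETL and import
        working-directory: code
        run: |
          python3 etl_data.py
          python3 transfer_data.py
```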
After you have configured and enabled GitHub Actions, your ETL process is CI/CD-compatible: the dataset is scheduled to be automatically downloaded and transferred to the database every 40 minutes (KSA/Riyadh time), and 5 minutes after it completes, a SonarCloud scan runs to check code security and code quality.
If you want to use containers as labs, environments, or even for developing analysis solutions (for example, predicting next week's rates and cases from the data you have automated), you can use Docker containers to run isolated environments to work in.
In this project I used a Dockerfile to build an isolated environment containing only Ubuntu 20.04 LTS, JupyterLab, and Python 3 with the required packages. However, that approach becomes painful once we need to install other tools and make them communicate, so I used docker-compose to install and configure multiple containers (5 isolated containers with different purposes) to handle our database (PostgreSQL), administration (pgAdmin), notebook (JupyterLab), dashboard (Superset), and storage (MinIO). This makes it easy to create multiple containers in an isolated environment that can still communicate and share data with each other.
In case you need to containerize your ETL & analysis infrastructure with Docker/Docker Compose as an isolated environment for analysis, testing, etc., first make sure you have Docker and Docker Compose installed; then you are ready to use the project by following these easy steps:
I put two files in the root folder for containerizing the project, called Dockerfile and docker-compose.yml. The Dockerfile handles installing the Ubuntu image and, inside it, all JupyterLab prerequisite packages along with their configuration; docker-compose.yml is used to pull multi-container applications from DockerHub images, making the installation and configuration of multiple containers much easier.
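The project's actual Dockerfile may differ in detail; a minimal Ubuntu 20.04 + JupyterLab image along the lines described above could be sketched like this (package choices and the CMD flags are assumptions):

```dockerfile
# Hypothetical sketch of a JupyterLab Dockerfile on Ubuntu 20.04 LTS.
FROM ubuntu:20.04

# Install Python 3 and pip without interactive prompts.
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Install JupyterLab plus the project's requirements.
COPY requirements.txt /tmp/requirements.txt
RUN pip3 install jupyterlab -r /tmp/requirements.txt

EXPOSE 8888
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888", "--allow-root", "--no-browser"]
```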
After making sure that Docker and docker-compose are installed, up, and running on our machine, you can test in an isolated Docker container by building the image with the following command (in the root directory):
docker build -t testing .
And you can check whether the image was built successfully:
docker images
Then, with the image built, run the following command to start it as an isolated container on Docker (mapping container port 8888 to the same host port):
docker run -d -p 8888:8888 testing
We now have an isolated JupyterLab environment running on Docker; you can access it by opening the following address in your browser:
localhost:8888
Now that we have seen how to isolate our workflow, there is an even easier way to work without having to manage the environment of each individual container: installing multiple containers at once, which is what we need here (PostgreSQL, pgAdmin, JupyterLab, Superset, and MinIO all together). In this case we use docker-compose, just by running the following command (in the root directory):
docker-compose up -d
A new process starts, pulling and configuring all the needed images as declared in the docker-compose.yml file. Once the installation finishes, you can check that all five containers are up and running with the following command:
docker-compose ps
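The repository's docker-compose.yml is authoritative; a simplified sketch of a five-service file of this shape might look like the following (image tags, ports, and credentials are illustrative placeholders, not the project's actual configuration):

```yaml
# Hypothetical sketch of a five-service docker-compose.yml.
version: "3"

services:
  db:                      # Database
    image: postgres:13
    environment:
      POSTGRES_USER: etl_user
      POSTGRES_PASSWORD: change-me
    ports: ["5432:5432"]

  pgadmin:                 # Administration
    image: dpage/pgadmin4
    environment:
      PGADMIN_DEFAULT_EMAIL: admin@example.com
      PGADMIN_DEFAULT_PASSWORD: change-me
    ports: ["5050:80"]
    depends_on: [db]

  notebook:                # Data analysis
    image: jupyter/base-notebook
    ports: ["8888:8888"]

  superset:                # Dashboards
    image: apache/superset
    ports: ["8088:8088"]

  minio:                   # Object storage
    image: minio/minio
    command: server /data
    ports: ["9000:9000"]
```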
Name | Description | Link |
---|---|---|
Job Seekers | Dashboard of Saudi Arabia Job Seekers for 2021 | Dashboard |
Data Source | Data of Saudi Arabia Job Seekers for 2021 | Dataset |