ETL and ELT data pipelines offer simplicity and automation for organizations that use cloud storage, where data scales exponentially. Data pipelines ensure that data is readily available for analysts and for modelling, and automating them enables real-time tracking of a business's important Key Performance Indicators. Exponentially growing data can also bring storage constraints, a challenge that embedded databases solve efficiently.
- Develop and deploy an efficient ETL Data Pipeline that extracts data from an AWS S3 bucket using the data load tool (dlt), transforms it, and finally loads it into a DuckDB embedded database.
- Prepare data for analysis using queries which can be found here. The queries seek to find:
  - Average age of all shoppers across various shopping malls.
  - Average expenditure on each category of items by gender.
  - Most profitable shopping malls by year and month.
  - Shopping patterns across various malls.
The extracted data from the queries can be found here.
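As an illustration, one of these questions (average shopper age per mall) can be answered directly against the DuckDB file produced by the pipeline. This is a minimal sketch; the file path and the table and column names (`customer_shopping_data`, `age`, `shopping_mall`) are assumptions and may differ from the actual schema.

```python
import duckdb

# Connect to the DuckDB file created by the pipeline (path is an assumption).
con = duckdb.connect("filesystem.duckdb")

# Average age of shoppers per shopping mall (table/column names are assumed).
avg_age = con.sql(
    """
    SELECT shopping_mall, ROUND(AVG(age), 1) AS avg_age
    FROM customer_shopping_data
    GROUP BY shopping_mall
    ORDER BY avg_age DESC
    """
).df()

print(avg_age)
con.close()
```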
The Data Load Tool (dlt) is an open-source library that offers efficient data pipelines capable of consuming data from various sources, including Google Sheets, cloud storage services, APIs and scraping scripts. The tool comes in handy for e-commerce organizations as it can track streaming data in real time, and it offers scalability and customization depending on the needs at hand.
Data for this project was ingested from an AWS S3 bucket. To set up the credentials that allow third-party applications to access your data, the dlt documentation provides a clear, step-by-step process; one way of supplying them is sketched below.
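As a rough illustration (and not the only option), dlt can pick up S3 credentials from environment variables that mirror the `secrets.toml` layout. Treat the exact key names below as assumptions and verify them against the dlt documentation for your version.

```python
import os

# Assumed environment-variable names mirroring dlt's secrets.toml layout for the
# filesystem source; verify the exact keys in the dlt documentation.
os.environ["SOURCES__FILESYSTEM__CREDENTIALS__AWS_ACCESS_KEY_ID"] = "<your-access-key-id>"
os.environ["SOURCES__FILESYSTEM__CREDENTIALS__AWS_SECRET_ACCESS_KEY"] = "<your-secret-access-key>"
os.environ["SOURCES__FILESYSTEM__BUCKET_URL"] = "s3://<your-bucket-name>"
```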
After setting everything up, run the following commands to get started with dlt:
- `pip install -r requirements.txt`
- `dlt init filesystem duckdb`
- To run your pipeline, run `python filesystem.py` in the terminal. This creates a DuckDB file in the working directory containing the data loaded from the source; a sketch of what such a pipeline script might look like follows this list.
- To interact with the data, run `dlt pipeline filesystem show`, which opens a Streamlit app in the local browser.
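For context, a pipeline script produced by `dlt init filesystem duckdb` roughly follows the pattern below. This is a hedged sketch rather than the exact generated file: the import path of the filesystem source, the bucket URL, the file glob and the table name are assumptions and may vary by dlt version.

```python
import dlt
from dlt.sources.filesystem import filesystem, read_csv  # import path may vary by dlt version


def load_shopping_data() -> None:
    # Read CSV files from the S3 bucket (bucket URL and glob are placeholders).
    files = filesystem(bucket_url="s3://<your-bucket-name>", file_glob="*.csv")
    shopping_data = (files | read_csv()).with_name("customer_shopping_data")

    # Load the data into an embedded DuckDB database in the working directory.
    pipeline = dlt.pipeline(
        pipeline_name="filesystem",
        destination="duckdb",
        dataset_name="shopping_data",
    )
    load_info = pipeline.run(shopping_data)
    print(load_info)


if __name__ == "__main__":
    load_shopping_data()
```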
For a more extensive overview, the dlt data sources documentation provides more information on how to connect to several data sources.
Data transformation is necessary because the data must be error free so that the data team can extract meaningful insights faster than with traditional formats. Transformation is done using the Data Build Tool (dbt), an open-source platform that offers great flexibility for transforming data within data pipelines.
To open the Streamlit dashboard, get a sneak peek of the data and run simple queries, run `dlt pipeline filesystem show`.
dbt leverages Jinja templating to simplify the process of writing SQL queries for various data objectives. It encourages a DRY (Don't Repeat Yourself) approach, which allows SQL models and queries to be reused and built on top of one another.
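Since the loading step of this project runs in Python, one way to trigger the dbt transformations from the same script is dlt's dbt runner helper. The sketch below assumes a dbt project living in a local `dbt_transform/` folder; check the exact helper names and result fields against your dlt version.

```python
import dlt

# Reuse the pipeline that loaded the raw data into DuckDB.
pipeline = dlt.pipeline(
    pipeline_name="filesystem",
    destination="duckdb",
    dataset_name="shopping_data",
)

# Point dlt's dbt runner at the dbt project (folder name is an assumption).
dbt = dlt.dbt.package(pipeline, "dbt_transform")
models = dbt.run_all()

for model in models:
    print(f"{model.model_name}: {model.status} in {model.time}")
```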
DuckDB is an embedded database that handles large datasets efficiently and can query them on the fly without necessarily loading them into a traditional RDBMS. It also offers compression of large files, which comes in handy when storage is constrained. It can be connected to visualization and dashboarding tools after installing the required drivers and connections.
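To illustrate both points, the snippet below queries a CSV file directly (no prior load step) and rewrites it as a compressed Parquet file. The file names are placeholders.

```python
import duckdb

# Query a CSV file on the fly, without loading it into a database first
# (file name is a placeholder).
preview = duckdb.sql("SELECT * FROM 'customer_shopping_data.csv' LIMIT 5").df()
print(preview)

# Rewrite the same file as a compressed Parquet file to save storage space.
duckdb.sql(
    """
    COPY (SELECT * FROM 'customer_shopping_data.csv')
    TO 'customer_shopping_data.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)
    """
)
```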
To create an interactive end-to-end project, a Streamlit dashboard provides an efficient user interaction and experience because of its simple, navigable page. The dashboard has filters for:

- Shopping mall
- Year
- Month
- Product category
- Gender
- Payment method
With each filter applied, the visuals adjust accordingly. The application can be found [here](https://dtl-dbt-pipeline-gyz2smtjzi8bdoqqfptsi8.streamlit.app/).
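As a rough sketch of how such a filter-driven dashboard can be wired up (the DuckDB file path, table name and column names are assumptions), a sidebar filter can drive the visuals like this:

```python
import duckdb
import streamlit as st

# Load the transformed data from the DuckDB file (path and table name are assumptions).
con = duckdb.connect("filesystem.duckdb", read_only=True)
df = con.sql("SELECT * FROM customer_shopping_data").df()

# Sidebar filter for shopping mall; the visuals below react to the selection.
mall = st.sidebar.selectbox("Shopping mall", sorted(df["shopping_mall"].unique()))
filtered = df[df["shopping_mall"] == mall]

st.title(f"Sales overview: {mall}")
st.metric("Average shopper age", round(filtered["age"].mean(), 1))
st.bar_chart(filtered.groupby("category")["price"].sum())
```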
A further improvement would be to integrate an authorization and login page to ensure different users can access different content and analyses within the application.