
MALL SHOPPING DATA ANALYSIS DATA PIPELINE

Built with: dbt · AWS S3 · Streamlit · DuckDB · SQL · Docker

Project Overview

ETL and ELT data pipelines offer simplicity and automation for organizations that use cloud storage, where data scales exponentially. Data pipelines ensure that data is readily available for analysts and for modelling, and automating them enables real-time tracking of a business's key performance indicators. Exponentially growing data can also bring storage constraints, which embedded databases solve efficiently.

Objectives

  1. Develop and deploy an efficient ELT data pipeline that extracts data from an AWS S3 bucket using the data load tool (dlt), loads it into an embedded DuckDB database, and finally transforms it with dbt.

  2. Prepare data for analysis using queries, which can be found here. The queries seek to find:

  • Average age of all shoppers across various shopping malls.

  • Average expenditure of each category of items by gender.

  • Most profitable shopping malls by year and month.

  • Shopping patterns across various malls.

The data extracted by these queries can be found here.
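As an illustration, the first of these questions can be answered directly against the embedded database. The sketch below is hypothetical: the DuckDB file, table, and column names are assumptions, not the repository's actual schema.

```python
import duckdb

# Connect to the DuckDB file produced by the pipeline
# (file, table, and column names here are placeholders).
con = duckdb.connect("filesystem.duckdb")

# Average shopper age per mall, highest first.
avg_age = con.sql(
    """
    SELECT shopping_mall, AVG(age) AS avg_age
    FROM mall_sales
    GROUP BY shopping_mall
    ORDER BY avg_age DESC
    """
).df()
print(avg_age)
```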

Data Extraction

The data load tool (dlt) is an open-source library that offers very efficient data pipelines, consuming data from various sources including Google Sheets, cloud storage services, APIs, and scraping scripts. The tool comes in handy for e-commerce organizations because it can track streaming data in real time, and it offers scalability and customization depending on the needs at hand.

Data for this project was ingested from an AWS S3 bucket. To set up the credentials that allow third-party applications to access your data, the dlt documentation provides a clear step-by-step guide.

After setting up everything, run the following commands in order to get started with dlt:

  1. pip install -r requirements.txt

  2. dlt init filesystem duckdb

  3. To run your pipeline, run python filesystem.py in the terminal. This creates a DuckDB file in the working directory that contains the loaded data (a minimal sketch of such a script follows this list).

  4. To interact with the data, run dlt pipeline filesystem show, which opens a Streamlit app in the local browser.
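For reference, a filesystem.py script along these lines would do the job. This is a minimal sketch, assuming a recent dlt release that ships the built-in filesystem source; the bucket URL and table name are placeholders.

```python
import dlt
from dlt.sources.filesystem import filesystem, read_csv

# List CSV files in the bucket and parse them row by row.
# Credentials are read from .dlt/secrets.toml as per the dlt docs.
files = filesystem(bucket_url="s3://<your-bucket>", file_glob="*.csv")

pipeline = dlt.pipeline(
    pipeline_name="filesystem",
    destination="duckdb",
    dataset_name="mall_data",
)

# Running the pipeline creates filesystem.duckdb in the working directory.
info = pipeline.run((files | read_csv()), table_name="mall_sales")
print(info)
```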

For a more extensive approach, the dlt data sources documentation provides more information on how to connect to several other data sources.

Data Transformation

Transformation of data is necessary because it needs to be error-free for the data team to extract meaningful insights faster than with old, traditional formats. Transformation is done using the data build tool (dbt), an open-source platform that offers great flexibility for transforming data within data pipelines.

To open the Streamlit dashboard for a sneak peek at the data and to run simple queries, run dlt pipeline filesystem show.

The tool leverages Jinja to simplify the process of writing SQL queries that handle the various data objectives. It implements the DRY (Don't Repeat Yourself) approach, which allows SQL tables and queries to be inherited across models, as sketched below.
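dbt models are usually Jinja-templated SQL, but the dbt-duckdb adapter also supports models written in Python, which keeps this README's examples in one language. The sketch below is hypothetical (model, table, and column names are assumptions); the point is that dbt.ref() resolves dependencies the same way Jinja's {{ ref() }} does, so downstream models reuse upstream ones instead of repeating their SQL.

```python
# models/avg_spend_by_gender.py -- hypothetical dbt Python model.

def model(dbt, session):
    # Pull the upstream staging model via ref(); dbt builds the
    # dependency graph from this call, just as with Jinja's ref().
    sales = dbt.ref("stg_mall_sales").df()

    # Average expenditure per item category, split by gender.
    return (
        sales.groupby(["category", "gender"], as_index=False)["price"]
        .mean()
        .rename(columns={"price": "avg_spend"})
    )
```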

Data Loading

DuckDB is an embedded database that handles big datasets with strong computational performance and can create tables on the fly without necessarily loading them into an RDBMS. It also offers compression of big files, therefore coming in handy in the event of storage constraints.
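For instance, DuckDB can query a raw file in place and persist it as a compressed table with no database server involved; the file and table names in this sketch are placeholders.

```python
import duckdb

# Open (or create) an embedded database file; no server required.
con = duckdb.connect("mall.duckdb")

# Create a table straight from a CSV file on the fly.
con.sql(
    "CREATE TABLE IF NOT EXISTS sales AS "
    "SELECT * FROM read_csv_auto('customer_shopping_data.csv')"
)
print(con.sql("SELECT COUNT(*) FROM sales").fetchone())
```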

It can also be connected to visualization and dashboarding tools after the required drivers and connections are installed.

Streamlit dashboard

To create an interactive end-to-end project, a Streamlit dashboard provides an efficient user interface and experience thanks to its simple, navigable page. The dashboard has filters for:

  1. Shopping mall

  2. Year

  3. Month

  4. Product category

  5. Gender & Payment method

With each filter, the visuals adjust accordingly. The application can be found [here](https://dtl-dbt-pipeline-gyz2smtjzi8bdoqqfptsi8.streamlit.app/).
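A dashboard with these filters can be built in a few lines of Streamlit. The sketch below is a minimal, hypothetical version (database file, table, and column names are assumptions); run it with streamlit run app.py.

```python
import duckdb
import streamlit as st

# Load the transformed data from the embedded database
# (file, table, and column names are placeholders).
con = duckdb.connect("filesystem.duckdb", read_only=True)
sales = con.sql("SELECT * FROM mall_sales").df()

# Sidebar filters mirroring the list above.
mall = st.sidebar.selectbox("Shopping mall", sorted(sales["shopping_mall"].unique()))
year = st.sidebar.selectbox("Year", sorted(sales["year"].unique()))

# Visuals re-render whenever a filter changes.
filtered = sales[(sales["shopping_mall"] == mall) & (sales["year"] == year)]
st.bar_chart(filtered.groupby("category")["price"].sum())
```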

Future steps

Integrate authorization and a login page to ensure that different users can access different content and analyses within the application.
