A fully automated pipeline that manages the process of extracting raw data from source systems, ingesting it into a data lake (Google Cloud Storage), preparing it, moving it into a data warehouse (BigQuery), and transforming it to create aggregation layers and an RFM analysis that segments the customers.

The pipeline transforms the raw transactional data, which arrives daily from a Postgres database, into aggregated and analytical layers that can be used to segment customers, analyse product success, and target customers. A sketch of the RFM scoring logic appears after the links below.
For more information about the data source, check the Data Source README.
For more information about the final models and the production table structure, check the final tables README.
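RFM (Recency, Frequency, Monetary) analysis scores each customer on how recently they purchased, how often, and how much they spent. The actual segmentation in this project happens in the BigQuery transformation layer, but as a rough illustration, here is a minimal pandas sketch of the scoring logic; the column names (`customer_id`, `order_date`, `amount`) are hypothetical and may differ from the real schema:

```python
import pandas as pd

def rfm_scores(orders: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Score customers 1-4 on Recency, Frequency and Monetary value.

    Assumes `orders` has columns: customer_id, order_date, amount.
    """
    rfm = orders.groupby("customer_id").agg(
        recency=("order_date", lambda d: (as_of - d.max()).days),
        frequency=("order_date", "count"),
        monetary=("amount", "sum"),
    )
    # Quartile-based scores: low recency is good, high frequency/monetary is good.
    # rank(method="first") breaks ties so qcut always finds four distinct bins.
    rfm["r_score"] = pd.qcut(rfm["recency"].rank(method="first"), 4, labels=[4, 3, 2, 1]).astype(int)
    rfm["f_score"] = pd.qcut(rfm["frequency"].rank(method="first"), 4, labels=[1, 2, 3, 4]).astype(int)
    rfm["m_score"] = pd.qcut(rfm["monetary"].rank(method="first"), 4, labels=[1, 2, 3, 4]).astype(int)
    # Concatenated scores like "444" (best) down to "111" define the segments.
    rfm["rfm_segment"] = (
        rfm["r_score"].astype(str) + rfm["f_score"].astype(str) + rfm["m_score"].astype(str)
    )
    return rfm
```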
- Cloud: Google Cloud
- Infrastructure: Terraform
- Orchestration: Prefect
- Data lake: Google Cloud Storage
- Data warehouse: BigQuery
- Data visualization: Google Looker Studio
Visit the Dashboard-Link
- Set up your Google Cloud environment
- Create a Google Cloud Platform project
- Configure Identity and Access Management (IAM) for the service account, granting it the following roles: BigQuery Admin, Storage Admin and Storage Object Admin
- Download the service account's JSON credentials and save the file, e.g. to `~/.gc/<credentials>`
- Install the Google Cloud SDK
- Set the environment variable to point to your GCP key, authenticate the service account, and refresh the session token:
```bash
export GOOGLE_APPLICATION_CREDENTIALS=<path_to_your_credentials>.json
gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
gcloud auth application-default login
```
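To confirm the credentials are being picked up correctly, a quick sanity check is to list the project's buckets from Python (assuming the `google-cloud-storage` package is installed):

```python
from google.cloud import storage

# Authenticates via the GOOGLE_APPLICATION_CREDENTIALS variable set above.
client = storage.Client()
print([bucket.name for bucket in client.list_buckets()])
```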
- Set up your infrastructure
- Assuming you are using Linux (AMD64), run the following commands to install Terraform. If you are using a different OS, choose the correct version here and exchange the download link and zip file name accordingly.
```bash
sudo apt-get install unzip
cd ~/bin
wget https://releases.hashicorp.com/terraform/1.4.1/terraform_1.4.1_linux_amd64.zip
unzip terraform_1.4.1_linux_amd64.zip
rm terraform_1.4.1_linux_amd64.zip
```
- To initialize, plan and apply the infrastructure, adjust and run the following Terraform commands:
```bash
cd terraform/
terraform init
terraform plan -var="project=<your-gcp-project-id>"
terraform apply -var="project=<your-gcp-project-id>"
```
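Once the apply finishes, you can verify from Python that the provisioned resources exist. The bucket name below is a placeholder for whatever your Terraform variables define, and whether Terraform or the flow creates the dataset may differ in your setup:

```python
from google.cloud import bigquery, storage

PROJECT = "<your-gcp-project-id>"
BUCKET = "<your-data-lake-bucket>"  # placeholder: the bucket from your Terraform variables
DATASET = "online_store_data"       # the dataset the final tables land in

storage.Client(project=PROJECT).get_bucket(BUCKET)     # raises NotFound if the bucket is missing
bigquery.Client(project=PROJECT).get_dataset(DATASET)  # raises NotFound if the dataset is missing
print("Bucket and dataset are in place.")
```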
- Go to the fake-data-generator directory and run the following two commands (a sketch of what the generator does follows this list):
  - Run `make run_img` to create and run the Postgres Docker image (the source system).
  - Run `make run` to generate the data and insert it into Postgres.
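The Makefile targets wrap the generator, whose actual implementation lives in the fake-data-generator directory. As a rough illustration only, a minimal generator might look like the sketch below; the use of Faker and psycopg2, the connection details, and the table schema are all assumptions:

```python
import psycopg2
from faker import Faker

fake = Faker()

# Connection details are placeholders; the real ones come from the Docker image setup.
conn = psycopg2.connect(
    host="localhost", port=5432, dbname="store", user="postgres", password="postgres"
)

with conn, conn.cursor() as cur:
    # Hypothetical transactions table standing in for the real source schema.
    cur.execute(
        """CREATE TABLE IF NOT EXISTS transactions (
               id SERIAL PRIMARY KEY,
               customer_name TEXT,
               amount NUMERIC,
               created_at TIMESTAMP
           )"""
    )
    for _ in range(1000):
        cur.execute(
            "INSERT INTO transactions (customer_name, amount, created_at) VALUES (%s, %s, %s)",
            (
                fake.name(),
                fake.pydecimal(left_digits=3, right_digits=2, positive=True),
                fake.date_time_this_year(),
            ),
        )
conn.close()
```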
- Set up your orchestration
- Go to the prefect directory
- To set up the Python virtual environment and install all dependencies, run:

```bash
make venv
```
- Check the prefect README to set up the blocks and dependencies before running the flow.
- Run the flow (a minimal sketch of such a flow follows this section) using:

```bash
make run
```
- The final tables will be created in the `online_store_data` dataset in BigQuery.
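The actual flow definition lives in the prefect directory. For orientation, a minimal Prefect 2 flow in the same spirit, with all task names and bodies being hypothetical stand-ins for the real extract/load logic, might look like:

```python
from prefect import flow, task

@task
def extract_from_postgres() -> list[dict]:
    # In the real flow this queries the source Postgres database.
    return [{"id": 1, "amount": 9.99}]

@task
def load_to_gcs(rows: list[dict]) -> str:
    # In the real flow this writes the rows to the GCS data lake and returns the blob path.
    return "gs://<bucket>/raw/transactions.parquet"

@task
def load_to_bigquery(gcs_path: str) -> None:
    # In the real flow this loads the GCS file into a BigQuery staging table.
    print(f"loading {gcs_path} into BigQuery")

@flow
def daily_pipeline():
    rows = extract_from_postgres()
    path = load_to_gcs(rows)
    load_to_bigquery(path)

if __name__ == "__main__":
    daily_pipeline()
```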
- Build the dashboard.