This repository contains the code to set up an end-to-end data pipeline for an e-commerce dataset using Google Cloud Platform, Hadoop, Hive, PySpark, and Looker Studio.
- `setup/`: Scripts for setting up the GCP bucket and Dataproc cluster.
- `hdfs_setup/`: Commands for setting up HDFS and uploading the dataset.
- `hive_setup/`: SQL scripts for setting up Hive tables and loading data.
- `pyspark_setup/`: PySpark scripts for processing data.
- `save_dataframes/`: Scripts for saving processed data to GCS, HDFS, and Hive.
- `visualization/`: Instructions for visualizing data in Looker Studio.
./setup/create_gcp_bucket_and_dataproc_cluster.ps1
./setup/upload_dataset_to_bucket.ps1
./setup/ssh_into_cluster.ps1
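For reference, the three PowerShell scripts above roughly correspond to the following `gcloud` commands. This is a hedged sketch: the bucket name, cluster name, region, and zone are placeholders, not values taken from the scripts.

```shell
# Create a GCS bucket (name and location are placeholders)
gcloud storage buckets create gs://ecommerce-pipeline-bucket --location=us-central1

# Upload the dataset to the bucket (filename is illustrative)
gcloud storage cp ecommerce_data.csv gs://ecommerce-pipeline-bucket/

# Create a minimal Dataproc cluster
gcloud dataproc clusters create ecommerce-cluster \
    --region=us-central1 \
    --single-node

# SSH into the cluster's master node (Dataproc names it <cluster>-m)
gcloud compute ssh ecommerce-cluster-m --zone=us-central1-a
```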
./hdfs_setup/hdfs_setup_commands.ps1
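The HDFS setup script boils down to creating a target directory and copying the dataset into it. A minimal sketch, run on the cluster — the paths and filename are illustrative, not the script's actual values:

```shell
# Create a working directory in HDFS
hdfs dfs -mkdir -p /user/ecommerce/raw

# Copy the dataset from the local filesystem into HDFS
hdfs dfs -put ecommerce_data.csv /user/ecommerce/raw/

# Verify the upload
hdfs dfs -ls /user/ecommerce/raw
```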
hive -f hive_setup/hive_setup_commands.sql
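The SQL file typically defines an external table over the HDFS data and runs queries against it. The snippet below is an assumption about its shape — the table name, columns, and HDFS path are hypothetical, not the actual schema:

```shell
# Illustrative HiveQL of the kind hive_setup_commands.sql might contain
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS orders (
  order_id STRING,
  category STRING,
  price DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/ecommerce/raw';

-- quick sanity check on the loaded data
SELECT category, COUNT(*) FROM orders GROUP BY category;
"
```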
spark-submit pyspark_setup/pyspark_setup.py
spark-submit save_dataframes/save_dataframes.py

(Since Spark 2.0, the `pyspark` shell no longer accepts a script argument; standalone Python applications are launched with `spark-submit`.)
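After the save step, the processed data should exist in all three stores. One way to verify, sketched with placeholder paths and a placeholder bucket name:

```shell
# Check the HDFS output directory
hdfs dfs -ls /user/ecommerce/processed

# Check the GCS output prefix
gcloud storage ls gs://ecommerce-pipeline-bucket/processed/

# Confirm the Hive tables were created
hive -e "SHOW TABLES;"
```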
Follow the instructions in visualization/looker_studio_instructions.md.
https://github.com/karanzaveri