This repository contains the code to set up an end-to-end data pipeline for an e-commerce dataset using Google Cloud Platform, Hadoop, Hive, PySpark, and Looker Studio.
- `setup/`: Scripts for setting up the GCP bucket and Dataproc cluster.
- `hdfs_setup/`: Commands for setting up HDFS and uploading the dataset.
- `hive_setup/`: SQL scripts for setting up Hive tables and loading data.
- `pyspark_setup/`: PySpark scripts for processing data.
- `save_dataframes/`: Scripts for saving processed data to GCS, HDFS, and Hive.
- `visualization/`: Instructions for visualizing data in Looker Studio.
./setup/create_gcp_bucket_and_dataproc_cluster.ps1
./setup/upload_dataset_to_bucket.ps1
./setup/ssh_into_cluster.ps1
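For reference, the three PowerShell scripts above roughly correspond to the following `gcloud` commands. This is a hedged sketch: the bucket name, cluster name, region, and zone are placeholders, not values taken from the scripts.

```shell
# Create a GCS bucket (name and location are placeholders)
gcloud storage buckets create gs://ecommerce-pipeline-bucket --location=us-central1

# Upload the dataset to the bucket (filename is illustrative)
gcloud storage cp ecommerce_data.csv gs://ecommerce-pipeline-bucket/

# Create a minimal Dataproc cluster
gcloud dataproc clusters create ecommerce-cluster \
    --region=us-central1 \
    --single-node

# SSH into the cluster's master node (Dataproc names it <cluster>-m)
gcloud compute ssh ecommerce-cluster-m --zone=us-central1-a
```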
./hdfs_setup/hdfs_setup_commands.ps1
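The HDFS setup script boils down to creating a target directory and copying the dataset into it. A minimal sketch, run on the cluster — the paths and filename are illustrative, not the script's actual values:

```shell
# Create a working directory in HDFS
hdfs dfs -mkdir -p /user/ecommerce/raw

# Copy the dataset from the local filesystem into HDFS
hdfs dfs -put ecommerce_data.csv /user/ecommerce/raw/

# Verify the upload
hdfs dfs -ls /user/ecommerce/raw
```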
hive -f hive_setup/hive_setup_commands.sql
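The SQL file typically defines an external table over the HDFS data and runs queries against it. The snippet below is an assumption about its shape — the table name, columns, and HDFS path are hypothetical, not the actual schema:

```shell
# Illustrative HiveQL of the kind hive_setup_commands.sql might contain
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS orders (
  order_id STRING,
  category STRING,
  price DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/ecommerce/raw';

-- quick sanity check on the loaded data
SELECT category, COUNT(*) FROM orders GROUP BY category;
"
```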
spark-submit pyspark_setup/pyspark_setup.py
spark-submit save_dataframes/save_dataframes.py

(Since Spark 2.0, the `pyspark` shell no longer accepts a script argument; standalone Python applications are launched with `spark-submit`.)
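After the save step, the processed data should exist in all three stores. One way to verify, sketched with placeholder paths and a placeholder bucket name:

```shell
# Check the HDFS output directory
hdfs dfs -ls /user/ecommerce/processed

# Check the GCS output prefix
gcloud storage ls gs://ecommerce-pipeline-bucket/processed/

# Confirm the Hive tables were created
hive -e "SHOW TABLES;"
```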
Follow the instructions in visualization/looker_studio_instructions.md.
https://github.com/karanzaveri