Skip to content

Spark and Data Lakes Project: STEDI Human Balance Analytics (Udacity Data Engineering with AWS Nanodegree)

Notifications You must be signed in to change notification settings

abreufreire/stedi-spark-datalake

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Spark & Data Lakes - STEDI Human Balance Analytics

Project description & goal:

The STEDI startup team has launched on the market Step Trainers which are being adopted by a growing number of customers; the team has been working on IoT sensor data and ML to continuously exceed the expectations of its customers. In order to support the STEDI team, as data engineers, we have been asked to build an ETL pipeline that extracts the data from STEDI Step Trainer sensors and the mobile app, and curate them into a data lakehouse solution on AWS so that data scientists can train a learning model. This lakehouse implementation provides a more versatile and powerful data management solution to embrace modern businesses for data storage, processing, and analytics. For this job we have adopted: Python and Spark, Glue, Athena, and S3.

Project workflow & stack:

flowchart

aws_pipeline

Datasets (source):

  • Customer Records: s3://cd0030bucket/customers/
  • Step Trainer Records: s3://cd0030bucket/step_trainer/
  • Accelerometer Records: s3://cd0030bucket/accelerometer/

Project files:

stedi-spark
└--cli
  |  abreu-stedi.txt            # Content of S3 bucket
└--graphics
  └--curated                    # folder: Athena screenshots: curated data - count of customers & machine learning curated readings                    
  └--landing                    # folder: screenshots: raw data - preview + count of customers, sensor readings & table schemas + data types
  └--trusted                    # folder: Athena screenshots: filtered data - data preview + count of customers & sensor readings
  |  aws_pipeline.png           # AWS stack: S3, Glue/Spark, Athena
  |  database.png               # Glue database
  |  flowchart.png              # Flow of data (source: Udacity)
  |  glue_jobs.png              # screenshot from Glue Studio: Glue ETL jobs
  |  tables.png                 # screenshot from Data Catalog: Glue tables
└--sql                          # folder: sql ddl scripts source of Glue tables                    
└--src                          # folder: python scripts source of Glue jobs 
|  README.md                    # Repository description

Implementation:

Configuration of AWS environment using AWS CLI/cloudshell:

Create S3 gateway endpoint:

$ aws ec2 create-vpc-endpoint --vpc-id vpc-0f184f2e87eba239a --service-name com.amazonaws.us-east-1.s3 --route-table-ids rtb-0852aa5316bd66cc6

Create IAM role:

$ aws iam create-role --role-name my-glue-service-role --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "glue.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}'

Grant AWS Glue privileges to access S3 bucket:

$ aws iam put-role-policy --role-name my-glue-service-role --policy-name S3Access --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ListObjectsInBucket",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::abreu-stedi"
            ]
        },
        {
            "Sid": "AllObjectActions",
            "Effect": "Allow",
            "Action": "s3:*Object",
            "Resource": [
                "arn:aws:s3:::abreu-stedi/*"
            ]
        }
    ]
}'

Upload data to S3 bucket (landing zone):

  • Download/clone the repository with the project data (accelerometer, customer, step trainer)
  • Create a new S3 bucket (ex. $ aws s3 mb s3://abreu-stedi --region us-east-1)
  • Copy individually the JSON data of the 3 categories to the new S3 bucket, creating a sub-folder "landing" for each of the 3 different locations (ex. bucket path: s3://abreu-stedi/accelerometer/landing/).

Create a Glue database (using Data Catalog or AWS CLI):

$ aws glue create-database --database-input '{"Name": "stedi-lake"}'

A Glue database allows us to manage the data directly from the S3 bucket.


Disclosure: The results of each operation are summarized & presented as graphics. The source code of Glue tables & Glue ETL jobs are included in the project folders.


Generation of the following Glue tables using Athena & inspection of raw data in landing zone:

tables

Schema & data types of Glue tables in landing zone:

acc_landing_schema

cus_landing_schema

step_landing_schema


Analysis of Glue tables data in LANDING zone using SQL/Athena:

Accelerometer:

acc_land

acc_land_prev

Customer:

cus_land

cus_land_prev

Filter the data that can be used for research/analytics (customers gave their consensus): cus_land_priv

Step trainer:

step_land

step_land_prev

The sql folder contains the source code (DDL) to create each of the tables.


Building Spark/Glue jobs to ETL data between landing, trusted, and curated zones:

Use Glue Studio to build ETL jobs that extract, transform, and load the data from the origin (using source type: Data source - Data Catalog), transform/munge/clean/filter the data using SQL, and loading it into the target S3 bucket.

The following Glue jobs were created: glue_jobs

The src folder contains the source code (python) to create each of the Glue jobs. The name of each file is self-explanatory.

Investigation of Glue tables data in TRUSTED zone using SQL/Athena:

Accelerometer:

acc_trusted

Customer:

cus_trusted

cus_trusted_prev

Step trainer:

step_trusted

ETL showing accelerometer and step trainer readings in trusted area being transformed (aggregated) into a glue table machine_learning_curated in the curated zone. glue_etl


Inspection of Glue tables data in CURATED zone using SQL/Athena:

Customer:

cus_curated

Machine learning (step trainer aggregated with accelerometer):

m_l_curated

About

Spark and Data Lakes Project: STEDI Human Balance Analytics (Udacity Data Engineering with AWS Nanodegree)

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages