Simon-Initiative/dataset
Custom Dataset Creation

This repository contains a parameterized PySpark job for Torus custom dataset creation, along with scripts to deploy and run it and to manage the dependencies and configuration required to execute it within AWS EMR Serverless.


Table of Contents

  1. Deployment
  2. Running the PySpark Job
  3. Updating the Custom Docker Image
  4. Requirements

Deployment

The entrypoint for the PySpark job for custom dataset generation is defined in job.py. Supporting modules are found in the dataset directory. To be invoked in the AWS EMR Serverless environment, these files must be deployed and accessible from an S3 bucket.

The deploy.sh script automates packaging the PySpark job script and its dependencies and uploading them to that S3 bucket.

Steps to Deploy:

  1. Run the deploy.sh script from the root directory:
    ./deploy.sh
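The packaging step that deploy.sh automates can be sketched in Python. This is an illustration of the general approach, not the script's actual contents: it bundles the supporting modules in the dataset directory into a zip archive of the kind EMR Serverless can distribute to executors.

```python
"""Sketch (an assumption, not deploy.sh's actual contents) of packaging
the dataset modules into a zip archive for upload to S3."""
import zipfile
from pathlib import Path


def package_modules(src_dir: str, out_zip: str) -> list[str]:
    """Zip every file under src_dir, preserving paths relative to its
    parent so the archive unpacks as a dataset/ package. Returns the
    archive member names for inspection."""
    src = Path(src_dir)
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(src.parent).as_posix())
    with zipfile.ZipFile(out_zip) as zf:
        return zf.namelist()
```

The upload itself would then be one copy per artifact (job.py and the archive), e.g. with `aws s3 cp`; the destination bucket is deployment-specific and is configured inside deploy.sh.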

Running the PySpark Job

A job can be invoked manually from EMR Serverless Studio, or directly from the command line using the run_job.sh script. The AWS command line tools are required (https://aws.amazon.com/cli/).

Steps to Run:

  1. Run the run_job.sh script from the root directory with arguments for the action, event subtypes, and section ids:
    ./run_job.sh attempt_evaluated part_attempt_evaluated 2342,2343
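On the job side, arguments like those forwarded by run_job.sh might be parsed along these lines. The parameter names below are hypothetical, for illustration only; the real entrypoint in job.py defines its own interface.

```python
"""Sketch of parsing the run_job.sh arguments (action, event subtypes,
section ids); argument names are assumptions, not taken from job.py."""
import argparse


def parse_job_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Torus custom dataset job")
    parser.add_argument("action", help="dataset action, e.g. attempt_evaluated")
    parser.add_argument("subtypes", help="comma-separated event subtypes")
    parser.add_argument("section_ids", help="comma-separated section ids")
    args = parser.parse_args(argv)
    # Split the comma-separated lists into typed values.
    args.subtypes = args.subtypes.split(",")
    args.section_ids = [int(s) for s in args.section_ids.split(",")]
    return args
```

For the example invocation above, this would yield action `attempt_evaluated`, one event subtype, and the two section ids as integers.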

Updating the Custom Docker Image

The dependencies needed by code executing on PySpark worker and executor nodes are supplied via a custom EMR Docker image. This image may need to be updated periodically as the feature set expands. The Dockerfile is located at config/Dockerfile, and the update_image.sh script automates building and deploying it.
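A custom EMR Serverless image typically extends an AWS-provided base image, installs extra Python packages, and switches back to the hadoop user. A minimal sketch follows; the base image tag and package list here are illustrative assumptions, not the actual contents of config/Dockerfile:

```dockerfile
# Illustrative sketch only — see config/Dockerfile for the real file.
FROM public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest

USER root
# Install extra Python dependencies needed by the job (example packages).
RUN python3 -m pip install pandas numpy

# EMR Serverless requires custom images to run as the hadoop user.
USER hadoop:hadoop
```

After building, update_image.sh would push the image to a container registry (such as Amazon ECR) where EMR Serverless can pull it.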
