s3-data-lake-example

Creating an S3 data lake with a PySpark ETL.

The first step uses pandas to extract only the required columns and to write them to the data lake as Parquet files.
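A minimal sketch of this first step could look like the following. The column names and the helper functions `select_required_columns` and `write_to_lake` are hypothetical, since the dataset's schema is not described here. Writing Parquet with pandas assumes `pyarrow` or `fastparquet` is installed, and writing directly to an `s3://` path additionally assumes `s3fs`.

```python
import pandas as pd

# Hypothetical column names; the real dataset's schema is not given here.
REQUIRED_COLUMNS = ["line_id", "stop_id", "expected_arrival", "actual_arrival"]


def select_required_columns(raw: pd.DataFrame, columns=REQUIRED_COLUMNS) -> pd.DataFrame:
    """Return a copy of `raw` that contains only the required columns."""
    missing = [c for c in columns if c not in raw.columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    return raw[list(columns)].copy()


def write_to_lake(df: pd.DataFrame, path: str) -> None:
    """Write `df` as a Parquet file at `path` (local path or s3:// URL).

    Requires pyarrow or fastparquet; an s3:// path also requires s3fs.
    """
    df.to_parquet(path, index=False)
```

Keeping the column selection separate from the write makes it possible to check the selection locally before anything is uploaded to S3.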

Queries to be supported: the lines where the difference between expected and actual arrival time is large.
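As a rough illustration of this query, the sketch below filters rows whose delay exceeds a threshold. It uses pandas for brevity; the same filter could be expressed in PySpark against the Parquet files. The column names, the `delayed_arrivals` helper, and the 10-minute default threshold are all assumptions, not taken from the project.

```python
import pandas as pd


def delayed_arrivals(df: pd.DataFrame,
                     min_delay: pd.Timedelta = pd.Timedelta(minutes=10)) -> pd.DataFrame:
    """Return rows where actual arrival is later than expected by more than `min_delay`.

    Assumes hypothetical columns `expected_arrival` and `actual_arrival`
    holding timestamps (or strings that pandas can parse as timestamps).
    """
    out = df.copy()
    # Delay is positive when the actual arrival is later than expected.
    out["delay"] = (pd.to_datetime(out["actual_arrival"])
                    - pd.to_datetime(out["expected_arrival"]))
    # Keep only the long delays, worst first.
    return out[out["delay"] > min_delay].sort_values("delay", ascending=False)
```

The threshold is a parameter because what counts as a "long" delay is not defined here.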