
"Delta Lake is an open source project that enables building a Lakehouse architecture on top of data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing on top of existing data lakes, such as S3, ADLS, GCS, and HDFS."



Introduction to Delta Upsert

This repository demonstrates a simple ELT process that uses Delta Lake to perform upserts and to remove data files that are no longer part of the latest state of the table's transaction log.

📝 Table of Contents

  • 1.raw-zone-ingestion - first ingestion to raw-zone
  • 2.raw-zone-incremental - incremental ingestion (append) to raw-zone
  • 3.staging-zone-ingestion - snapshot of the latest state of the table and creation of staging-zone (delta)
  • 4.staging-zone-incremental - incremental snapshot ingestion (delta)
  • Check scripts (check_raw-zone.py, check_staging-zone.py) - scripts to read and monitor tables being created
  • CSV files (titanic.csv, titanic2.csv, titanic3.csv) - simulate changes in tables being ingested
  • Directories (raw-zone, staging-zone) - store the data
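The flow of the four steps above can be sketched conceptually in plain Python. This is only a model of the semantics (append-only raw zone, latest-state snapshot, then upsert), not the repository's code and not the Delta Lake API. The helper names `latest_snapshot` and `upsert`, the key column `PassengerId`, the `ingested_at` ordering column, and the sample rows are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def latest_snapshot(raw_rows: Iterable[Row], key: str, order_by: str) -> Dict[object, Row]:
    """Collapse an append-only log into the most recent row per key (hypothetical helper)."""
    snapshot: Dict[object, Row] = {}
    for row in raw_rows:
        current = snapshot.get(row[key])
        # Keep the row with the highest ordering value; later rows win ties.
        if current is None or row[order_by] >= current[order_by]:
            snapshot[row[key]] = row
    return snapshot


def upsert(target: Dict[object, Row], updates: Iterable[Row], key: str) -> Dict[object, Row]:
    """Update rows whose key already exists in target and insert the rest (hypothetical helper)."""
    merged = dict(target)  # copy so the previous snapshot is left untouched
    for row in updates:
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return merged


# Steps 1 and 2: the raw zone only ever receives appends.
# Illustrative rows, not taken from the repository's CSV files.
raw_zone: List[Row] = [
    {"PassengerId": 1, "Survived": 0, "ingested_at": 1},
    {"PassengerId": 2, "Survived": 1, "ingested_at": 1},
]
raw_zone += [
    {"PassengerId": 1, "Survived": 1, "ingested_at": 2},  # changed record, appended
]

# Step 3: the staging zone starts as the latest state of each key.
staging = latest_snapshot(raw_zone, key="PassengerId", order_by="ingested_at")

# Step 4: a new increment is merged in, updating key 2 and inserting key 3.
increment: List[Row] = [
    {"PassengerId": 2, "Survived": 0, "ingested_at": 3},
    {"PassengerId": 3, "Survived": 1, "ingested_at": 3},
]
staging = upsert(staging, increment, key="PassengerId")

for passenger_id in sorted(staging):
    print(staging[passenger_id])
```

In a real Delta table the upsert would be expressed as a merge against the table rather than a dictionary update, and superseded data files would stay on disk until they are explicitly cleaned up, which is the second operation the introduction mentions.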