
DockerNotebookSparkS3

This repository provides a local, experimental environment for data lakes and mock blob storage, leveraging PySpark and Spark clusters. It lets you mimic Blob Storage locally and manage it from a Jupyter Notebook connected to a Spark cluster, closely emulating a simple but realistic environment.

This setup uses mvn to pull Spark artefacts and their transitive dependencies (e.g. Databricks Delta Lake, used as the example in this template) directly into Spark's jars directory, so Spark needs no network requests at runtime. That makes it an effective template for CI deployment of data processing pipelines and analytics in secure or controlled settings.
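For reference, here is a minimal sketch of how a notebook session might be wired to this setup. The master URL, mock-storage endpoint, credentials, and bucket name are illustrative assumptions, not values defined by this repository:

```python
from pyspark.sql import SparkSession

# A minimal sketch of a notebook session for this setup. The service names,
# endpoint, and credentials below are assumptions; substitute the values used
# by your docker-compose services.
spark = (
    SparkSession.builder
    .appName("local-data-lake")
    .master("spark://spark-master:7077")  # assumed Spark master service name
    # Delta Lake classes resolve from the jars mvn pulled in ahead of time, so
    # no spark.jars.packages entry (and no network access) is needed here.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Point the S3A client at the local mock blob storage service.
    .config("spark.hadoop.fs.s3a.endpoint", "http://mock-blob-storage:9000")  # assumed
    .config("spark.hadoop.fs.s3a.access.key", "test")  # assumed
    .config("spark.hadoop.fs.s3a.secret.key", "test")  # assumed
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)
```

Because the Delta jars are already baked in, the session starts without any outbound dependency resolution, which is exactly what makes the template usable in a locked-down CI environment.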

Effortlessly dive in and unleash your data's potential today!

Features

  • Mock Blob Storage: Mimics Blob Storage locally, enabling seamless integration with notebooks.
  • Spark Cluster: Configured with Docker containers for distributed computing and large-scale dataset processing. Dependencies are managed via the infra-data-lake POM file and pulled into the repository with bash get_spark_deps.sh, an mvn-based script.
  • PySpark Notebooks: Jupyter notebooks for interactive data exploration and analysis; a sketch of a typical cell follows this list. Notebooks run in client or cluster mode, and an open issue (#3) tracks the tooling needed to enable Client/Cluster asynchronous programming.
  • CI/CD Health Checks: Implemented with bash, GitHub Actions, and Docker Compose, the CI health checks ensure services are built, up, and healthy before merging to a protected main branch.
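As an illustration of the notebook workflow, here is a hypothetical cell (continuing the session sketch above) that round-trips a small Delta table through the mock blob storage; the bucket name "data-lake" is an assumption:

```python
# Hypothetical notebook cell: write a Delta table to the mock blob storage
# and read it back. Assumes the `spark` session sketched earlier.
df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])
df.write.format("delta").mode("overwrite").save("s3a://data-lake/example")

spark.read.format("delta").load("s3a://data-lake/example").show()
```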

Getting Started

Use make, or follow these steps to set up the environment via just:

  1. Clone this repository.
  2. Ensure Docker is installed.
  3. Install just.
  4. Run just deploy.
  5. Access Jupyter at http://localhost:8890 with token canttouchthis.
  6. Start experimenting with data lakes, mock blob storage, and PySpark notebooks (see the smoke-test sketch below)!
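Once Jupyter is up, a quick way to confirm the notebook can reach both the cluster and the mock storage is a smoke test along these lines (a sketch; the bucket name is an assumption):

```python
# Hypothetical smoke test to run in a notebook cell after `just deploy`.
# Assumes the `spark` session sketched earlier; "data-lake" bucket is assumed.
print(spark.version)

spark.range(5).write.format("delta").mode("overwrite").save("s3a://data-lake/smoke")
assert spark.read.format("delta").load("s3a://data-lake/smoke").count() == 5
```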

Repository Structure

  • infra-data-lake/localhost: Delta Lake and notebooks for local connectivity.
  • infra-mock-blob-storage: Local mock for Blob Storage.
  • notebook-data-lake: Contains notebooks for data exploration and analysis.

Commands should be run from the root of the repository or using Just.

Configuration

Customize the template for your specific requirements and use cases. Since everything is hard-coded for now, you will probably want to find and replace the term orgname to suit your organisation.

Happy Coding! ✨
