Reorganize Kerchunk Pythia Cookbook (#48)
norlandrhagen authored Oct 26, 2023
1 parent 556fd5a commit 232482e
Showing 30 changed files with 555 additions and 900 deletions.
22 changes: 5 additions & 17 deletions README.md
@@ -55,29 +55,17 @@ how to use `Kerchunk` to create reference sets
from single file sources, as well as to create
multi-file virtual datasets from collections of files.

-### Section 2 Case Studies
+### Section 2 Generating Reference Files

-The notebooks in the `Case Studies` section
+The notebooks in the `Generating Reference Files` section
demonstrate how to use `Kerchunk` to create
datasets for all the supported file formats.
`Kerchunk` currently supports NetCDF3,
-NetCDF4/HDF5, GRIB2, TIFF (including CoG)
-and FITS, but more file formats will
-be available in the future.
+NetCDF4/HDF5, GRIB2, TIFF (including CoG).

-### Future Additions / Wishlist
+### Section 3 Using Pre-Generated References

-This Pythia cookbook is a start, but there are
-many more details of `Kerchunk` that could be
-covered. If you have an idea of what to add or
-would like to contribute, please open up a PR or issue.
-
-Some possible additions:
-
-- Diving into the details: The nitty-gritty on how `Kerchunk` works.
-- `Kerchunk` and `Parquet`: what are the benefits of using parquet for reference file storage.
-- Appending to a Kerchunk dataset:
-  How to schedule processing of newly added data files and how to add them to a `Kerchunk` dataset.
+The `Using Pre-Generated References` section contains notebooks demonstrating how to load existing references into `Xarray` and `Xarray-Datatree`, generate coordinates for GeoTIFFs using `xrefcoord`, and plot with `hvPlot` and `Datashader`.
## Running the Notebooks

25 changes: 17 additions & 8 deletions _toc.yml
@@ -4,18 +4,27 @@ parts:
- caption: Preamble
  chapters:
  - file: notebooks/how-to-cite

- caption: Foundations
  chapters:
  - file: notebooks/foundations/01_kerchunk_basics
  - file: notebooks/foundations/02_kerchunk_multi_file
  - file: notebooks/foundations/03_kerchunk_dask
-  - file: notebooks/foundations/04_kerchunk_reference_storage.ipynb

-- caption: Case Studies
+- caption: Advanced
  chapters:
+  - file: notebooks/advanced/Parquet_Reference_Storage
+  - file: notebooks/advanced/Pangeo_Forge
+
+- caption: Generating Reference Files
+  chapters:
+  - file: notebooks/generating_references/NetCDF
+  - file: notebooks/generating_references/GRIB2
+  - file: notebooks/generating_references/GeoTIFF
+
+- caption: Using Pre-Generated References
+  chapters:
-  - file: notebooks/case_studies/NetCDF_SMN_Arg
-  - file: notebooks/case_studies/GRIB2_HRRR
-  - file: notebooks/case_studies/GeoTIFF_FMI
-  - file: notebooks/case_studies/NetCDF_Pangeo_Forge_gridMET
-  - file: notebooks/case_studies/Streaming_Visualizations_with_Hvplot_Datashader
-  - file: notebooks/case_studies/Kerchunk_DataTree.ipynb
+  - file: notebooks/using_references/Xarray
+  - file: notebooks/using_references/Xrefcoord
+  - file: notebooks/using_references/Datatree
+  - file: notebooks/using_references/Hvplot_Datashader
2 changes: 1 addition & 1 deletion environment.yml
@@ -19,7 +19,6 @@ dependencies:
- jupyter-book
- jupyterlab
- jupyterlab=3
-- kerchunk
- mamba
- matplotlib
- netcdf4
@@ -43,3 +42,4 @@ dependencies:
- "apache-beam[interactive, dataframe]"
- git+https://github.com/pangeo-forge/pangeo-forge-recipes
- git+https://github.com/carbonplan/xrefcoord.git
+- git+https://github.com/fsspec/kerchunk
232 changes: 232 additions & 0 deletions notebooks/advanced/Pangeo_Forge.ipynb
@@ -0,0 +1,232 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Kerchunk and Pangeo-Forge\n",
"\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Overview\n",
"\n",
"In this tutorial we are going to use the open-source ETL pipeline named pangeo-forge-recipes to generate Kerchunk references.\n",
"\n",
"Pangeo-Forge is a community project to build reproducible cloud-native ARCO (Analysis-Ready-Cloud-Optimized) datasets. The Python library (`pangeo-forge-recipes`) is the ETL pipeline to process these datasets or \"recipes\". While a majority of the recipes convert a legacy format such as NetCDF to Zarr stores, `pangeo-forge-recipes` can also use Kerchunk under the hood to create reference recipes. \n",
"\n",
"It is important to note that `Kerchunk` can be used independently of `pangeo-forge-recipes` and in this example, `pangeo-forge-recipes` is acting as the runner for `Kerchunk`. \n",
"\n",
"\n",
"\n",
"## Prerequisites\n",
"| Concepts | Importance | Notes |\n",
"| --- | --- | --- |\n",
"| [Kerchunk Basics](../foundations/kerchunk_basics) | Required | Core |\n",
"| [Multiple Files and Kerchunk](../foundations/kerchunk_multi_file) | Required | Core |\n",
"| [Multi-File Datasets with Kerchunk](../case_studies/ARG_Weather.ipynb) | Required | IO/Visualization |\n",
"\n",
"- **Time to learn**: 45 minutes\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting to Know The Data\n",
"\n",
"`gridMET` is a high-resolution daily meteorological dataset covering CONUS from 1979-2023. It is produced by the Climatology Lab at UC Merced. In this example, we are going to look create a virtual Zarr dataset of a derived variable, Burn Index. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Examine a Single File"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import xarray as xr\n",
"\n",
"ds = xr.open_dataset(\n",
" \"http://thredds.northwestknowledge.net:8080/thredds/dodsC/MET/bi/bi_2021.nc\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Plot the Dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"ds.sel(day=\"2021-08-01\").burning_index_g.plot()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create a File Pattern\n",
"\n",
"To build our `pangeo-forge` pipeline, we need to create a `FilePattern` object, which is composed of all of our input urls. This dataset ranges from 1979 through 2023 and is composed of one year per file. \n",
" \n",
"To speed up our example, we will `prune` our recipe to select the first two entries in the `FilePattern`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from pangeo_forge_recipes.patterns import ConcatDim, FilePattern, MergeDim\n",
"\n",
"years = list(range(1979, 2022 + 1))\n",
"\n",
"\n",
"time_dim = ConcatDim(\"time\", keys=years)\n",
"\n",
"\n",
"def format_function(time):\n",
" return f\"http://www.northwestknowledge.net/metdata/data/bi_{time}.nc\"\n",
"\n",
"\n",
"pattern = FilePattern(format_function, time_dim, file_type=\"netcdf4\")\n",
"\n",
"\n",
"pattern = pattern.prune()\n",
"\n",
"pattern"
]
},
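{
"cell_type": "markdown",
"metadata": {},
"source": [
"The printed `FilePattern` summarizes the pruned pattern. To see exactly which inputs the pipeline will consume, we can also iterate over it; a minimal sketch, relying on `pattern.items()` yielding `(index, url)` pairs just as `beam.Create(pattern.items())` does below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Each entry maps a position along the concat dimension to an input URL.\n",
"for index, url in pattern.items():\n",
"    print(index, url)"
]
},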
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create a Location For Output\n",
"We write to local storage for this example, but the reference file could also be shared via cloud storage. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"target_root = \"references\"\n",
"store_name = \"Pangeo_Forge\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build the Pangeo-Forge Beam Pipeline\n",
"\n",
"Next, we will chain together a bunch of methods to create a Pangeo-Forge - Apache Beam pipeline. \n",
"Processing steps are chained together with the pipe operator (`|`). Once the pipeline is built, it can be ran in the following cell. \n",
"\n",
"The steps are as follows:\n",
"1. Creates a starting collection of our input file patterns.\n",
"2. Passes those file_patterns to `OpenWithKerchunk`, which creates references of each file.\n",
"3. Combines the references files into a single reference file and write them with `WriteCombineReferences`\n",
"\n",
"Just like Kerchunk, you can specify the reference file type as either `.json` or `.parquet`.\n",
"\n",
"Note: You can add additional processing steps in this pipeline. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import apache_beam as beam\n",
"from pangeo_forge_recipes.transforms import (\n",
" OpenWithKerchunk,\n",
" WriteCombinedReference,\n",
")\n",
"\n",
"transforms = (\n",
" # Create a beam PCollection from our input file pattern\n",
" beam.Create(pattern.items())\n",
" # Open with Kerchunk and create references for each file\n",
" | OpenWithKerchunk(file_type=pattern.file_type)\n",
" # Use Kerchunk's `MultiZarrToZarr` functionality to combine and then write references.\n",
" # *Note*: Setting the correct contact_dims and identical_dims is important.\n",
" | WriteCombinedReference(\n",
" target_root=target_root,\n",
" store_name=store_name,\n",
" output_file_name=\"reference.json\",\n",
" concat_dims=[\"day\"],\n",
" identical_dims=[\"lat\", \"lon\", \"crs\"],\n",
" )\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%%time\n",
"\n",
"with beam.Pipeline() as p:\n",
" p | transforms"
]
}
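,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Open the Combined Reference\n",
"\n",
"As a quick sanity check, we can open the reference file we just wrote. This is a minimal sketch, assuming the pipeline produced `references/Pangeo_Forge/reference.json` and that the source NetCDF files are still reachable over HTTP:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import fsspec\n",
"import xarray as xr\n",
"\n",
"# fsspec's 'reference' filesystem maps Zarr keys to byte ranges in the original NetCDF files.\n",
"fs = fsspec.filesystem(\n",
"    \"reference\",\n",
"    fo=\"references/Pangeo_Forge/reference.json\",\n",
"    remote_protocol=\"http\",\n",
")\n",
"ds = xr.open_dataset(\n",
"    fs.get_mapper(\"\"), engine=\"zarr\", backend_kwargs={\"consolidated\": False}\n",
")\n",
"ds"
]
}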
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
@@ -270,7 +270,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
"version": "3.11.0"
}
},
"nbformat": 4,
Expand Down