Reorganize Kerchunk Pythia Cookbook (#48)
norlandrhagen authored Oct 26, 2023
1 parent 556fd5a commit 232482e
Showing 30 changed files with 555 additions and 900 deletions.
22 changes: 5 additions & 17 deletions README.md
@@ -55,29 +55,17 @@ how to use `Kerchunk` to create reference sets
from single file sources, as well as to create
multi-file virtual datasets from collections of files.

-### Section 2 Case Studies
+### Section 2 Generating Reference Files

-The notebooks in the `Case Studies` section
+The notebooks in the `Generating Reference Files` section
demonstrate how to use `Kerchunk` to create
datasets for all the supported file formats.
`Kerchunk` currently supports NetCDF3,
-NetCDF4/HDF5, GRIB2, TIFF (including CoG)
-and FITS, but more file formats will
-be available in the future.
+NetCDF4/HDF5, GRIB2, TIFF (including CoG).

-### Future Additions / Wishlist
+### Section 3 Using Pre-Generated References

-This Pythia cookbook is a start, but there are
-many more details of `Kerchunk` that could be
-covered. If you have an idea of what to add or
-would like to contribute, please open up a PR or issue.
-
-Some possible additions:
-
-- Diving into the details: The nitty-gritty on how `Kerchunk` works.
-- `Kerchunk` and `Parquet`: what are the benefits of using parquet for reference file storage.
-- Appending to a Kerchunk dataset:
-  How to schedule processing of newly added data files and how to add them to a `Kerchunk` dataset.
+The `Using Pre-Generated References` section contains notebooks demonstrating how to load existing references into `Xarray` and `Xarray-Datatree`, generate coordinates for GeoTIFFs using `xrefcoord`, and plot with `hvPlot` and `Datashader`.
## Running the Notebooks

25 changes: 17 additions & 8 deletions _toc.yml
@@ -4,18 +4,27 @@ parts:
- caption: Preamble
  chapters:
  - file: notebooks/how-to-cite

- caption: Foundations
  chapters:
  - file: notebooks/foundations/01_kerchunk_basics
  - file: notebooks/foundations/02_kerchunk_multi_file
  - file: notebooks/foundations/03_kerchunk_dask
-  - file: notebooks/foundations/04_kerchunk_reference_storage.ipynb

-- caption: Case Studies
+- caption: Advanced
  chapters:
+  - file: notebooks/advanced/Parquet_Reference_Storage
+  - file: notebooks/advanced/Pangeo_Forge
+
+- caption: Generating Reference Files
+  chapters:
+  - file: notebooks/generating_references/NetCDF
+  - file: notebooks/generating_references/GRIB2
+  - file: notebooks/generating_references/GeoTIFF
+
+- caption: Using Pre-Generated References
+  chapters:
-  - file: notebooks/case_studies/NetCDF_SMN_Arg
-  - file: notebooks/case_studies/GRIB2_HRRR
-  - file: notebooks/case_studies/GeoTIFF_FMI
-  - file: notebooks/case_studies/NetCDF_Pangeo_Forge_gridMET
-  - file: notebooks/case_studies/Streaming_Visualizations_with_Hvplot_Datashader
-  - file: notebooks/case_studies/Kerchunk_DataTree.ipynb
+  - file: notebooks/using_references/Xarray
+  - file: notebooks/using_references/Xrefcoord
+  - file: notebooks/using_references/Datatree
+  - file: notebooks/using_references/Hvplot_Datashader
2 changes: 1 addition & 1 deletion environment.yml
@@ -19,7 +19,6 @@ dependencies:
- jupyter-book
- jupyterlab
- jupyterlab=3
-- kerchunk
- mamba
- matplotlib
- netcdf4
@@ -43,3 +42,4 @@ dependencies:
- "apache-beam[interactive, dataframe]"
- git+https://github.com/pangeo-forge/pangeo-forge-recipes
- git+https://github.com/carbonplan/xrefcoord.git
+- git+https://github.com/fsspec/kerchunk
232 changes: 232 additions & 0 deletions notebooks/advanced/Pangeo_Forge.ipynb
@@ -0,0 +1,232 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Kerchunk and Pangeo-Forge\n",
"\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Overview\n",
"\n",
"In this tutorial we are going to use the open-source ETL pipeline named pangeo-forge-recipes to generate Kerchunk references.\n",
"\n",
"Pangeo-Forge is a community project to build reproducible cloud-native ARCO (Analysis-Ready-Cloud-Optimized) datasets. The Python library (`pangeo-forge-recipes`) is the ETL pipeline to process these datasets or \"recipes\". While a majority of the recipes convert a legacy format such as NetCDF to Zarr stores, `pangeo-forge-recipes` can also use Kerchunk under the hood to create reference recipes. \n",
"\n",
"It is important to note that `Kerchunk` can be used independently of `pangeo-forge-recipes` and in this example, `pangeo-forge-recipes` is acting as the runner for `Kerchunk`. \n",
"\n",
"\n",
"\n",
"## Prerequisites\n",
"| Concepts | Importance | Notes |\n",
"| --- | --- | --- |\n",
"| [Kerchunk Basics](../foundations/kerchunk_basics) | Required | Core |\n",
"| [Multiple Files and Kerchunk](../foundations/kerchunk_multi_file) | Required | Core |\n",
"| [Multi-File Datasets with Kerchunk](../case_studies/ARG_Weather.ipynb) | Required | IO/Visualization |\n",
"\n",
"- **Time to learn**: 45 minutes\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting to Know The Data\n",
"\n",
"`gridMET` is a high-resolution daily meteorological dataset covering CONUS from 1979-2023. It is produced by the Climatology Lab at UC Merced. In this example, we are going to look create a virtual Zarr dataset of a derived variable, Burn Index. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Examine a Single File"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import xarray as xr\n",
"\n",
"ds = xr.open_dataset(\n",
" \"http://thredds.northwestknowledge.net:8080/thredds/dodsC/MET/bi/bi_2021.nc\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Plot the Dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"ds.sel(day=\"2021-08-01\").burning_index_g.plot()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create a File Pattern\n",
"\n",
"To build our `pangeo-forge` pipeline, we need to create a `FilePattern` object, which is composed of all of our input urls. This dataset ranges from 1979 through 2023 and is composed of one year per file. \n",
" \n",
"To speed up our example, we will `prune` our recipe to select the first two entries in the `FilePattern`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from pangeo_forge_recipes.patterns import ConcatDim, FilePattern, MergeDim\n",
"\n",
"years = list(range(1979, 2022 + 1))\n",
"\n",
"\n",
"time_dim = ConcatDim(\"time\", keys=years)\n",
"\n",
"\n",
"def format_function(time):\n",
" return f\"http://www.northwestknowledge.net/metdata/data/bi_{time}.nc\"\n",
"\n",
"\n",
"pattern = FilePattern(format_function, time_dim, file_type=\"netcdf4\")\n",
"\n",
"\n",
"pattern = pattern.prune()\n",
"\n",
"pattern"
]
},
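{
"cell_type": "markdown",
"metadata": {},
"source": [
"The printed `FilePattern` summarizes the pruned pattern. To see exactly which inputs the pipeline will consume, we can also iterate over it; a minimal sketch, relying on `pattern.items()` yielding `(index, url)` pairs just as `beam.Create(pattern.items())` does below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Each entry maps a position along the concat dimension to an input URL.\n",
"for index, url in pattern.items():\n",
"    print(index, url)"
]
},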
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create a Location For Output\n",
"We write to local storage for this example, but the reference file could also be shared via cloud storage. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"target_root = \"references\"\n",
"store_name = \"Pangeo_Forge\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build the Pangeo-Forge Beam Pipeline\n",
"\n",
"Next, we will chain together a bunch of methods to create a Pangeo-Forge - Apache Beam pipeline. \n",
"Processing steps are chained together with the pipe operator (`|`). Once the pipeline is built, it can be ran in the following cell. \n",
"\n",
"The steps are as follows:\n",
"1. Creates a starting collection of our input file patterns.\n",
"2. Passes those file_patterns to `OpenWithKerchunk`, which creates references of each file.\n",
"3. Combines the references files into a single reference file and write them with `WriteCombineReferences`\n",
"\n",
"Just like Kerchunk, you can specify the reference file type as either `.json` or `.parquet`.\n",
"\n",
"Note: You can add additional processing steps in this pipeline. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import apache_beam as beam\n",
"from pangeo_forge_recipes.transforms import (\n",
" OpenWithKerchunk,\n",
" WriteCombinedReference,\n",
")\n",
"\n",
"transforms = (\n",
" # Create a beam PCollection from our input file pattern\n",
" beam.Create(pattern.items())\n",
" # Open with Kerchunk and create references for each file\n",
" | OpenWithKerchunk(file_type=pattern.file_type)\n",
" # Use Kerchunk's `MultiZarrToZarr` functionality to combine and then write references.\n",
" # *Note*: Setting the correct contact_dims and identical_dims is important.\n",
" | WriteCombinedReference(\n",
" target_root=target_root,\n",
" store_name=store_name,\n",
" output_file_name=\"reference.json\",\n",
" concat_dims=[\"day\"],\n",
" identical_dims=[\"lat\", \"lon\", \"crs\"],\n",
" )\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%%time\n",
"\n",
"with beam.Pipeline() as p:\n",
" p | transforms"
]
}
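,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Open the Combined Reference\n",
"\n",
"As a quick sanity check, we can open the reference file we just wrote. This is a minimal sketch, assuming the pipeline produced `references/Pangeo_Forge/reference.json` and that the source NetCDF files are still reachable over HTTP:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import fsspec\n",
"import xarray as xr\n",
"\n",
"# fsspec's 'reference' filesystem maps Zarr keys to byte ranges in the original NetCDF files.\n",
"fs = fsspec.filesystem(\n",
"    \"reference\",\n",
"    fo=\"references/Pangeo_Forge/reference.json\",\n",
"    remote_protocol=\"http\",\n",
")\n",
"ds = xr.open_dataset(\n",
"    fs.get_mapper(\"\"), engine=\"zarr\", backend_kwargs={\"consolidated\": False}\n",
")\n",
"ds"
]
}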
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
@@ -270,7 +270,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
"version": "3.11.0"
}
},
"nbformat": 4,
Expand Down