generated from ProjectPythia/cookbook-template
Reorganize Kerchunk Pythia Cookbook (#48)
1 parent 556fd5a, commit 232482e
Showing 30 changed files with 555 additions and 900 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,232 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Kerchunk and Pangeo-Forge\n", | ||
"\n" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Overview\n", | ||
"\n", | ||
"In this tutorial we are going to use the open-source ETL pipeline named pangeo-forge-recipes to generate Kerchunk references.\n", | ||
"\n", | ||
"Pangeo-Forge is a community project to build reproducible cloud-native ARCO (Analysis-Ready-Cloud-Optimized) datasets. The Python library (`pangeo-forge-recipes`) is the ETL pipeline to process these datasets or \"recipes\". While a majority of the recipes convert a legacy format such as NetCDF to Zarr stores, `pangeo-forge-recipes` can also use Kerchunk under the hood to create reference recipes. \n", | ||
"\n", | ||
"It is important to note that `Kerchunk` can be used independently of `pangeo-forge-recipes` and in this example, `pangeo-forge-recipes` is acting as the runner for `Kerchunk`. \n", | ||
"\n", | ||
"\n", | ||
"\n", | ||
"## Prerequisites\n", | ||
"| Concepts | Importance | Notes |\n", | ||
"| --- | --- | --- |\n", | ||
"| [Kerchunk Basics](../foundations/kerchunk_basics) | Required | Core |\n", | ||
"| [Multiple Files and Kerchunk](../foundations/kerchunk_multi_file) | Required | Core |\n", | ||
"| [Multi-File Datasets with Kerchunk](../case_studies/ARG_Weather.ipynb) | Required | IO/Visualization |\n", | ||
"\n", | ||
"- **Time to learn**: 45 minutes\n", | ||
"---" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Getting to Know The Data\n", | ||
"\n", | ||
"`gridMET` is a high-resolution daily meteorological dataset covering CONUS from 1979-2023. It is produced by the Climatology Lab at UC Merced. In this example, we are going to look create a virtual Zarr dataset of a derived variable, Burn Index. " | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Examine a Single File" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"tags": [] | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import xarray as xr\n", | ||
"\n", | ||
"ds = xr.open_dataset(\n", | ||
" \"http://thredds.northwestknowledge.net:8080/thredds/dodsC/MET/bi/bi_2021.nc\"\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"#### Plot the Dataset" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"tags": [] | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"ds.sel(day=\"2021-08-01\").burning_index_g.plot()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Create a File Pattern\n", | ||
"\n", | ||
"To build our `pangeo-forge` pipeline, we need to create a `FilePattern` object, which is composed of all of our input urls. This dataset ranges from 1979 through 2023 and is composed of one year per file. \n", | ||
" \n", | ||
"To speed up our example, we will `prune` our recipe to select the first two entries in the `FilePattern`" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"tags": [] | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"from pangeo_forge_recipes.patterns import ConcatDim, FilePattern, MergeDim\n", | ||
"\n", | ||
"years = list(range(1979, 2022 + 1))\n", | ||
"\n", | ||
"\n", | ||
"time_dim = ConcatDim(\"time\", keys=years)\n", | ||
"\n", | ||
"\n", | ||
"def format_function(time):\n", | ||
" return f\"http://www.northwestknowledge.net/metdata/data/bi_{time}.nc\"\n", | ||
"\n", | ||
"\n", | ||
"pattern = FilePattern(format_function, time_dim, file_type=\"netcdf4\")\n", | ||
"\n", | ||
"\n", | ||
"pattern = pattern.prune()\n", | ||
"\n", | ||
"pattern" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Create a Location For Output\n", | ||
"We write to local storage for this example, but the reference file could also be shared via cloud storage. " | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"tags": [] | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"target_root = \"references\"\n", | ||
"store_name = \"Pangeo_Forge\"" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Build the Pangeo-Forge Beam Pipeline\n", | ||
"\n", | ||
"Next, we will chain together a bunch of methods to create a Pangeo-Forge - Apache Beam pipeline. \n", | ||
"Processing steps are chained together with the pipe operator (`|`). Once the pipeline is built, it can be ran in the following cell. \n", | ||
"\n", | ||
"The steps are as follows:\n", | ||
"1. Creates a starting collection of our input file patterns.\n", | ||
"2. Passes those file_patterns to `OpenWithKerchunk`, which creates references of each file.\n", | ||
"3. Combines the references files into a single reference file and write them with `WriteCombineReferences`\n", | ||
"\n", | ||
"Just like Kerchunk, you can specify the reference file type as either `.json` or `.parquet`.\n", | ||
"\n", | ||
"Note: You can add additional processing steps in this pipeline. \n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"tags": [] | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import apache_beam as beam\n", | ||
"from pangeo_forge_recipes.transforms import (\n", | ||
" OpenWithKerchunk,\n", | ||
" WriteCombinedReference,\n", | ||
")\n", | ||
"\n", | ||
"transforms = (\n", | ||
" # Create a beam PCollection from our input file pattern\n", | ||
" beam.Create(pattern.items())\n", | ||
" # Open with Kerchunk and create references for each file\n", | ||
" | OpenWithKerchunk(file_type=pattern.file_type)\n", | ||
" # Use Kerchunk's `MultiZarrToZarr` functionality to combine and then write references.\n", | ||
" # *Note*: Setting the correct contact_dims and identical_dims is important.\n", | ||
" | WriteCombinedReference(\n", | ||
" target_root=target_root,\n", | ||
" store_name=store_name,\n", | ||
" output_file_name=\"reference.json\",\n", | ||
" concat_dims=[\"day\"],\n", | ||
" identical_dims=[\"lat\", \"lon\", \"crs\"],\n", | ||
" )\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"tags": [] | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"%%time\n", | ||
"\n", | ||
"with beam.Pipeline() as p:\n", | ||
" p | transforms" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.10.11" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 4 | ||
} |
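The "Create a File Pattern" cell maps each year key to one input URL and then keeps only the first two entries. As a rough illustration of that mapping, here is a minimal plain-Python sketch that does not import `pangeo-forge-recipes`. The URL template and the year range are copied from the notebook; the `urls` and `pruned` names and the two-entry slice are assumptions standing in for what the notebook describes `prune` as doing, not the library's implementation.

```python
# Plain-Python sketch of the year -> URL mapping behind the notebook's FilePattern.
# This only illustrates the mapping; it does not use pangeo-forge-recipes.


def format_function(time):
    # Same URL template as the notebook's format_function
    return f"http://www.northwestknowledge.net/metdata/data/bi_{time}.nc"


# Same year keys as the notebook: 1979 through 2022 inclusive
years = list(range(1979, 2022 + 1))

# One URL per year key, in key order (hypothetical helper dict)
urls = {year: format_function(year) for year in years}

# Hypothetical stand-in for pruning: keep only the first two entries
pruned = dict(list(urls.items())[:2])

for year, url in pruned.items():
    print(year, url)
```

Printing the candidate URLs this way is a cheap check that the template and key range are what you expect before running the full Beam pipeline.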