diff --git a/README.md b/README.md index 32945d2..b7d6b8a 100644 --- a/README.md +++ b/README.md @@ -55,29 +55,17 @@ how to use `Kerchunk` to create reference sets from single file sources, as well as to create multi-file virtual datasets from collections of files. -### Section 2 Case Studies +### Section 2 Generating Reference Files -The notebooks in the `Case Studies` section +The notebooks in the `Generating Reference Files` section demonstrate how to use `Kerchunk` to create datasets for all the supported file formats. `Kerchunk` currently supports NetCDF3, -NetCDF4/HDF5, GRIB2, TIFF (including CoG) -and FITS, but more file formats will -be available in the future. +NetCDF4/HDF5, GRIB2, and TIFF (including CoG). -### Future Additions / Wishlist +### Section 3 Using Pre-Generated References -This Pythia cookbook is a start, but there are -many more details of `Kerchunk` that could be -covered. If you have an idea of what to add or -would like to contribute, please open up a PR or issue. - -Some possible additions: - -- Diving into the details: The nitty-gritty on how `Kerchunk` works. -- `Kerchunk` and `Parquet`: what are the benefits of using parquet for reference file storage. -- Appending to a Kerchunk dataset: - How to schedule processing of newly added data files and how to add them to a `Kerchunk` dataset. +The `Using Pre-Generated References` section contains notebooks demonstrating how to load existing references into `Xarray` and `Xarray-Datatree`, generate coordinates for GeoTIFFs using `xrefcoord`, and plot with `hvPlot` and `Datashader`. ## Running the Notebooks diff --git a/_toc.yml b/_toc.yml index 80e14f9..97d53f5 100644 --- a/_toc.yml +++ b/_toc.yml @@ -4,18 +4,27 @@ parts: - caption: Preamble chapters: - file: notebooks/how-to-cite + - caption: Foundations chapters: - file: notebooks/foundations/01_kerchunk_basics - file: notebooks/foundations/02_kerchunk_multi_file - file: notebooks/foundations/03_kerchunk_dask - - file: notebooks/foundations/04_kerchunk_reference_storage.ipynb - - caption: Case Studies + - caption: Advanced + chapters: + - file: notebooks/advanced/Parquet_Reference_Storage + - file: notebooks/advanced/Pangeo_Forge + + - caption: Generating Reference Files + chapters: + - file: notebooks/generating_references/NetCDF + - file: notebooks/generating_references/GRIB2 + - file: notebooks/generating_references/GeoTIFF + + - caption: Using Pre-Generated References chapters: - - file: notebooks/case_studies/NetCDF_SMN_Arg - - file: notebooks/case_studies/GRIB2_HRRR - - file: notebooks/case_studies/GeoTIFF_FMI - - file: notebooks/case_studies/NetCDF_Pangeo_Forge_gridMET - - file: notebooks/case_studies/Streaming_Visualizations_with_Hvplot_Datashader - - file: notebooks/case_studies/Kerchunk_DataTree.ipynb + - file: notebooks/using_references/Xarray + - file: notebooks/using_references/Xrefcoord + - file: notebooks/using_references/Datatree + - file: notebooks/using_references/Hvplot_Datashader diff --git a/environment.yml b/environment.yml index 350868c..6f0d41e 100644 --- a/environment.yml +++ b/environment.yml @@ -19,7 +19,6 @@ dependencies: - jupyter-book - jupyterlab - jupyterlab=3 - - kerchunk - mamba - matplotlib - netcdf4 @@ -43,3 +42,4 @@ dependencies: - "apache-beam[interactive, dataframe]" - git+https://github.com/pangeo-forge/pangeo-forge-recipes - git+https://github.com/carbonplan/xrefcoord.git + - git+https://github.com/fsspec/kerchunk diff --git a/notebooks/advanced/Pangeo_Forge.ipynb b/notebooks/advanced/Pangeo_Forge.ipynb new file mode 100644 index 
0000000..6a9678e --- /dev/null +++ b/notebooks/advanced/Pangeo_Forge.ipynb @@ -0,0 +1,232 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Kerchunk and Pangeo-Forge\n", + "\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Overview\n", + "\n", + "In this tutorial we are going to use the open-source ETL pipeline named pangeo-forge-recipes to generate Kerchunk references.\n", + "\n", + "Pangeo-Forge is a community project to build reproducible cloud-native ARCO (Analysis-Ready-Cloud-Optimized) datasets. The Python library (`pangeo-forge-recipes`) is the ETL pipeline to process these datasets or \"recipes\". While a majority of the recipes convert a legacy format such as NetCDF to Zarr stores, `pangeo-forge-recipes` can also use Kerchunk under the hood to create reference recipes. \n", + "\n", + "It is important to note that `Kerchunk` can be used independently of `pangeo-forge-recipes` and in this example, `pangeo-forge-recipes` is acting as the runner for `Kerchunk`. \n", + "\n", + "\n", + "\n", + "## Prerequisites\n", + "| Concepts | Importance | Notes |\n", + "| --- | --- | --- |\n", + "| [Kerchunk Basics](../foundations/kerchunk_basics) | Required | Core |\n", + "| [Multiple Files and Kerchunk](../foundations/kerchunk_multi_file) | Required | Core |\n", + "| [Multi-File Datasets with Kerchunk](../case_studies/ARG_Weather.ipynb) | Required | IO/Visualization |\n", + "\n", + "- **Time to learn**: 45 minutes\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Getting to Know The Data\n", + "\n", + "`gridMET` is a high-resolution daily meteorological dataset covering CONUS from 1979-2023. It is produced by the Climatology Lab at UC Merced. In this example, we are going to look create a virtual Zarr dataset of a derived variable, Burn Index. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Examine a Single File" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import xarray as xr\n", + "\n", + "ds = xr.open_dataset(\n", + " \"http://thredds.northwestknowledge.net:8080/thredds/dodsC/MET/bi/bi_2021.nc\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Plot the Dataset" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "ds.sel(day=\"2021-08-01\").burning_index_g.plot()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create a File Pattern\n", + "\n", + "To build our `pangeo-forge` pipeline, we need to create a `FilePattern` object, which is composed of all of our input urls. This dataset ranges from 1979 through 2023 and is composed of one year per file. 
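As a quick illustration of what the `FilePattern` is built from, each year in the dataset maps to exactly one input URL through a small format function. The real pattern is constructed in the next cell; the preview below only makes that year-to-URL mapping concrete.

```python
# Preview only: how one ConcatDim key (a year) maps to a single input URL.
# The same function is defined for real in the following cell.
def format_function(time):
    return f"http://www.northwestknowledge.net/metdata/data/bi_{time}.nc"


years = list(range(1979, 2022 + 1))
print(format_function(years[0]))   # .../bi_1979.nc
print(format_function(years[-1]))  # .../bi_2022.nc
```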
\n", + " \n", + "To speed up our example, we will `prune` our recipe to select the first two entries in the `FilePattern`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "from pangeo_forge_recipes.patterns import ConcatDim, FilePattern, MergeDim\n", + "\n", + "years = list(range(1979, 2022 + 1))\n", + "\n", + "\n", + "time_dim = ConcatDim(\"time\", keys=years)\n", + "\n", + "\n", + "def format_function(time):\n", + " return f\"http://www.northwestknowledge.net/metdata/data/bi_{time}.nc\"\n", + "\n", + "\n", + "pattern = FilePattern(format_function, time_dim, file_type=\"netcdf4\")\n", + "\n", + "\n", + "pattern = pattern.prune()\n", + "\n", + "pattern" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create a Location For Output\n", + "We write to local storage for this example, but the reference file could also be shared via cloud storage. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "target_root = \"references\"\n", + "store_name = \"Pangeo_Forge\"" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Build the Pangeo-Forge Beam Pipeline\n", + "\n", + "Next, we will chain together a bunch of methods to create a Pangeo-Forge - Apache Beam pipeline. \n", + "Processing steps are chained together with the pipe operator (`|`). Once the pipeline is built, it can be ran in the following cell. \n", + "\n", + "The steps are as follows:\n", + "1. Creates a starting collection of our input file patterns.\n", + "2. Passes those file_patterns to `OpenWithKerchunk`, which creates references of each file.\n", + "3. Combines the references files into a single reference file and write them with `WriteCombineReferences`\n", + "\n", + "Just like Kerchunk, you can specify the reference file type as either `.json` or `.parquet`.\n", + "\n", + "Note: You can add additional processing steps in this pipeline. 
\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import apache_beam as beam\n", + "from pangeo_forge_recipes.transforms import (\n", + " OpenWithKerchunk,\n", + " WriteCombinedReference,\n", + ")\n", + "\n", + "transforms = (\n", + " # Create a beam PCollection from our input file pattern\n", + " beam.Create(pattern.items())\n", + " # Open with Kerchunk and create references for each file\n", + " | OpenWithKerchunk(file_type=pattern.file_type)\n", + " # Use Kerchunk's `MultiZarrToZarr` functionality to combine and then write references.\n", + " # *Note*: Setting the correct contact_dims and identical_dims is important.\n", + " | WriteCombinedReference(\n", + " target_root=target_root,\n", + " store_name=store_name,\n", + " output_file_name=\"reference.json\",\n", + " concat_dims=[\"day\"],\n", + " identical_dims=[\"lat\", \"lon\", \"crs\"],\n", + " )\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%time\n", + "\n", + "with beam.Pipeline() as p:\n", + " p | transforms" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.11" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/notebooks/foundations/04_kerchunk_reference_storage.ipynb b/notebooks/advanced/Parquet_Reference_Storage.ipynb similarity index 99% rename from notebooks/foundations/04_kerchunk_reference_storage.ipynb rename to notebooks/advanced/Parquet_Reference_Storage.ipynb index 884c7ac..79f8d58 100644 --- a/notebooks/foundations/04_kerchunk_reference_storage.ipynb +++ b/notebooks/advanced/Parquet_Reference_Storage.ipynb @@ -270,7 +270,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.12" + "version": "3.11.0" } }, "nbformat": 4, diff --git a/notebooks/case_studies/NetCDF_Pangeo_Forge_gridMET.ipynb b/notebooks/case_studies/NetCDF_Pangeo_Forge_gridMET.ipynb deleted file mode 100644 index b711817..0000000 --- a/notebooks/case_studies/NetCDF_Pangeo_Forge_gridMET.ipynb +++ /dev/null @@ -1,526 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Kerchunk and Pangeo-Forge\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Overview\n", - "\n", - "In this tutorial we are going to use Kerchunk to create reference files of a dataset. \n", - "This allows us to read an entire dataset as if it were a single Zarr store instead of a collection of NetCDF files. \n", - "Using Kerchunk, we don't have to create a copy of the data, instead we create a collection of reference files, so that the original data files can be read as if they were Zarr.\n", - "\n", - "\n", - "This notebook shares some similarities with the [Multi-File Datasets with Kerchunk](../case_studies/ARG_Weather.ipynb), as they both create references from NetCDF files. 
However, this notebook differs as it uses `Pangeo-Forge` as the runner to create the reference files.\n", - "\n", - "\n", - "\n", - "## Prerequisites\n", - "| Concepts | Importance | Notes |\n", - "| --- | --- | --- |\n", - "| [Kerchunk Basics](../foundations/kerchunk_basics) | Required | Core |\n", - "| [Multiple Files and Kerchunk](../foundations/kerchunk_multi_file) | Required | Core |\n", - "| [Kerchunk and Dask](../foundations/kerchunk_dask) | Required | Core |\n", - "| [Multi-File Datasets with Kerchunk](../case_studies/ARG_Weather.ipynb) | Required | IO/Visualization |\n", - "\n", - "- **Time to learn**: 45 minutes\n", - "---" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Motivation" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Why Kerchunk\n", - "\n", - "For many traditional data processing pipelines, the start involves download a large amount of files to a local computer and then subsetting them for future analysis. Kerchunk gives us two large advantages: \n", - "1. A massive reduction in used disk space.\n", - "2. Performance improvements with through parallel, chunk-specific access of the dataset. \n", - "\n", - "In addition to these speedups, once the consolidated Kerchunk reference file has been created, it can be easily shared for other users to access the dataset. \n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "tags": [] - }, - "source": [ - "## Pangeo-Forge & Kerchunk\n", - "\n", - "Pangeo-Forge is a community project to build reproducible cloud-native ARCO (Analysis-Ready-Cloud-Optimized) datasets. The Python library (`pangeo-forge-recipes`) is the ETL pipeline to process these datasets or \"recipes\". While a majority of the recipes convert a legacy format such as NetCDF to Zarr stores, `pangeo-forge-recipes` can also use Kerchunk under the hood to create reference recipes. \n", - "\n", - "It is important to note that `Kerchunk` can be used independently of `pangeo-forge-recipes` and in this example, `pangeo-forge-recipes` is acting as the runner for `Kerchunk`. " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Why Pangeo-Forge & Kerchunk\n", - "\n", - "While you can use `Kerchunk` without `pangeo-forge`, we hope that `pangeo-forge` can be another tool to create sharable ARCO datasets using `Kerchunk`. \n", - "A few potential benefits of creating `Kerchunk` based reference recipes with `pangeo-forge` may include:\n", - "- Recipe processing pipelines may be more standardized than case-by-case custom Kerchunk processing functions.\n", - "- Recipe processing can be scaled through `pangeo-forge-cloud` for large datasets.\n", - "- The infrastructure of `pangeo-forge` in GitHub may allow more community feedback on recipes.\n", - "- Additional features such as appending to datasets as new data is generated may be available in future releases of `pangeo-forge`.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Getting to Know The Data\n", - "\n", - "`gridMET` is a high-resolution daily meteorological dataset covering CONUS from 1979-2023. It is produced by the Climatology Lab at UC Merced. In this example, we are going to look create a virtual Zarr dataset of a derived variable, Burn Index. 
" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Examine a Single File" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "import xarray as xr\n", - "\n", - "ds = xr.open_dataset(\n", - " \"http://thredds.northwestknowledge.net:8080/thredds/dodsC/MET/bi/bi_2021.nc\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Plot the Dataset" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "ds.sel(day=\"2021-08-01\").burning_index_g.plot()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Create a File Pattern\n", - "\n", - "To build our `pangeo-forge` pipeline, we need to create a `FilePattern` object, which is composed of all of our input urls. This dataset ranges from 1979 through 2023 and is composed of one year per file. \n", - " \n", - "To speed up our example, we will `prune` our recipe to select the first two entries in the `FilePattern`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "from pangeo_forge_recipes.patterns import ConcatDim, FilePattern, MergeDim\n", - "\n", - "years = list(range(1979, 2022 + 1))\n", - "\n", - "\n", - "time_dim = ConcatDim(\"time\", keys=years)\n", - "\n", - "\n", - "def format_function(time):\n", - " return f\"http://www.northwestknowledge.net/metdata/data/bi_{time}.nc\"\n", - "\n", - "\n", - "pattern = FilePattern(format_function, time_dim, file_type=\"netcdf4\")\n", - "\n", - "\n", - "pattern = pattern.prune()\n", - "\n", - "pattern" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Create a Location For Output\n", - "We write to local storage for this example, but the reference file could also be shared via cloud storage. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "target_root = \"references\"\n", - "store_name = \"Pangeo_Forge\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Build the Pangeo-Forge Beam Pipeline\n", - "\n", - "Next, we will chain together a bunch of methods to create a Pangeo-Forge - Apache Beam pipeline. \n", - "Processing steps are chained together with the pipe operator (`|`). Once the pipeline is built, it can be ran in the following cell. \n", - "\n", - "The steps are as follows:\n", - "1. Creates a starting collection of our input file patterns.\n", - "2. Passes those file_patterns to `OpenWithKerchunk`, which creates references of each file.\n", - "3. Combines the references files into a single reference file with `CombineReferences`.\n", - "4. Writes the combined reference file.\n", - "\n", - "Note: You can add additional processing steps in this pipeline. 
\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "import apache_beam as beam\n", - "from pangeo_forge_recipes.transforms import (\n", - " OpenWithKerchunk,\n", - " WriteCombinedReference,\n", - ")\n", - "\n", - "transforms = (\n", - " # Create a beam PCollection from our input file pattern\n", - " beam.Create(pattern.items())\n", - " # Open with Kerchunk and create references for each file\n", - " | OpenWithKerchunk(file_type=pattern.file_type)\n", - " # Use Kerchunk's `MultiZarrToZarr` functionality to combine and then write references.\n", - " # *Note*: Setting the correct contact_dims and identical_dims is important.\n", - " | WriteCombinedReference(\n", - " target_root=target_root,\n", - " store_name=store_name,\n", - " concat_dims=[\"day\"],\n", - " identical_dims=[\"lat\", \"lon\", \"crs\"],\n", - " )\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "%%time\n", - "\n", - "with beam.Pipeline() as p:\n", - " p | transforms" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "import os\n", - "\n", - "import fsspec\n", - "\n", - "full_path = os.path.join(target_root, store_name, \"reference.json\")\n", - "print(os.path.getsize(full_path) / 1e6)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Our reference .json file is about 1MB, instead of 108GBs. That is quite the storage savings! " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Examine the Result" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "mapper = fsspec.get_mapper(\n", - " \"reference://\",\n", - " fo=full_path,\n", - " remote_protocol=\"http\",\n", - ")\n", - "ds = xr.open_dataset(\n", - " mapper, engine=\"zarr\", decode_coords=\"all\", backend_kwargs={\"consolidated\": False}\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "ds" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "ds.isel(day=220).burning_index_g.plot()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "tags": [] - }, - "source": [ - "## Access Speed Benchmark - Kerchunk vs NetCDF\n", - "\n", - "In the access test below, we had almost a 3x speedup in access time using the `Kerchunk` reference dataset vs the NetCDF file collection. This isn't a huge speed-up, but will vary a lot depending on chunking schema, access patterns etc. 
\n", - "| Kerchunk | Time (s) |\n", - "| ------------- | ----------- |\n", - "| Kerchunk | 10 |\n", - "| Cloud NetCDF | 28 |\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Kerchunk" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "import fsspec\n", - "import xarray as xr" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "%%time\n", - "\n", - "kerchunk_path = os.path.join(target_root, store_name, \"reference.json\")\n", - "\n", - "mapper = fsspec.get_mapper(\n", - " \"reference://\",\n", - " fo=kerchunk_path,\n", - " remote_protocol=\"http\",\n", - ")\n", - "kerchunk_ds = xr.open_dataset(\n", - " mapper, engine=\"zarr\", decode_coords=\"all\", backend_kwargs={\"consolidated\": False}\n", - ")\n", - "kerchunk_ds.sel(lat=slice(48, 47), lon=slice(-123, -122)).burning_index_g.max().values" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "kerchunk_ds" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "That took almost 10 seconds." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "tags": [] - }, - "source": [ - "### NetCDF Cloud Access" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# prepare urls\n", - "def url_gen(year):\n", - " return (\n", - " f\"http://thredds.northwestknowledge.net:8080/thredds/dodsC/MET/bi/bi_{year}.nc\"\n", - " )\n", - "\n", - "\n", - "urls_list = [url_gen(year) for year in years]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "%%time\n", - "netcdf_ds = xr.open_mfdataset(urls_list, engine=\"netcdf4\")\n", - "netcdf_ds.sel(lat=slice(48, 47), lon=slice(-123, -122)).burning_index_g.mean().values" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "tags": [] - }, - "source": [ - "That took about about 28 seconds. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "netcdf_ds" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "tags": [] - }, - "source": [ - "## Storage Benchmark - Kerchunk vs NetCDF \n", - "## 5200x Storage Savings\n", - "\n", - "| Storage | Mb (s) |\n", - "| ------------- | ----------- |\n", - "| Kerchunk | 10 |\n", - "| Cloud NetCDF | 52122 |\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# Kerchunk Reference File\n", - "import os\n", - "\n", - "print(f\"{round(os.path.getsize(kerchunk_path) / 1e6, 1)} Mb\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# NetCDF Files\n", - "print(f\"{round(netcdf_ds.nbytes/1e6,1)} Mb\")\n", - "print(\"or\")\n", - "print(f\"{round(netcdf_ds.nbytes/1e9,1)} Gb\")" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.11" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/notebooks/foundations/combined.parq/.zmetadata b/notebooks/foundations/combined.parq/.zmetadata new file mode 100644 index 0000000..450e248 --- /dev/null +++ b/notebooks/foundations/combined.parq/.zmetadata @@ -0,0 +1 @@ +{"metadata":{".zgroup":{"zarr_format":2},"time\/.zarray":{"chunks":[2],"compressor":null,"dtype":"\n" + "\"HRRR\n" ] }, { + "attachments": {}, "cell_type": "markdown", "id": "9eced552", "metadata": {}, @@ -28,7 +32,6 @@ "1. Generating a list of GRIB2 files on a remote filesystem using `fsspec`\n", "1. How to create reference files of GRIB2 files using ``Kerchunk`` \n", "1. Combining multiple `Kerchunk` reference files using `MultiZarrToZarr`\n", - "1. Reading the output with `Xarray` and `Intake`\n", "\n", "This notebook shares many similarities with the [Multi-File Datasets with Kerchunk](../foundations/kerchunk_multi_file.ipynb) and the [NetCDF/HDF5 Argentinian Weather Dataset Case Study](../case_studies/ARG_Weather.ipynb), however this case studies examines another data format and uses `kerchunk.scan_grib` to create reference files. 
\n", "\n", @@ -42,7 +45,6 @@ "| [Multiple Files and Kerchunk](../foundations/kerchunk_multi_file) | Required | Core |\n", "| [Kerchunk and Dask](../foundations/kerchunk_dask) | Required | Core |\n", "| [Introduction to Xarray](https://foundations.projectpythia.org/core/xarray/xarray-intro.html) | Required | IO/Visualization |\n", - "| [Intake Introduction](https://projectpythia.org/intake-cookbook/notebooks/intake_introduction.html) | Recommended | IO |\n", "- **Time to learn**: 45 minutes\n", "---" ] @@ -113,8 +115,7 @@ "import xarray as xr\n", "from distributed import Client\n", "from kerchunk.combine import MultiZarrToZarr\n", - "from kerchunk.grib2 import scan_grib\n", - "from tqdm import tqdm" + "from kerchunk.grib2 import scan_grib" ] }, { @@ -152,16 +153,6 @@ " files = files[0:2]" ] }, - { - "cell_type": "code", - "execution_count": null, - "id": "a2e351e5", - "metadata": {}, - "outputs": [], - "source": [ - "files[0]" - ] - }, { "cell_type": "markdown", "id": "3bec2881", @@ -287,16 +278,6 @@ "multi_kerchunk = mzz.translate()" ] }, - { - "cell_type": "code", - "execution_count": null, - "id": "aa3e29dd", - "metadata": {}, - "outputs": [], - "source": [ - "multi_kerchunk" - ] - }, { "cell_type": "markdown", "id": "59c4209d", @@ -313,78 +294,10 @@ "outputs": [], "source": [ "# Write Kerchunk .json record\n", - "output_fname = \"references/HRRR_combined.json\"\n", + "output_fname = \"HRRR_combined.json\"\n", "with open(f\"{output_fname}\", \"wb\") as f:\n", " f.write(ujson.dumps(multi_kerchunk).encode())" ] - }, - { - "cell_type": "markdown", - "id": "28283d9e", - "metadata": {}, - "source": [ - "## Load Kerchunked dataset" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c05643d6", - "metadata": {}, - "outputs": [], - "source": [ - "# open dataset as zarr object using fsspec reference file system and Xarray\n", - "fs = fsspec.filesystem(\n", - " \"reference\",\n", - " fo=\"references/HRRR_combined.json\",\n", - " remote_protocol=\"s3\",\n", - " remote_options={\"anon\": True},\n", - ")\n", - "m = fs.get_mapper(\"\")\n", - "ds = xr.open_dataset(\n", - " m, engine=\"zarr\", backend_kwargs=dict(consolidated=False), chunks={\"valid_time\": 1}\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fcf6d3e5", - "metadata": {}, - "outputs": [], - "source": [ - "ds" - ] - }, - { - "cell_type": "markdown", - "id": "24cde8c5", - "metadata": {}, - "source": [ - "## Plot a slice of the dataset\n", - "\n", - "Here we are using `Xarray` to select a single time slice of the dataset and plot a temperature map of CONUS." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d803d720", - "metadata": {}, - "outputs": [], - "source": [ - "ds[\"t2m\"][-1].plot()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "52a3b1dd", - "metadata": {}, - "outputs": [], - "source": [ - "ds[\"t2m\"][:, 500, 500].plot()" - ] } ], "metadata": { diff --git a/notebooks/case_studies/GeoTIFF_FMI.ipynb b/notebooks/generating_references/GeoTIFF.ipynb similarity index 73% rename from notebooks/case_studies/GeoTIFF_FMI.ipynb rename to notebooks/generating_references/GeoTIFF.ipynb index 8889893..3fefc11 100644 --- a/notebooks/case_studies/GeoTIFF_FMI.ipynb +++ b/notebooks/generating_references/GeoTIFF.ipynb @@ -1,10 +1,12 @@ { "cells": [ { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "# Kerchunk, GeoTIFF and Generating Coordinates with `xrefcoord`\n" + "# GeoTIFF\n", + "Generating Kerchunk References from GeoTIFF files" ] }, { @@ -15,6 +17,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -24,7 +27,6 @@ "\n", "1. How to generate `Kerchunk` references of GeoTIFFs.\n", "1. Combining `Kerchunk` references into a virtual dataset.\n", - "1. Generating Coordinates with the `xrefcoord` accessor.\n", "\n", "\n", "## Prerequisites\n", @@ -69,11 +71,9 @@ "import s3fs\n", "import ujson\n", "import xarray as xr\n", - "import xrefcoord # noqa\n", "from distributed import Client\n", "from kerchunk.combine import MultiZarrToZarr\n", - "from kerchunk.tiff import tiff_to_zarr\n", - "from tqdm import tqdm" + "from kerchunk.tiff import tiff_to_zarr" ] }, { @@ -256,101 +256,10 @@ "multi_kerchunk = mzz.translate()\n", "\n", "# Write kerchunk .json record\n", - "output_fname = \"references/RADAR.json\"\n", + "output_fname = \"RADAR.json\"\n", "with open(f\"{output_fname}\", \"wb\") as f:\n", " f.write(ujson.dumps(multi_kerchunk).encode())" ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Open Combined Reference Dataset" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "fs = fsspec.filesystem(\n", - " \"reference\",\n", - " fo=\"references/RADAR.json\",\n", - " remote_protocol=\"s3\",\n", - " remote_options={\"anon\": True},\n", - " skip_instance_cache=True,\n", - ")\n", - "m = fs.get_mapper(\"\")\n", - "ds = xr.open_dataset(m, engine=\"zarr\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ds" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Use `xrefcoord` to Generate Coordinates\n", - "When using `Kerchunk` to generate reference datasets for GeoTIFF's, only the dimensions are preserved. `xrefcoord` is a small utility that allows us to generate coordinates for these reference datasets using the geospatial metadata. Similar to other accessor add-on libraries for `Xarray` such as `rioxarray` and `xwrf`, `xrefcord` provides an accessor for an `Xarray` dataset. Importing `xrefcoord` allows us to use the `.xref` accessor to access additional methods. \n", - "\n", - "In the following cell, we will use the `generate_coords` method to build coordinates for the `Xarray` dataset. `xrefcoord` is *very experimental* and makes assumptions about the underlying data, such as each variable shares the same dimensions etc. 
Use with caution!\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Generate coordinates from reference dataset\n", - "ref_ds = ds.xref.generate_coords(time_dim_name=\"time\", x_dim_name=\"X\", y_dim_name=\"Y\")\n", - "# Rename to rain accumulation in 24 hour period\n", - "ref_ds = ref_ds.rename({\"0\": \"rr24h\"})" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Create a Map\n", - "\n", - "Here we are using `Xarray` to select a single time slice and create a map of 24 hour accumulated rainfall." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ref_ds[\"rr24h\"].where(ref_ds.rr24h < 60000).isel(time=0).plot(robust=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Create a Time-Series\n", - "\n", - "Next we are plotting accumulated rain as a function of time for a specific point." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ref_ds[\"rr24h\"][:, 700, 700].plot()" - ] } ], "metadata": { diff --git a/notebooks/case_studies/NetCDF_SMN_Arg.ipynb b/notebooks/generating_references/NetCDF.ipynb similarity index 78% rename from notebooks/case_studies/NetCDF_SMN_Arg.ipynb rename to notebooks/generating_references/NetCDF.ipynb index 832d583..ab507ec 100644 --- a/notebooks/case_studies/NetCDF_SMN_Arg.ipynb +++ b/notebooks/generating_references/NetCDF.ipynb @@ -5,7 +5,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Kerchunk and NetCDF/HDF5: A Case Study using the Argentinian High Resolution Weather Forecast Dataset\n" + "# NetCDF\n", + "Generating Kerchunk References from NetCDF files\n" ] }, { @@ -13,7 +14,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\"ARG\"" + "\"ARG\"" ] }, { @@ -27,7 +28,6 @@ "\n", "1. How to access remote NetCDF data using `Kerchunk`\n", "1. Combining multiple `Kerchunk` reference files using `MultiZarrToZarr`\n", - "1. Reading the output with `Xarray` and `Intake`\n", "\n", "This notebook shares many similarities with the [Multi-File Datasets with Kerchunk](../foundations/kerchunk_multi_file.ipynb). If you are confused on the function of a block of code, please refer there for a more detailed breakdown of what each line is doing.\n", "\n", @@ -38,8 +38,6 @@ "| [Kerchunk Basics](../foundations/kerchunk_basics) | Required | Core |\n", "| [Multiple Files and Kerchunk](../foundations/kerchunk_multi_file) | Required | Core |\n", "| [Kerchunk and Dask](../foundations/kerchunk_dask) | Required | Core |\n", - "| [Introduction to Xarray](https://foundations.projectpythia.org/core/xarray/xarray-intro.html) | Required | IO/Visualization |\n", - "| [Intake Introduction](https://projectpythia.org/intake-cookbook/notebooks/intake_introduction.html) | Recommended | IO |\n", "- **Time to learn**: 45 minutes\n", "---" ] @@ -61,7 +59,7 @@ "source": [ "## About the Dataset\n", "\n", - "The SMN-Arg is a WRF deterministic weather forecasting dataset created by the `Servicio Meteorológico Nacional de Argentina` that covers Argentina as well as many neighboring countries at a 4km spatial resolution. \n", + "For this example, we will look at a weather dataset composed of multiple NetCDF files.The SMN-Arg is a WRF deterministic weather forecasting dataset created by the `Servicio Meteorológico Nacional de Argentina` that covers Argentina as well as many neighboring countries at a 4km spatial resolution. 
\n", "The model is initialized twice daily at 00 & 12 UTC with hourly forecasts for variables such as temperature, relative humidity, precipitation, wind direction and magnitude etc. for multiple atmospheric levels.\n", "The data is output at hourly intervals with a maximum prediction lead time of 72 hours in NetCDF files.\n", "\n", @@ -112,8 +110,7 @@ "import xarray as xr\n", "from distributed import Client\n", "from kerchunk.combine import MultiZarrToZarr\n", - "from kerchunk.hdf import SingleHdf5ToZarr\n", - "from tqdm import tqdm" + "from kerchunk.hdf import SingleHdf5ToZarr" ] }, { @@ -289,86 +286,11 @@ "# save translate reference in memory for later visualization\n", "multi_kerchunk = mzz.translate()\n", "\n", - "# Write kerchunk .json record\n", - "output_fname = \"references/ARG_combined.json\"\n", + "# Write kerchunk .json record.\n", + "output_fname = \"ARG_combined.json\"\n", "with open(f\"{output_fname}\", \"wb\") as f:\n", " f.write(ujson.dumps(multi_kerchunk).encode())" ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Load kerchunked dataset\n", - "\n", - "Now the dataset is a logical view over all of the files we scanned." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# create an fsspec reference filesystem from the Kerchunk output\n", - "import fsspec\n", - "\n", - "fs = fsspec.filesystem(\n", - " \"reference\",\n", - " fo=\"references/ARG_combined.json\",\n", - " remote_protocol=\"s3\",\n", - " remote_options={\"anon\": True},\n", - " skip_instance_cache=True,\n", - ")\n", - "m = fs.get_mapper(\"\")\n", - "ds = xr.open_dataset(m, engine=\"zarr\")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Create a Map\n", - "\n", - "Here we are using `Xarray` to select a single time slice and create a map of 2-m temperature across the region." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ds.isel(time=0).T2.plot()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Create a Time-Series\n", - "\n", - "Next we are plotting temperature as a function of time for a specific point." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ds[\"T2\"][:, 500, 500].plot()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": { diff --git a/notebooks/case_studies/Kerchunk_DataTree.ipynb b/notebooks/using_references/Datatree.ipynb similarity index 99% rename from notebooks/case_studies/Kerchunk_DataTree.ipynb rename to notebooks/using_references/Datatree.ipynb index 3048b64..7140010 100644 --- a/notebooks/case_studies/Kerchunk_DataTree.ipynb +++ b/notebooks/using_references/Datatree.ipynb @@ -5,7 +5,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Kerchunk and Xarray-Datatree at Scale" + "# Kerchunk and Xarray-Datatree" ] }, { @@ -68,7 +68,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Read the Reference Catalog\n", + "## Read the reference catalog\n", "\n", "The **NASA NEX-GDDP-CMIP6** dataset is organized by GCM, Scenario and Ensemble Member. Each of these Scenario/GCM combinations is represented as a combined reference file, which was created by merging across variables and concatenating along time-steps. 
All of these references are organized into a simple `.csv` catalog in the schema: \n", "| GCM/Scenario | url |\n", diff --git a/notebooks/case_studies/Streaming_Visualizations_with_Hvplot_Datashader.ipynb b/notebooks/using_references/Hvplot_Datashader.ipynb similarity index 68% rename from notebooks/case_studies/Streaming_Visualizations_with_Hvplot_Datashader.ipynb rename to notebooks/using_references/Hvplot_Datashader.ipynb index c96be7e..fa79079 100644 --- a/notebooks/case_studies/Streaming_Visualizations_with_Hvplot_Datashader.ipynb +++ b/notebooks/using_references/Hvplot_Datashader.ipynb @@ -9,6 +9,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "9eced552", "metadata": {}, @@ -17,13 +18,12 @@ " \n", "This notebook will demonstrate how to use Kerchunk with hvPlot and Datashader to lazily visualize a reference dataset in a streaming fashion.\n", - "We will be building off content from [Kerchunk and Pangeo-Forge](../case_studies/NetCDF_Pangeo_Forge_gridMET.ipynb), so it's encouraged you first go through that.\n", + "We will be building off the references generated in the [Pangeo-Forge](../advanced/Pangeo_Forge.ipynb) notebook, so it is encouraged that you go through that notebook first.\n", "\n", "## Prerequisites\n", "| Concepts | Importance | Notes |\n", "| --- | --- | --- |\n", "| [Kerchunk Basics](../foundations/kerchunk_basics) | Required | Core |\n", - "| [Multiple Files and Kerchunk](../foundations/kerchunk_multi_file) | Required | Core |\n", "| [Introduction to Xarray](https://foundations.projectpythia.org/core/xarray/xarray-intro.html) | Required | IO |\n", "| [Introduction to hvPlot](https://hvplot.holoviz.org/) | Required | Data Visualization |\n", "| [Introduction to Datashader](https://datashader.org/index.html) | Required | Big Data Visualization |\n", @@ -44,11 +44,12 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "8e2c4765", "metadata": {}, "source": [ - "## Getting to Know The Data\n", + "### Getting to Know The Data\n", "\n", "`gridMET` is a high-resolution daily meteorological dataset covering CONUS from 1979-2023. It is produced by the Climatology Lab at UC Merced. In this example, we are going to look create a virtual Zarr dataset of a derived variable, Burn Index. 
" ] @@ -68,65 +69,8 @@ "metadata": {}, "outputs": [], "source": [ - "import os\n", - "import time\n", - "\n", - "import apache_beam as beam\n", - "import fsspec\n", "import hvplot.xarray\n", - "import xarray as xr\n", - "from pangeo_forge_recipes.patterns import ConcatDim, FilePattern\n", - "from pangeo_forge_recipes.transforms import (\n", - " OpenWithKerchunk,\n", - " WriteCombinedReference,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "858399ce", - "metadata": {}, - "source": [ - "## Preprocess Dataset\n", - "\n", - "Here we will be preparing the Kerchunk reference files by using the recipe described in [Kerchunk and Pangeo-Forge](../case_studies/NetCDF_Pangeo_Forge_gridMET.ipynb).\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "06fb2af3", - "metadata": {}, - "outputs": [], - "source": [ - "# Constants\n", - "target_root = \"references\"\n", - "store_name = \"Pangeo_Forge\"\n", - "full_path = os.path.join(target_root, store_name, \"reference.json\")\n", - "years = list(range(1979, 1980))\n", - "time_dim = ConcatDim(\"time\", keys=years)\n", - "\n", - "\n", - "# Functions\n", - "def format_function(time):\n", - " return f\"http://www.northwestknowledge.net/metdata/data/bi_{time}.nc\"\n", - "\n", - "\n", - "# Patterns\n", - "pattern = FilePattern(format_function, time_dim, file_type=\"netcdf4\")\n", - "pattern = pattern.prune()\n", - "\n", - "# Apache Beam transforms\n", - "transforms = (\n", - " beam.Create(pattern.items())\n", - " | OpenWithKerchunk(file_type=pattern.file_type)\n", - " | WriteCombinedReference(\n", - " target_root=target_root,\n", - " store_name=store_name,\n", - " concat_dims=[\"day\"],\n", - " identical_dims=[\"lat\", \"lon\", \"crs\"],\n", - " )\n", - ")" + "import xarray as xr" ] }, { @@ -150,14 +94,20 @@ "source": [ "%%timeit -r 1 -n 1\n", "\n", - "mapper = fsspec.get_mapper(\n", - " \"reference://\",\n", - " fo=full_path,\n", - " remote_protocol=\"http\",\n", - ")\n", + "\n", + "storage_options = {\n", + " \"remote_protocol\": \"http\",\n", + " \"skip_instance_cache\": True,\n", + "} # options passed to fsspec\n", + "open_dataset_options = {\"chunks\": {}, \"decode_coords\": \"all\"} # opens passed to xarray\n", + "\n", "ds_kerchunk = xr.open_dataset(\n", - " mapper, engine=\"zarr\", decode_coords=\"all\", backend_kwargs={\"consolidated\": False}\n", + " \"references/Pangeo_Forge/reference.json\",\n", + " engine=\"kerchunk\",\n", + " storage_options=storage_options,\n", + " open_dataset_options=open_dataset_options,\n", ")\n", + "\n", "display(ds_kerchunk.hvplot(\"lon\", \"lat\", rasterize=True))" ] }, @@ -199,6 +149,7 @@ " )\n", "\n", "\n", + "years = list(range(1979, 1980))\n", "urls_list = [url_gen(year) for year in years]\n", "netcdf_ds = xr.open_mfdataset(urls_list, engine=\"netcdf4\")\n", "display(netcdf_ds.hvplot(\"lon\", \"lat\", rasterize=True))" @@ -229,7 +180,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.11" + "version": "3.10.12" }, "vscode": { "interpreter": { diff --git a/notebooks/using_references/Xarray.ipynb b/notebooks/using_references/Xarray.ipynb new file mode 100644 index 0000000..c5afb09 --- /dev/null +++ b/notebooks/using_references/Xarray.ipynb @@ -0,0 +1,115 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Load Kerchunked dataset with Xarray\n", + "\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Overview\n", + " \n", + "Within this 
notebook, we will cover:\n", + "\n", + "1. How to load a Kerchunk pre-generated reference file into Xarray as if it were a Zarr store.\n", + "\n", + "## Prerequisites\n", + "| Concepts | Importance | Notes |\n", + "| --- | --- | --- |\n", + "| [Kerchunk Basics](../foundations/kerchunk_basics) | Required | Core |\n", + "| [Xarray Tutorial](https://tutorial.xarray.dev/intro.html) | Required | Core |\n", + "\n", + "- **Time to learn**: 45 minutes\n", + "---" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Opening a Reference Dataset with Fsspec and Xarray\n", + "One way of using our reference dataset is opening it with `Xarray`. To do this, we will create an `fsspec` filesystem and pass it to `Xarray`. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# create an fsspec reference filesystem from the Kerchunk output\n", + "import fsspec\n", + "import xarray as xr\n", + "import kerchunk\n", + "\n", + "fs = fsspec.filesystem(\n", + " \"reference\",\n", + " fo=\"references/ARG_combined.json\",\n", + " remote_protocol=\"s3\",\n", + " skip_instance_cache=True,\n", + ")\n", + "m = fs.get_mapper(\"\")\n", + "ds = xr.open_dataset(m, engine=\"zarr\")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Opening a Reference Dataset with Xarray and the `Kerchunk` Engine\n", + "As of writing, the latest version of Kerchunk supports opening a reference dataset with Xarray without specifically creating an fsspec filesystem. This is the same behavior as the example above, just with fewer lines of code. \n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "storage_options = {\n", + " \"remote_protocol\": \"s3\",\n", + " \"skip_instance_cache\": True,\n", + "} # options passed to fsspec\n", + "open_dataset_options = {\"chunks\": {}} # options passed to xarray\n", + "\n", + "ds = xr.open_dataset(\n", + " \"references/ARG_combined.json\",\n", + " engine=\"kerchunk\",\n", + " storage_options=storage_options,\n", + " open_dataset_options=open_dataset_options,\n", + ")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/using_references/Xrefcoord.ipynb b/notebooks/using_references/Xrefcoord.ipynb new file mode 100644 index 0000000..f23a9a6 --- /dev/null +++ b/notebooks/using_references/Xrefcoord.ipynb @@ -0,0 +1,134 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Use `xrefcoord` to Generate Coordinates\n", + "\n", + "When using `Kerchunk` to generate reference datasets for GeoTIFFs, only the dimensions are preserved. `xrefcoord` is a small utility that allows us to generate coordinates for these reference datasets using the geospatial metadata. Similar to other accessor add-on libraries for `Xarray` such as `rioxarray` and `xwrf`, `xrefcoord` provides an accessor for an `Xarray` dataset. Importing `xrefcoord` allows us to use the `.xref` accessor to access additional methods. 
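For readers new to the accessor pattern referenced above, the toy example below shows how an `Xarray` accessor is registered in general; it is purely illustrative and is not `xrefcoord`'s actual implementation.

```python
import xarray as xr


@xr.register_dataset_accessor("toy")
class ToyAccessor:
    """Illustrative accessor: every Dataset gains a `.toy` attribute."""

    def __init__(self, ds: xr.Dataset):
        self._ds = ds

    def dim_summary(self) -> dict:
        # Report the size of each dimension of the wrapped dataset.
        return dict(self._ds.sizes)


ds = xr.Dataset({"a": (("x", "y"), [[1, 2], [3, 4]])})
print(ds.toy.dim_summary())  # {'x': 2, 'y': 2}
```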
\n", + "\n", + "In this tutorial we will use the `generate_coords` method to build coordinates for the `Xarray` dataset. `xrefcoord` is *very experimental* and makes assumptions about the underlying data, such as each variable shares the same dimensions etc. Use with caution!\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Overview\n", + " \n", + "Within this notebook, we will cover:\n", + "\n", + "1. How to load a Kerchunk reference dataset created from a collection of GeoTIFFs\n", + "1. How to use `xrefcoord` to generate coordinates from a GeoTIFF reference dataset\n", + "\n", + "## Prerequisites\n", + "| Concepts | Importance | Notes |\n", + "| --- | --- | --- |\n", + "| [Kerchunk Basics](../foundations/kerchunk_basics) | Required | Core |\n", + "| [Xarray Tutorial](https://tutorial.xarray.dev/intro.html) | Required | Core |\n", + "\n", + "- **Time to learn**: 45 minutes\n", + "---" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import fsspec\n", + "import xarray as xr\n", + "import xrefcoord\n", + "\n", + "\n", + "storage_options = {\n", + " \"remote_protocol\": \"s3\",\n", + " \"skip_instance_cache\": True,\n", + "} # options passed to fsspec\n", + "open_dataset_options = {\"chunks\": {}} # opens passed to xarray\n", + "\n", + "ds = xr.open_dataset(\n", + " \"references/RADAR.json\",\n", + " engine=\"kerchunk\",\n", + " storage_options=storage_options,\n", + " open_dataset_options=open_dataset_options,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Generate coordinates from reference dataset\n", + "ref_ds = ds.xref.generate_coords(time_dim_name=\"time\", x_dim_name=\"X\", y_dim_name=\"Y\")\n", + "# Rename to rain accumulation in 24 hour period\n", + "ref_ds = ref_ds.rename({\"0\": \"rr24h\"})" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create a Map\n", + "\n", + "Here we are using `Xarray` to select a single time slice and create a map of 24 hour accumulated rainfall." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ref_ds[\"rr24h\"].where(ref_ds.rr24h < 60000).isel(time=0).plot(robust=True)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create a Time-Series\n", + "\n", + "Next we are plotting accumulated rain as a function of time for a specific point." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ref_ds[\"rr24h\"][:, 700, 700].plot()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/case_studies/references/ARG_combined.json b/notebooks/using_references/references/ARG_combined.json similarity index 100% rename from notebooks/case_studies/references/ARG_combined.json rename to notebooks/using_references/references/ARG_combined.json diff --git a/notebooks/case_studies/references/HRRR_combined.json b/notebooks/using_references/references/HRRR_combined.json similarity index 100% rename from notebooks/case_studies/references/HRRR_combined.json rename to notebooks/using_references/references/HRRR_combined.json diff --git a/notebooks/case_studies/references/Pangeo_Forge/reference.json b/notebooks/using_references/references/Pangeo_Forge/reference.json similarity index 100% rename from notebooks/case_studies/references/Pangeo_Forge/reference.json rename to notebooks/using_references/references/Pangeo_Forge/reference.json diff --git a/notebooks/case_studies/references/RADAR.json b/notebooks/using_references/references/RADAR.json similarity index 100% rename from notebooks/case_studies/references/RADAR.json rename to notebooks/using_references/references/RADAR.json