Update notebooks to use VirtualiZarr (#69)
maxrjones authored Dec 2, 2024
1 parent f70420e commit c890147
Showing 8 changed files with 328 additions and 503 deletions.
44 changes: 22 additions & 22 deletions README.md
@@ -1,33 +1,33 @@
<img src="thumbnail.png" alt="thumbnail" width="300"/>

# Kerchunk Cookbook
# Virtual Zarr Cookbook (Kerchunk and VirtualiZarr)

[![nightly-build](https://github.com/ProjectPythia/kerchunk-cookbook/actions/workflows/nightly-build.yaml/badge.svg)](https://github.com/ProjectPythia/kerchunk-cookbook/actions/workflows/nightly-build.yaml)
[![Binder](https://binder.projectpythia.org/badge_logo.svg)](https://binder.projectpythia.org/v2/gh/ProjectPythia/kerchunk-cookbook/main?labpath=notebooks)
[![DOI](https://zenodo.org/badge/588661659.svg)](https://zenodo.org/badge/latestdoi/588661659)

This Project Pythia Cookbook covers using the [Kerchunk](https://fsspec.github.io/kerchunk/)
library to access archival data formats as if they were
ARCO (Analysis-Ready-Cloud-Optimized) data.
This Project Pythia Cookbook covers using the [Kerchunk](https://fsspec.github.io/kerchunk/), [VirtualiZarr](https://virtualizarr.readthedocs.io/en/latest/index.html), and [Zarr-Python](https://zarr.readthedocs.io/en/stable/) libraries to access archival data formats as if they were ARCO (Analysis-Ready-Cloud-Optimized) data.

## Motivation

The `Kerchunk` library allows you to access chunked and compressed
The `Kerchunk` library pioneered access to chunked and compressed
data formats (such as NetCDF3, HDF5, GRIB2, TIFF & FITS), many of
which are the primary data formats for many data archives, as if
they were in ARCO formats such as Zarr, which allows for parallel,
chunk-specific access. Instead of creating a new copy of the dataset
in the Zarr spec/format, `Kerchunk` reads through the data archive
and extracts the byte range and compression information of each
chunk, then writes that information to a .json file (or alternate
backends in future releases). For more details on how this process
works please see this page on the
[Kerchunk docs](https://fsspec.github.io/kerchunk/detail.html)).
These summary files can then be combined to generated a `Kerchunk`
reference for that dataset, which can be read via
[Zarr](https://zarr.readthedocs.io) and
chunk, then writes that information to a "virtual Zarr store" using a
JSON or Parquet "reference file". The `VirtualiZarr`
library provides a simple way to create these \"virtual stores\" using familiar
`xarray` syntax. Lastly, the `icechunk` library offers a new way to store and re-use these references.

These virtual Zarr stores can be re-used and read via [Zarr](https://zarr.readthedocs.io) and
[Xarray](https://docs.xarray.dev/en/stable/).

For more details on how this process works, please see this page in the
[Kerchunk docs](https://fsspec.github.io/kerchunk/detail.html).
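
As a rough sketch of this workflow (a minimal, hypothetical example: the file names are placeholders, and the calls follow the VirtualiZarr and Kerchunk APIs used in the cookbook notebooks, which may change between releases):

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

# Open each archival file as a "virtual" dataset of chunk references
# (placeholder file names; no data is copied or rewritten)
virtual_datasets = [
    open_virtual_dataset(path, indexes={}) for path in ["day1.nc", "day2.nc"]
]

# Concatenate the per-file references along time without loading any chunks
combined = xr.combine_nested(
    virtual_datasets, concat_dim=["time"], coords="minimal", compat="override"
)

# Persist the combined references as a Kerchunk JSON reference file
combined.virtualize.to_kerchunk("combined.json", format="json")

# Later, read the virtual Zarr store lazily with Xarray
ds = xr.open_dataset("combined.json", engine="kerchunk", chunks={})
```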

## Authors

[Raphael Hagen](https://github.com/norlandrhagen)
@@ -48,24 +48,24 @@ the creator of `Kerchunk` and the
This cookbook is broken up into two sections,
Foundations and Example Notebooks.

### Section 1 Foundations
### Section 1 - Foundations

In the `Foundations` section we will demonstrate
how to use `Kerchunk` to create reference sets
how to use `Kerchunk` and `VirtualiZarr` to create reference files
from single file sources, as well as to create
multi-file virtual datasets from collections of files.
multi-file virtual Zarr stores from collections of files.

### Section 2 Generating Reference Files
### Section 2 - Generating Virtual Zarr Stores

The notebooks in the `Generating Reference Files` section
demonstrate how to use `Kerchunk` to create
The notebooks in the `Generating Virtual Zarr Stores` section
demonstrate how to use `Kerchunk` and `VirtualiZarr` to create
datasets for all the supported file formats.
`Kerchunk` currently supports NetCDF3,
NetCDF4/HDF5, GRIB2, TIFF (including CoG).
These libraries currently support virtualizing NetCDF3,
NetCDF4/HDF5, GRIB2, TIFF (including COG).

### Section 3 Using Pre-Generated References
### Section 3 - Using Virtual Zarr Stores

The `Pre-Generated References` section contains notebooks demonstrating how to load existing references into `Xarray` and `Xarray-Datatree`, generated coordinates for GeoTiffs using `xrefcoord` and plotting using `Hvplot Datashader`.
The `Using Virtual Zarr Stores` section contains notebooks demonstrating how to load existing references into `Xarray`, generate coordinates for GeoTIFFs using `xrefcoord`, and plot using `Hvplot` and `Datashader`.

## Running the Notebooks

4 changes: 2 additions & 2 deletions environment.yml
@@ -3,7 +3,7 @@ channels:
- conda-forge
dependencies:
- cfgrib
- dask
- dask>=2024.10.0
- dask-labextension
- datashader
- distributed
@@ -28,12 +28,12 @@ dependencies:
- pooch
- pre-commit
- pyOpenSSL
- python=3.10
- rioxarray
- s3fs
- scipy
- tifffile
- ujson
- virtualizarr
- xarray>=2024.10.0
- zarr
- sphinx-pythia-theme
148 changes: 70 additions & 78 deletions notebooks/advanced/Parquet_Reference_Storage.ipynb
@@ -14,7 +14,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Store Kerchunk Reference Files as Parquet"
"# Store virtual datasets as Kerchunk Parquet references"
]
},
{
@@ -24,18 +24,18 @@
"source": [
"## Overview\n",
" \n",
"In this notebook we will cover how to store Kerchunk references as Parquet files instead of json. For large reference datasets, using Parquet should have performance implications as the overall reference file size should be smaller and the memory overhead of combining the reference files should be lower. \n",
"In this notebook we will cover how to store virtual datasets as Kerchunk Parquet references instead of Kerchunk JSON references. For large virtual datasets, using Parquet should have performance implications as the overall reference file size should be smaller and the memory overhead of combining the reference files should be lower. \n",
"\n",
"\n",
"This notebook builds upon the [Kerchunk Basics](notebooks/foundations/01_kerchunk_basics.ipynb), [Multi-File Datasets with Kerchunk](notebooks/foundations/02_kerchunk_multi_file.ipynb) and the [Kerchunk and Dask](notebooks/foundations/03_kerchunk_dask.ipynb) notebooks. \n",
"\n",
"## Prerequisites\n",
"| Concepts | Importance | Notes |\n",
"| --- | --- | --- |\n",
"| [Kerchunk Basics](../foundations/kerchunk_basics) | Required | Core |\n",
"| [Multiple Files and Kerchunk](../foundations/kerchunk_multi_file) | Required | Core |\n",
"| [Introduction to Xarray](https://foundations.projectpythia.org/core/xarray/xarray-intro.html) | Recommended | IO/Visualization |\n",
"| [Intro to Dask](https://tutorial.dask.org/00_overview.html) | Required | Parallel Processing |\n",
"| [Basics of virtual Zarr stores](../foundations/01_kerchunk_basics.ipynb) | Required | Core |\n",
"| [Multi-file virtual datasets with VirtualiZarr](../foundations/02_kerchunk_multi_file.ipynb) | Required | Core |\n",
"| [Parallel virtual dataset creation with VirtualiZarr, Kerchunk, and Dask](../foundations/03_kerchunk_dask) | Required | Core |\n",
"| [Introduction to Xarray](https://foundations.projectpythia.org/core/xarray/xarray-intro.html) | Required | IO/Visualization |\n",
"\n",
"- **Time to learn**: 30 minutes\n",
"---"
@@ -47,8 +47,6 @@
"metadata": {},
"source": [
"## Imports\n",
"In addition to the previous imports we used throughout the tutorial, we are adding a few imports:\n",
"- `LazyReferenceMapper` and `ReferenceFileSystem` from `fsspec.implementations.reference` for lazy Parquet.\n",
"\n",
"\n"
]
@@ -60,15 +58,12 @@
"outputs": [],
"source": [
"import logging\n",
"import os\n",
"\n",
"import dask\n",
"import fsspec\n",
"import xarray as xr\n",
"from distributed import Client\n",
"from fsspec.implementations.reference import LazyReferenceMapper, ReferenceFileSystem\n",
"from kerchunk.combine import MultiZarrToZarr\n",
"from kerchunk.hdf import SingleHdf5ToZarr"
"from virtualizarr import open_virtual_dataset"
]
},
{
@@ -111,10 +106,28 @@
"files_paths = fs_read.glob(\"s3://smn-ar-wrf/DATA/WRF/DET/2022/12/31/12/*\")\n",
"\n",
"# Here we prepend the prefix 's3://', which points to AWS.\n",
"file_pattern = sorted([\"s3://\" + f for f in files_paths])\n",
"\n",
"# Grab the first seven files to speed up example.\n",
"file_pattern = file_pattern[0:2]"
"files_paths = sorted([\"s3://\" + f for f in files_paths])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Subset the Data\n",
"To speed up our example, lets take a subset of the year of data. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# If the subset_flag == True (default), the list of input files will\n",
"# be subset to speed up the processing\n",
"subset_flag = True\n",
"if subset_flag:\n",
" files_paths = files_paths[0:4]"
]
},
{
@@ -123,7 +136,8 @@
"metadata": {},
"source": [
"# Generate Lazy References\n",
"Below we will create a `fsspec` filesystem to read the references from `s3` and create a function to generate dask delayed tasks."
"\n",
"Here we create a function to generate a list of Dask delayed objects."
]
},
{
@@ -132,20 +146,18 @@
"metadata": {},
"outputs": [],
"source": [
"# Use Kerchunk's `SingleHdf5ToZarr` method to create a `Kerchunk`\n",
"# index from a NetCDF file.\n",
"fs_read = fsspec.filesystem(\"s3\", anon=True, skip_instance_cache=True)\n",
"so = dict(mode=\"rb\", anon=True, default_fill_cache=False, default_cache_type=\"first\")\n",
"\n",
"\n",
"def generate_json_reference(fil):\n",
" with fs_read.open(fil, **so) as infile:\n",
" h5chunks = SingleHdf5ToZarr(infile, fil, inline_threshold=300)\n",
" return h5chunks.translate() # outf\n",
"def generate_virtual_dataset(file, storage_options):\n",
" return open_virtual_dataset(\n",
" file, indexes={}, reader_options={\"storage_options\": storage_options}\n",
" )\n",
"\n",
"\n",
"storage_options = dict(anon=True, default_fill_cache=False, default_cache_type=\"first\")\n",
"# Generate Dask Delayed objects\n",
"tasks = [dask.delayed(generate_json_reference)(fil) for fil in file_pattern]"
"tasks = [\n",
" dask.delayed(generate_virtual_dataset)(file, storage_options)\n",
" for file in files_paths\n",
"]"
]
},
{
@@ -163,7 +175,15 @@
"metadata": {},
"outputs": [],
"source": [
"single_refs = dask.compute(tasks)[0]"
"virtual_datasets = list(dask.compute(*tasks))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Combine virtual datasets using VirtualiZarr"
]
},
{
@@ -172,23 +192,17 @@
"metadata": {},
"outputs": [],
"source": [
"len(single_refs)"
"combined_vds = xr.combine_nested(\n",
" virtual_datasets, concat_dim=[\"time\"], coords=\"minimal\", compat=\"override\"\n",
")\n",
"combined_vds"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Combine In-Memory References with MultiZarrToZarr\n",
"This section will look notably different than the previous examples that have written to `.json`.\n",
"\n",
"In the following code block we are:\n",
"- Creating an `fsspec` filesystem.\n",
"- Create a empty `parquet` file to write to. \n",
"- Creating an `fsspec` `LazyReferenceMapper` to pass into `MultiZarrToZarr`\n",
"- Building a `MultiZarrToZarr` object of the combined references.\n",
"- Calling `.flush()` on our LazyReferenceMapper, to write the combined reference to our `parquet` file."
"## Write the virtual dataset to a Kerchunk Parquet reference"
]
},
{
@@ -197,26 +211,7 @@
"metadata": {},
"outputs": [],
"source": [
"fs = fsspec.filesystem(\"file\")\n",
"\n",
"if os.path.exists(\"combined.parq\"):\n",
" import shutil\n",
"\n",
" shutil.rmtree(\"combined.parq\")\n",
"os.makedirs(\"combined.parq\")\n",
"\n",
"out = LazyReferenceMapper.create(root=\"combined.parq\", fs=fs, record_size=1000)\n",
"\n",
"mzz = MultiZarrToZarr(\n",
" single_refs,\n",
" remote_protocol=\"s3\",\n",
" concat_dims=[\"time\"],\n",
" identical_dims=[\"y\", \"x\"],\n",
" remote_options={\"anon\": True},\n",
" out=out,\n",
").translate()\n",
"\n",
"out.flush()"
"combined_vds.virtualize.to_kerchunk(\"combined.parq\", format=\"parquet\")"
]
},
{
@@ -255,31 +250,28 @@
"metadata": {},
"outputs": [],
"source": [
"fs = ReferenceFileSystem(\n",
"storage_options = {\n",
" \"remote_protocol\": \"s3\",\n",
" \"skip_instance_cache\": True,\n",
" \"remote_options\": {\"anon\": True},\n",
" \"target_protocol\": \"file\",\n",
" \"lazy\": True,\n",
"} # options passed to fsspec\n",
"open_dataset_options = {\"chunks\": {}} # opens passed to xarray\n",
"\n",
"ds = xr.open_dataset(\n",
" \"combined.parq\",\n",
" remote_protocol=\"s3\",\n",
" target_protocol=\"file\",\n",
" lazy=True,\n",
" remote_options={\"anon\": True},\n",
" engine=\"kerchunk\",\n",
" storage_options=storage_options,\n",
" open_dataset_options=open_dataset_options,\n",
")\n",
"ds = xr.open_dataset(\n",
" fs.get_mapper(), engine=\"zarr\", backend_kwargs={\"consolidated\": False}\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ds"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "kerchunk-cookbook-dev",
"display_name": "kerchunk-cookbook",
"language": "python",
"name": "python3"
},
@@ -293,7 +285,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
"version": "3.12.7"
}
},
"nbformat": 4,