From c890147df8b8dfdeeebb4645d3fd8a05fa5d6851 Mon Sep 17 00:00:00 2001
From: Max Jones <14077947+maxrjones@users.noreply.github.com>
Date: Mon, 2 Dec 2024 13:41:54 -0700
Subject: [PATCH] Update notebooks to use VirtualiZarr (#69)

---
 README.md                                     |  44 ++---
 environment.yml                               |   4 +-
 .../advanced/Parquet_Reference_Storage.ipynb  | 148 +++++++--------
 .../foundations/01_kerchunk_basics.ipynb      |  95 ++++------
 .../foundations/02_kerchunk_multi_file.ipynb  | 171 ++++++------------
 notebooks/foundations/03_kerchunk_dask.ipynb  | 114 +++++-------
 notebooks/generating_references/GeoTIFF.ipynb | 131 ++++++--------
 notebooks/generating_references/NetCDF.ipynb  | 124 +++++--------
 8 files changed, 328 insertions(+), 503 deletions(-)

diff --git a/README.md b/README.md
index d2cb32c5..28f38625 100644
--- a/README.md
+++ b/README.md
@@ -1,33 +1,33 @@
 thumbnail

-# Kerchunk Cookbook
+# Virtual Zarr Cookbook (Kerchunk and VirtualiZarr)

 [![nightly-build](https://github.com/ProjectPythia/kerchunk-cookbook/actions/workflows/nightly-build.yaml/badge.svg)](https://github.com/ProjectPythia/kerchunk-cookbook/actions/workflows/nightly-build.yaml)
 [![Binder](https://binder.projectpythia.org/badge_logo.svg)](https://binder.projectpythia.org/v2/gh/ProjectPythia/kerchunk-cookbook/main?labpath=notebooks)
 [![DOI](https://zenodo.org/badge/588661659.svg)](https://zenodo.org/badge/latestdoi/588661659)

-This Project Pythia Cookbook covers using the [Kerchunk](https://fsspec.github.io/kerchunk/)
-library to access archival data formats as if they were
-ARCO (Analysis-Ready-Cloud-Optimized) data.
+This Project Pythia Cookbook covers using the [Kerchunk](https://fsspec.github.io/kerchunk/), [VirtualiZarr](https://virtualizarr.readthedocs.io/en/latest/index.html), and [Zarr-Python](https://zarr.readthedocs.io/en/stable/) libraries to access archival data formats as if they were ARCO (Analysis-Ready-Cloud-Optimized) data.

 ## Motivation

-The `Kerchunk` library allows you to access chunked and compressed
+The `Kerchunk` library pioneered access to chunked and compressed
 data formats (such as NetCDF3, HDF5, GRIB2, TIFF & FITS), many of which
 are the primary data formats for many data archives, as if they were in
 ARCO formats such as Zarr which allows for parallel, chunk-specific
 access. Instead of creating a new copy of the dataset in the Zarr
 spec/format, `Kerchunk` reads through the data archive and extracts the
 byte range and compression information of each
-chunk, then writes that information to a .json file (or alternate
-backends in future releases). For more details on how this process
-works please see this page on the
-[Kerchunk docs](https://fsspec.github.io/kerchunk/detail.html)).
-These summary files can then be combined to generated a `Kerchunk`
-reference for that dataset, which can be read via
-[Zarr](https://zarr.readthedocs.io) and
+chunk, then writes that information to a "virtual Zarr store" using a
+JSON or Parquet "reference file". The `VirtualiZarr`
+library provides a simple way to create these "virtual stores" using familiar
+`xarray` syntax. Lastly, the `icechunk` library provides a new way to store and re-use these references.
+
+These virtual Zarr stores can be re-used and read via [Zarr](https://zarr.readthedocs.io) and
 [Xarray](https://docs.xarray.dev/en/stable/).
+For more details on how this process works please see this page on the
+[Kerchunk docs](https://fsspec.github.io/kerchunk/detail.html). 
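To make that workflow concrete, here is a minimal sketch of the pattern the notebooks in this cookbook build on. It is illustrative only: `example.nc` and `example_refs.json` are placeholder paths, and the Foundations notebooks walk through each step (including the remote-storage options real datasets need) in detail.

```python
# Minimal sketch of the virtual Zarr workflow (illustrative; paths are placeholders).
import xarray as xr
from virtualizarr import open_virtual_dataset

# 1. Scan an archival file and build a virtual dataset that records byte ranges
#    and compression information instead of copying any chunk data.
vds = open_virtual_dataset("example.nc", indexes={})

# 2. Persist those references as a "virtual Zarr store" (a Kerchunk JSON here).
vds.virtualize.to_kerchunk("example_refs.json", format="json")

# 3. Anyone can later open the references as if they were a Zarr dataset.
ds = xr.open_dataset("example_refs.json", engine="kerchunk")
print(ds)
```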
+ ## Authors [Raphael Hagen](https://github.com/norlandrhagen) @@ -48,24 +48,24 @@ the creator of `Kerchunk` and the This cookbook is broken up into two sections, Foundations and Example Notebooks. -### Section 1 Foundations +### Section 1 - Foundations In the `Foundations` section we will demonstrate -how to use `Kerchunk` to create reference sets +how to use `Kerchunk` and `VirtualiZarr` to create reference files from single file sources, as well as to create -multi-file virtual datasets from collections of files. +multi-file virtual Zarr stores from collections of files. -### Section 2 Generating Reference Files +### Section 2 - Generating Virtual Zarr Stores -The notebooks in the `Generating Reference Files` section -demonstrate how to use `Kerchunk` to create +The notebooks in the `Generating Virtual Zarr Stores` section +demonstrates how to use `Kerchunk` and `VirtualiZarr` to create datasets for all the supported file formats. -`Kerchunk` currently supports NetCDF3, -NetCDF4/HDF5, GRIB2, TIFF (including CoG). +These libraries currently support virtualizing NetCDF3, +NetCDF4/HDF5, GRIB2, TIFF (including COG). -### Section 3 Using Pre-Generated References +### Section 3 - Using Virtual Zarr Stores -The `Pre-Generated References` section contains notebooks demonstrating how to load existing references into `Xarray` and `Xarray-Datatree`, generated coordinates for GeoTiffs using `xrefcoord` and plotting using `Hvplot Datashader`. +The `Using Virtual Zarr Stores` section contains notebooks demonstrating how to load existing references into `Xarray`, generating coordinates for GeoTiffs using `xrefcoord`, and plotting using `Hvplot Datashader`. ## Running the Notebooks diff --git a/environment.yml b/environment.yml index 34e2d2b5..5b6c1aba 100644 --- a/environment.yml +++ b/environment.yml @@ -3,7 +3,7 @@ channels: - conda-forge dependencies: - cfgrib - - dask + - dask>=2024.10.0 - dask-labextension - datashader - distributed @@ -28,12 +28,12 @@ dependencies: - pooch - pre-commit - pyOpenSSL - - python=3.10 - rioxarray - s3fs - scipy - tifffile - ujson + - virtualizarr - xarray>=2024.10.0 - zarr - sphinx-pythia-theme diff --git a/notebooks/advanced/Parquet_Reference_Storage.ipynb b/notebooks/advanced/Parquet_Reference_Storage.ipynb index a041ce3d..a323699c 100644 --- a/notebooks/advanced/Parquet_Reference_Storage.ipynb +++ b/notebooks/advanced/Parquet_Reference_Storage.ipynb @@ -14,7 +14,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Store Kerchunk Reference Files as Parquet" + "# Store virtual datasets as Kerchunk Parquet references" ] }, { @@ -24,7 +24,7 @@ "source": [ "## Overview\n", " \n", - "In this notebook we will cover how to store Kerchunk references as Parquet files instead of json. For large reference datasets, using Parquet should have performance implications as the overall reference file size should be smaller and the memory overhead of combining the reference files should be lower. \n", + "In this notebook we will cover how to store virtual datasets as Kerchunk Parquet references instead of Kerchunk JSON references. For large virtual datasets, using Parquet should have performance implications as the overall reference file size should be smaller and the memory overhead of combining the reference files should be lower. 
\n", "\n", "\n", "This notebook builds upon the [Kerchunk Basics](notebooks/foundations/01_kerchunk_basics.ipynb), [Multi-File Datasets with Kerchunk](notebooks/foundations/02_kerchunk_multi_file.ipynb) and the [Kerchunk and Dask](notebooks/foundations/03_kerchunk_dask.ipynb) notebooks. \n", @@ -32,10 +32,10 @@ "## Prerequisites\n", "| Concepts | Importance | Notes |\n", "| --- | --- | --- |\n", - "| [Kerchunk Basics](../foundations/kerchunk_basics) | Required | Core |\n", - "| [Multiple Files and Kerchunk](../foundations/kerchunk_multi_file) | Required | Core |\n", - "| [Introduction to Xarray](https://foundations.projectpythia.org/core/xarray/xarray-intro.html) | Recommended | IO/Visualization |\n", - "| [Intro to Dask](https://tutorial.dask.org/00_overview.html) | Required | Parallel Processing |\n", + "| [Basics of virtual Zarr stores](../foundations/01_kerchunk_basics.ipynb) | Required | Core |\n", + "| [Multi-file virtual datasets with VirtualiZarr](../foundations/02_kerchunk_multi_file.ipynb) | Required | Core |\n", + "| [Parallel virtual dataset creation with VirtualiZarr, Kerchunk, and Dask](../foundations/03_kerchunk_dask) | Required | Core |\n", + "| [Introduction to Xarray](https://foundations.projectpythia.org/core/xarray/xarray-intro.html) | Required | IO/Visualization |\n", "\n", "- **Time to learn**: 30 minutes\n", "---" @@ -47,8 +47,6 @@ "metadata": {}, "source": [ "## Imports\n", - "In addition to the previous imports we used throughout the tutorial, we are adding a few imports:\n", - "- `LazyReferenceMapper` and `ReferenceFileSystem` from `fsspec.implementations.reference` for lazy Parquet.\n", "\n", "\n" ] @@ -60,15 +58,12 @@ "outputs": [], "source": [ "import logging\n", - "import os\n", "\n", "import dask\n", "import fsspec\n", "import xarray as xr\n", "from distributed import Client\n", - "from fsspec.implementations.reference import LazyReferenceMapper, ReferenceFileSystem\n", - "from kerchunk.combine import MultiZarrToZarr\n", - "from kerchunk.hdf import SingleHdf5ToZarr" + "from virtualizarr import open_virtual_dataset" ] }, { @@ -111,10 +106,28 @@ "files_paths = fs_read.glob(\"s3://smn-ar-wrf/DATA/WRF/DET/2022/12/31/12/*\")\n", "\n", "# Here we prepend the prefix 's3://', which points to AWS.\n", - "file_pattern = sorted([\"s3://\" + f for f in files_paths])\n", - "\n", - "# Grab the first seven files to speed up example.\n", - "file_pattern = file_pattern[0:2]" + "files_paths = sorted([\"s3://\" + f for f in files_paths])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Subset the Data\n", + "To speed up our example, lets take a subset of the year of data. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# If the subset_flag == True (default), the list of input files will\n", + "# be subset to speed up the processing\n", + "subset_flag = True\n", + "if subset_flag:\n", + " files_paths = files_paths[0:4]" ] }, { @@ -123,7 +136,8 @@ "metadata": {}, "source": [ "# Generate Lazy References\n", - "Below we will create a `fsspec` filesystem to read the references from `s3` and create a function to generate dask delayed tasks." + "\n", + "Here we create a function to generate a list of Dask delayed objects." 
] }, { @@ -132,20 +146,18 @@ "metadata": {}, "outputs": [], "source": [ - "# Use Kerchunk's `SingleHdf5ToZarr` method to create a `Kerchunk`\n", - "# index from a NetCDF file.\n", - "fs_read = fsspec.filesystem(\"s3\", anon=True, skip_instance_cache=True)\n", - "so = dict(mode=\"rb\", anon=True, default_fill_cache=False, default_cache_type=\"first\")\n", - "\n", - "\n", - "def generate_json_reference(fil):\n", - " with fs_read.open(fil, **so) as infile:\n", - " h5chunks = SingleHdf5ToZarr(infile, fil, inline_threshold=300)\n", - " return h5chunks.translate() # outf\n", + "def generate_virtual_dataset(file, storage_options):\n", + " return open_virtual_dataset(\n", + " file, indexes={}, reader_options={\"storage_options\": storage_options}\n", + " )\n", "\n", "\n", + "storage_options = dict(anon=True, default_fill_cache=False, default_cache_type=\"first\")\n", "# Generate Dask Delayed objects\n", - "tasks = [dask.delayed(generate_json_reference)(fil) for fil in file_pattern]" + "tasks = [\n", + " dask.delayed(generate_virtual_dataset)(file, storage_options)\n", + " for file in files_paths\n", + "]" ] }, { @@ -163,7 +175,15 @@ "metadata": {}, "outputs": [], "source": [ - "single_refs = dask.compute(tasks)[0]" + "virtual_datasets = list(dask.compute(*tasks))" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Combine virtual datasets using VirtualiZarr" ] }, { @@ -172,23 +192,17 @@ "metadata": {}, "outputs": [], "source": [ - "len(single_refs)" + "combined_vds = xr.combine_nested(\n", + " virtual_datasets, concat_dim=[\"time\"], coords=\"minimal\", compat=\"override\"\n", + ")\n", + "combined_vds" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "## Combine In-Memory References with MultiZarrToZarr\n", - "This section will look notably different than the previous examples that have written to `.json`.\n", - "\n", - "In the following code block we are:\n", - "- Creating an `fsspec` filesystem.\n", - "- Create a empty `parquet` file to write to. \n", - "- Creating an `fsspec` `LazyReferenceMapper` to pass into `MultiZarrToZarr`\n", - "- Building a `MultiZarrToZarr` object of the combined references.\n", - "- Calling `.flush()` on our LazyReferenceMapper, to write the combined reference to our `parquet` file." 
+ "## Write the virtual dataset to a Kerchunk Parquet reference" ] }, { @@ -197,26 +211,7 @@ "metadata": {}, "outputs": [], "source": [ - "fs = fsspec.filesystem(\"file\")\n", - "\n", - "if os.path.exists(\"combined.parq\"):\n", - " import shutil\n", - "\n", - " shutil.rmtree(\"combined.parq\")\n", - "os.makedirs(\"combined.parq\")\n", - "\n", - "out = LazyReferenceMapper.create(root=\"combined.parq\", fs=fs, record_size=1000)\n", - "\n", - "mzz = MultiZarrToZarr(\n", - " single_refs,\n", - " remote_protocol=\"s3\",\n", - " concat_dims=[\"time\"],\n", - " identical_dims=[\"y\", \"x\"],\n", - " remote_options={\"anon\": True},\n", - " out=out,\n", - ").translate()\n", - "\n", - "out.flush()" + "combined_vds.virtualize.to_kerchunk(\"combined.parq\", format=\"parquet\")" ] }, { @@ -255,31 +250,28 @@ "metadata": {}, "outputs": [], "source": [ - "fs = ReferenceFileSystem(\n", + "storage_options = {\n", + " \"remote_protocol\": \"s3\",\n", + " \"skip_instance_cache\": True,\n", + " \"remote_options\": {\"anon\": True},\n", + " \"target_protocol\": \"file\",\n", + " \"lazy\": True,\n", + "} # options passed to fsspec\n", + "open_dataset_options = {\"chunks\": {}} # opens passed to xarray\n", + "\n", + "ds = xr.open_dataset(\n", " \"combined.parq\",\n", - " remote_protocol=\"s3\",\n", - " target_protocol=\"file\",\n", - " lazy=True,\n", - " remote_options={\"anon\": True},\n", + " engine=\"kerchunk\",\n", + " storage_options=storage_options,\n", + " open_dataset_options=open_dataset_options,\n", ")\n", - "ds = xr.open_dataset(\n", - " fs.get_mapper(), engine=\"zarr\", backend_kwargs={\"consolidated\": False}\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ "ds" ] } ], "metadata": { "kernelspec": { - "display_name": "kerchunk-cookbook-dev", + "display_name": "kerchunk-cookbook", "language": "python", "name": "python3" }, @@ -293,7 +285,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.9" + "version": "3.12.7" } }, "nbformat": 4, diff --git a/notebooks/foundations/01_kerchunk_basics.ipynb b/notebooks/foundations/01_kerchunk_basics.ipynb index 67b72dbe..7a51cd48 100644 --- a/notebooks/foundations/01_kerchunk_basics.ipynb +++ b/notebooks/foundations/01_kerchunk_basics.ipynb @@ -13,7 +13,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Kerchunk Basics" + "# Basics of virtual Zarr stores" ] }, { @@ -22,12 +22,12 @@ "source": [ "## Overview\n", "\n", - "This notebook is intended as an introduction to using `Kerchunk`.\n", + "This notebook is intended as an introduction to creating and using virtual Zarr stores.\n", "In this tutorial we will:\n", - "- Scan a single NetCDF4/HDF5 file to create a `Kerchunk` virtual dataset\n", - "- Learn how to use the output using `Xarray` and `fsspec`.\n", + "- Scan a single NetCDF4/HDF5 file to create a virtual dataset\n", + "- Learn how to use the output using `Xarray` and `Zarr`.\n", "\n", - "While this notebook only examines using `Kerchunk` on a single NetCDF file, `Kerchunk` can be used to create virtual `Zarr` datasets from collections of many input files. In the following notebook, we will demonstrate this. " + "While this notebook only examines using `VirtualiZarr` and `Kerchunk` on a single NetCDF file, these libraries can be used to create virtual `Zarr` datasets from collections of many input files. In the following notebook, we will demonstrate this. 
" ] }, { @@ -50,10 +50,8 @@ "## Imports\n", "\n", "Here we will import a few `Python` libraries to help with our data processing. \n", - "- `fsspec` will be used to read remote and local filesystems. \n", - "- `kerchunk.hdf` will be used to read a NetCDF file and create a `Kerchunk` reference set.\n", - "- `ujson` for writing the `Kerchunk` output to the `.json` file format.\n", - "- `Xarray` for examining out output dataset\n" + "- `virtualizarr` will be used to generate the virtual Zarr store\n", + "- `Xarray` for examining the output dataset\n" ] }, { @@ -62,26 +60,22 @@ "metadata": {}, "outputs": [], "source": [ - "import fsspec\n", - "import kerchunk.hdf\n", - "import ujson\n", - "import xarray as xr" + "import xarray as xr\n", + "from virtualizarr import open_virtual_dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Define kwargs for `fsspec`\n", + "### Define storage_options arguments\n", "\n", - "In the dictionary definition in the next cell, we are passing options to [`fsspec.open`](https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=.open#fsspec.open). Any additional kwargs passed in this dictionary through `fsspec.open` will pass as kwargs to the file system, in our case `s3`. The API docs for the `s3fs` filesystem spec can be found [here](https://s3fs.readthedocs.io/en/latest/api.html).\n", + "In the dictionary definition in the next cell, we are defining the options that will be passed to [`fsspec.open`](https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=.open#fsspec.open). Any additional kwargs passed in this dictionary through `fsspec.open` will pass as kwargs to the file system, in our case `s3`. The API docs for the `s3fs` filesystem spec can be found [here](https://s3fs.readthedocs.io/en/latest/api.html).\n", "\n", "In this example we are passing a few kwargs. In short they are:\n", "- `anon=True`: This is a `s3fs` kwarg that specifies you are not passing any connection credentials and are connecting to a public bucket.\n", "- `default_fill_cache=False`: `s3fs` kwarg that avoids caching in between chunks of files. This may lower memory usage when reading large files.\n", - "- `default_cache_type=\"first\"`: `fsspec` kwarg that specifies the caching strategy used by `fsspec`. In this case, `first` caches the first block of a file only.\n", - "\n", - "Don't worry too much about the details here; the cache options are those that have typically proven efficient for HDF5 files. " + "- `default_cache_type=\"none\"`: `fsspec` kwarg that specifies the caching strategy used by `fsspec`. In this case, we turn off caching entirely to lower memory usage when only using the information from the file once..\n" ] }, { @@ -90,23 +84,20 @@ "metadata": {}, "outputs": [], "source": [ - "so = dict(anon=True, default_fill_cache=False, default_cache_type=\"first\")" + "storage_options = dict(anon=True, default_fill_cache=False, default_cache_type=\"none\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Parse a single NetCDF file with kerchunk\n", + "### Virtualize a single NetCDF file\n", "\n", - "Below we will access a NetCDF file stored on the AWS cloud. This dataset is a single time slice of a climate downscaled product for Alaska.\n", + "Below we will virtualize a NetCDF file stored on the AWS cloud. This dataset is a single time slice of a climate downscaled product for Alaska.\n", "\n", "The steps in the cell below are as follows:\n", - "1. 
Define the url that points to the `NetCDF` file we want to process\n",
-    "1. Use `fsspec.open` along with the dictionary of arguments we created (`so`) to open the URL pointing to the NetCDF file.\n",
-    "1. Use `kerchunk.hdf.SingleHdf5ToZarr` method to read through the `NetCDF` file and extract the byte ranges, compression information and metadata.\n",
-    "1. Use `Kerchunk's` `.translate` method on the output from the `kerchunk.hdf.SingleHdf5ToZarr` to translate content of the NetCDF file into the `Zarr` format.\n",
-    "1. Create a `.json` file named `single_file_kerchunk.json` and write the dataset information to disk.\n"
+    "1. Create a virtual dataset using `open_virtual_dataset`\n",
+    "1. Write the virtual store as a Kerchunk reference JSON using the `to_kerchunk` method.\n"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "# Input URL to dataset. Note this is a netcdf file stored on s3 (cloud dataset).\n",
     "url = \"s3://wrf-se-ak-ar5/ccsm/rcp85/daily/2060/WRFDS_2060-01-01.nc\"\n",
     "\n",
-    "# Uses kerchunk to scan through the netcdf file to create kerchunk mapping and\n",
-    "# then save output as .json.\n",
-    "# Note: In this example, we write the kerchunk output to a .json file.\n",
-    "# You could also keep this information in memory and pass it to fsspec\n",
-    "with fsspec.open(url, **so) as inf:\n",
-    "    h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, url, inline_threshold=100)\n",
-    "    h5chunks.translate()\n",
-    "    with open(\"single_file_kerchunk.json\", \"wb\") as f:\n",
-    "        f.write(ujson.dumps(h5chunks.translate()).encode())"
+    "\n",
+    "# Create a virtual dataset using VirtualiZarr.\n",
+    "# We specify `indexes={}` to avoid creating in-memory pandas indexes for each 1D coordinate, since concatenating with pandas indexes is not yet supported in VirtualiZarr\n",
+    "virtual_ds = open_virtual_dataset(\n",
+    "    url, indexes={}, reader_options={\"storage_options\": storage_options}\n",
+    ")\n",
+    "# Write the virtual dataset to disk as a Kerchunk JSON. We could alternatively write to a Kerchunk Parquet reference or an Icechunk store.\n",
+    "virtual_ds.virtualize.to_kerchunk(\"single_file_kerchunk.json\", format=\"json\")"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### Load `Kerchunk` Reference File\n",
+    "### Opening virtual datasets\n",
     "\n",
-    "In the section below we will use `fsspec.filesystem` along with the `Kerchunk` `.json` reference file to open the `NetCDF` file as if it were a `Zarr` dataset."
+    "In the section below we will use the previously created `Kerchunk` reference JSON to open the `NetCDF` file as if it were a `Zarr` dataset." 
] }, { @@ -144,26 +134,24 @@ "metadata": {}, "outputs": [], "source": [ - "# use fsspec to create filesystem from .json reference file\n", - "fs = fsspec.filesystem(\n", - " \"reference\",\n", - " fo=\"single_file_kerchunk.json\",\n", - " remote_protocol=\"s3\",\n", - " remote_options=dict(anon=True),\n", - " skip_instance_cache=True,\n", + "# We once again need to provide information for fsspec to access the remote file\n", + "storage_options = dict(\n", + " remote_protocol=\"s3\", remote_options=dict(anon=True), skip_instance_cache=True\n", ")\n", - "\n", - "# load kerchunked dataset with xarray\n", + "# We will use the \"kerchunk\" engine in `xr.open_dataset` and pass the `storage_options` to the `kerchunk` engine through `backend_kwargs`\n", "ds = xr.open_dataset(\n", - " fs.get_mapper(\"\"), engine=\"zarr\", backend_kwargs={\"consolidated\": False}\n", - ")" + " \"single_file_kerchunk.json\",\n", + " engine=\"kerchunk\",\n", + " backend_kwargs={\"storage_options\": storage_options},\n", + ")\n", + "ds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Plot Dataset" + "### Plot dataset" ] }, { @@ -192,7 +180,7 @@ ], "metadata": { "kernelspec": { - "display_name": "kerchunk-cookbook-dev", + "display_name": "kerchunk-cookbook", "language": "python", "name": "python3" }, @@ -206,12 +194,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.12" - }, - "vscode": { - "interpreter": { - "hash": "89095a95fbc59e1db286735bee0073a08e46abd63daa66f53634eb5c8cc2192a" - } + "version": "3.12.7" } }, "nbformat": 4, diff --git a/notebooks/foundations/02_kerchunk_multi_file.ipynb b/notebooks/foundations/02_kerchunk_multi_file.ipynb index 5942aa75..176aabff 100644 --- a/notebooks/foundations/02_kerchunk_multi_file.ipynb +++ b/notebooks/foundations/02_kerchunk_multi_file.ipynb @@ -11,7 +11,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Multi-File Datasets with Kerchunk" + "# Multi-file virtual datasets with VirtualiZarr" ] }, { @@ -20,13 +20,13 @@ "source": [ "## Overview\n", "\n", - "This notebook is intends to build off of the [Kerchunk Basics notebook](./kerchunk_basics.ipynb).\n", + "This notebook is intends to build off of the [Basics of virtual Zarr stores](./01_kerchunk_basics.ipynb).\n", "\n", "In this tutorial we will:\n", "- Create a list of input paths for a collection of NetCDF files stored on the cloud.\n", - "- Iterate through our file input list and create `Kerchunk` reference `.jsons` for each file.\n", - "- Combine the reference `.jsons` into a single combined dataset reference with the rechunker class, `MultiZarrToZarr`\n", - "- Learn how to read the combined dataset using [`Xarray`](https://docs.xarray.dev/en/stable/) and [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/).\n" + "- Create virtual datasets for each input datasets\n", + "- Combine the virtual datasets using combine_nested\n", + "- Read the combined dataset using [`Xarray`](https://docs.xarray.dev/en/stable/) and [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/).\n" ] }, { @@ -36,7 +36,7 @@ "## Prerequisites\n", "| Concepts | Importance | Notes |\n", "| --- | --- | --- |\n", - "| [Kerchunk Basics](./kerchunk_basics.ipynb) | Required | Basic features |\n", + "| [Basics of virtual Zarr stores](./01_kerchunk_basics.ipynb) | Required | Basic features |\n", "| [Introduction to Xarray](https://foundations.projectpythia.org/core/xarray/xarray-intro.html) | Recommended | IO |\n", "\n", "- **Time to learn**: 60 minutes\n", @@ 
-66,12 +66,10 @@ "metadata": {}, "source": [ "## Imports\n", - "In our imports block we are using similar imports to the [Kerchunk Basics Tutorial](./kerchunk_basics.ipynb), with a few libraries added.\n", + "In our imports block we are using similar imports to the [Basics of virtual Zarr stores tutorial](./01_kerchunk_basics.ipynb):\n", "- `fsspec` for reading and writing to remote file systems\n", - "- `ujson` for writing `Kerchunk` reference files as `.json`\n", - "- `Xarray` for visualizing and examining our datasets\n", - "- `Kerchunk's` `SingleHdf5ToZarr` and `MultiZarrToZarr` classes. \n", - "- `tqdm` for timing cell progress\n", + "- `virtualizarr` will be used to generate the virtual Zarr store\n", + "- `Xarray` for examining the output dataset\n", "\n" ] }, @@ -81,14 +79,9 @@ "metadata": {}, "outputs": [], "source": [ - "from tempfile import TemporaryDirectory\n", - "\n", "import fsspec\n", - "import ujson\n", "import xarray as xr\n", - "from kerchunk.combine import MultiZarrToZarr\n", - "from kerchunk.hdf import SingleHdf5ToZarr\n", - "from tqdm import tqdm" + "from virtualizarr import open_virtual_dataset" ] }, { @@ -97,14 +90,11 @@ "source": [ "### Create a File Pattern from a list of input NetCDF files\n", "\n", - "Below we will create a list of input files we want `Kerchunk` to index. In the [Kerchunk Basics Tutorial](./kerchunk_basics.ipynb), we looked at a single file of climate downscaled data over Southern Alaska. In this example, we will build off of that work and use `Kerchunk` to combine multiple NetCDF files of this dataset into a virtual dataset that can be read as if it were a `Zarr` store - without copying any data.\n", - "\n", - "Specifically, in the cell below, we use `fsspec` to create a `s3` filesystem to read the `NetCDF` files and a local file system to write our reference files to. You can, alternatively, write to a cloud filesystem instead of a local one, or even just keep the reference sets in temporary memory without writing at all.\n", + "Below we will create a list of input files we want to virtualize. In the [Basics of virtual Zarr stores tutorial](./01_kerchunk_basics.ipynb), we looked at a single file of climate downscaled data over Southern Alaska. In this example, we will build off of that work and use `Kerchunk` and `VirtualiZarr` to combine multiple NetCDF files of this dataset into a virtual dataset that can be read as if it were a `Zarr` store - without copying any data.\n", "\n", - "After that, we use the `fsspec` **fs_read** `s3` filesystem's *glob* method to create a list of files matching a file pattern. We supply the base url of `s3://wrf-se-ak-ar5/ccsm/rcp85/daily/2060/`, which is pointing to an `AWS` public bucket, for daily rcp85 ccsm downscaled data for the year 2060. After this base url, we tacked on *`*`*, which acts as a wildcard for all the files in the directory. We should expect 365 daily `NetCDF` files.\n", + "We use the `fsspec` `s3` filesystem's *glob* method to create a list of files matching a file pattern. We supply the base url of `s3://wrf-se-ak-ar5/ccsm/rcp85/daily/2060/`, which is pointing to an `AWS` public bucket, for daily rcp85 ccsm downscaled data for the year 2060. After this base url, we tacked on *`*`*, which acts as a wildcard for all the files in the directory. We should expect 365 daily `NetCDF` files.\n", "\n", - "Finally, we are appending the string `s3://` to the list of return files. 
This will ensure the list of files we get back are `s3` urls and can be read by `Kerchunk`.\n", - "\n" + "Finally, we are appending the string `s3://` to the list of return files. This will ensure the list of files we get back are `s3` urls and can be read by `VirtualiZarr` and `Kerchunk`." ] }, { @@ -120,7 +110,7 @@ "files_paths = fs_read.glob(\"s3://wrf-se-ak-ar5/ccsm/rcp85/daily/2060/*\")\n", "\n", "# Here we prepend the prefix 's3://', which points to AWS.\n", - "file_pattern = sorted([\"s3://\" + f for f in files_paths])" + "files_paths = sorted([\"s3://\" + f for f in files_paths])" ] }, { @@ -136,7 +126,7 @@ "metadata": {}, "outputs": [], "source": [ - "print(f\"{len(file_pattern)} file paths were retrieved.\")" + "print(f\"{len(files_paths)} file paths were retrieved.\")" ] }, { @@ -148,7 +138,7 @@ "# If the subset_flag == True (default), the list of input files will\n", "# be subset to speed up the processing\n", "if subset_flag:\n", - " file_pattern = file_pattern[0:4]" + " files_paths = files_paths[0:4]" ] }, { @@ -176,9 +166,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Create `Kerchunk` References for every file in the `File_Pattern` list\n", + "## Create virtual datasets for every file in the `File_Pattern` list\n", "\n", - "Now that we have a list of NetCDF files, we can use `Kerchunk` to create reference files for each one of these. To do this, we will iterate through each file and create a reference `.json`. To speed this process up, you could use `Dask` to parallelize this." + "Now that we have a list of NetCDF files, we can use `VirtualiZarr` to create virtual datasets for each one of these." ] }, { @@ -187,7 +177,7 @@ "metadata": {}, "source": [ "### Define kwargs for `fsspec`\n", - "In the cell below, we are creating a dictionary of `kwargs` to pass to `fsspec` and the `s3` filesystem. Details on this can be found in the [Kerchunk Basics Tutorial](./kerchunk_basics.ipynb) in the **```(Define kwargs for fsspec)```** section. In addition, we are creating a temporary directory to store our reference files in." + "In the cell below, we are creating a dictionary of `kwargs` to pass to `fsspec` and the `s3` filesystem. Details on this can be found in the [Basics of virtual Zarr stores tutorial](./01_kerchunk_basics.ipynb) in the **Define storage_options arguments** section" ] }, { @@ -196,14 +186,7 @@ "metadata": {}, "outputs": [], "source": [ - "so = dict(mode=\"rb\", anon=True, default_fill_cache=False, default_cache_type=\"first\")\n", - "output_dir = \"./\"\n", - "\n", - "# We are creating a temporary directory to store the .json reference\n", - "# files. Alternately, you could write these to cloud storage.\n", - "td = TemporaryDirectory()\n", - "temp_dir = td.name\n", - "temp_dir" + "storage_options = dict(anon=True, default_fill_cache=False, default_cache_type=\"none\")" ] }, { @@ -227,60 +210,21 @@ "metadata": {}, "outputs": [], "source": [ - "# Use Kerchunk's `SingleHdf5ToZarr` method to create a `Kerchunk` index\n", - "# from a NetCDF file.\n", - "\n", - "\n", - "def generate_json_reference(u, temp_dir: str):\n", - " with fs_read.open(u, **so) as infile:\n", - " h5chunks = SingleHdf5ToZarr(infile, u, inline_threshold=300)\n", - " fname = u.split(\"/\")[-1].strip(\".nc\")\n", - " outf = f\"{fname}.json\"\n", - " with open(outf, \"wb\") as f:\n", - " f.write(ujson.dumps(h5chunks.translate()).encode())\n", - " return outf\n", - "\n", - "\n", - "# Iterate through filelist to generate Kerchunked files. 
Good use for `Dask`\n", - "output_files = []\n", - "for fil in tqdm(file_pattern):\n", - " outf = generate_json_reference(fil, temp_dir)\n", - " output_files.append(outf)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Here we can view the generated list of output `Kerchunk` reference files" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "output_files" + "virtual_datasets = [\n", + " open_virtual_dataset(\n", + " filepath, indexes={}, reader_options={\"storage_options\": storage_options}\n", + " )\n", + " for filepath in files_paths\n", + "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Combine `.json` `Kerchunk` reference files and write a combined `Kerchunk` reference dataset.\n", - "\n", - "After we have generated a `Kerchunk` reference file for each `NetCDF` file, we can combine these into a single virtual dataset using `Kerchunk's` `MultiZarrToZarr` method. \n", - "\n", - "Note that it is not strictly necessary write the reference sets of the individual input files to JSON, or to save these for later. However, in typical workflows, it may be useful to access individual datasets, or to repeat the combine step below in new ways, so we recommend writing and keeping these files.\n", + "## Combine virtual datasets\n", "\n", - "In our example below we are passing in our list of reference files (`output_files`), along with `concat_dims` and `identical_dims`.\n", - "- `concat_dims` should be a list of the name(s) of the dimensions(s) that you want to concatenate along. In our example, our input files were single time steps. Because of this, we will concatenate along the `Time` axis only. \n", - "- `identical_dims` are variables that are shared across all the input files. They should not vary across the files.\n", - "\n", - "After using `MultiZarrToZarr` to combine the reference files, we will call `.translate()` to store this combined reference dataset into memory. Note: by passing `filename` to `.translate()`, you can write the combined `Kerchunk` multi-file dataset to disk as a `.json` file, but we choose to do this as an explicit separate step.\n", - "\n", - "ex: ```mzz.translate(filename='combined_reference.json')```\n" + "After we have generated a virtual dataset for each `NetCDF` file, we can combine these into a single virtual dataset using Xarray's `combine_nested` function.\n" ] }, { @@ -289,22 +233,17 @@ "metadata": {}, "outputs": [], "source": [ - "# combine individual references into single consolidated reference\n", - "mzz = MultiZarrToZarr(\n", - " output_files,\n", - " concat_dims=[\"Time\"],\n", - " identical_dims=[\"XLONG\", \"XLAT\", \"interp_levels\"],\n", + "combined_vds = xr.combine_nested(\n", + " virtual_datasets, concat_dim=[\"Time\"], coords=\"minimal\", compat=\"override\"\n", ")\n", - "\n", - "\n", - "multi_kerchunk = mzz.translate()" + "combined_vds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Write combined kerchunk index for future use\n", + "## Write combined virtual dataset to a Kerchunk JSON for future use\n", "If we want to keep the combined reference information in memory as well as write the file to `.json`, we can run the code snippet below." 
] }, @@ -316,8 +255,7 @@ "source": [ "# Write kerchunk .json record\n", "output_fname = \"combined_kerchunk.json\"\n", - "with open(f\"{output_fname}\", \"wb\") as f:\n", - " f.write(ujson.dumps(multi_kerchunk).encode())" + "combined_vds.virtualize.to_kerchunk(output_fname, format=\"json\")" ] }, { @@ -326,22 +264,17 @@ "source": [ "## Using the output\n", "\n", - "Now that we have built a virtual dataset using `Kerchunk`, we can read all of those original `NetCDF` files as if they were a single `Zarr` dataset. \n", + "Now that we have built a virtual dataset using `VirtualiZarr` and `Kerchunk`, we can read all of those original `NetCDF` files as if they were a single `Zarr` dataset. \n", "\n", "\n", - "**Since we saved the combined reference `.json` file, this work doesn't have to be repeated for anyone else to use this dataset. All they need is to pass the combined reference file to `Xarray` and it is as if they had a `Zarr` dataset! The cells below here no longer need kerchunk.** " + "**Since we saved the combined virtual dataset, this work doesn't have to be repeated for anyone else to use this dataset. All they need is to pass the reference file storing the virtual dataset to `Xarray` and it is as if they had a `Zarr` dataset!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Open combined `Kerchunk` dataset with `fsspec` and `Xarray`\n", - "\n", - "Below we are using the result of the `MultiZarrtoZarr` method as input to a `fsspec` filesystem. `Fsspec` can read this `Kerchunk` reference file as if it were a `Zarr` dataset.\n", - "\n", - "- `fsspec.filesystem` creates a remote filesystem using the combined reference, along with arguments to specify which type of filesystem it's reading from `s3` as well as some kwargs for `s3`, such as `remote_options`. Replace `multi_kerchunk` with `\"combined_kerchunk.json\"` if you are starting here.\n", - "- We can pass the `fsspec.filesystems` mapper object to `Xarray` to open the combined reference recipe as if it were a `Zarr` dataset. 
\n" + "### Open combined virtual dataset with Kerchunk\n" ] }, { @@ -350,12 +283,16 @@ "metadata": {}, "outputs": [], "source": [ - "# open dataset as zarr object using fsspec reference file system and Xarray\n", - "fs = fsspec.filesystem(\n", - " \"reference\", fo=multi_kerchunk, remote_protocol=\"s3\", remote_options={\"anon\": True}\n", + "# We once again need to provide information for fsspec to access the remote file\n", + "storage_options = dict(\n", + " remote_protocol=\"s3\", remote_options=dict(anon=True), skip_instance_cache=True\n", + ")\n", + "# We will use the \"kerchunk\" engine in `xr.open_dataset` and pass the `storage_options` to the `kerchunk` engine through `backend_kwargs`\n", + "ds = xr.open_dataset(\n", + " output_fname,\n", + " engine=\"kerchunk\",\n", + " backend_kwargs={\"storage_options\": storage_options},\n", ")\n", - "m = fs.get_mapper(\"\")\n", - "ds = xr.open_dataset(m, engine=\"zarr\", backend_kwargs=dict(consolidated=False))\n", "ds" ] }, @@ -376,11 +313,18 @@ "source": [ "ds.isel(Time=0).SNOW.plot()" ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": { "kernelspec": { - "display_name": "kerchunk-cookbook-dev", + "display_name": "kerchunk-cookbook", "language": "python", "name": "python3" }, @@ -394,14 +338,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.13" - }, - "vscode": { - "interpreter": { - "hash": "89095a95fbc59e1db286735bee0073a08e46abd63daa66f53634eb5c8cc2192a" - } + "version": "3.12.7" } }, "nbformat": 4, - "nbformat_minor": 2 + "nbformat_minor": 4 } diff --git a/notebooks/foundations/03_kerchunk_dask.ipynb b/notebooks/foundations/03_kerchunk_dask.ipynb index 2302eec8..c0e5ca68 100644 --- a/notebooks/foundations/03_kerchunk_dask.ipynb +++ b/notebooks/foundations/03_kerchunk_dask.ipynb @@ -14,7 +14,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Kerchunk and Dask" + "# Parallel virtual dataset creation with VirtualiZarr, Kerchunk, and Dask" ] }, { @@ -26,17 +26,16 @@ " \n", "In this notebook we will cover:\n", "\n", - "1. How to parallelize the creation of `Kerchunk` reference files using the `Dask` library.\n", - "1. How to scale up the creation of large `Kerchunk` datasets using `Dask`. \n", + "1. How to parallelize the creation of virtual datasets using the `Dask` library.\n", "\n", - "This notebook builds upon the [Kerchunk Basics](./kerchunk_basics.ipynb) and the [Multi-File Datasets with Kerchunk](../foundations/kerchunk_multi_file.ipynb) notebooks. A basic understanding of `Dask` will be helpful, but is not required. This notebook is not intended as a tutorial for using `Dask`, but will show how to use `Dask` to greatly speedup the the generation of `Kerchunk` reference files. \n", + "This notebook builds upon the [Basics of virtual Zarr stores](./01_kerchunk_basics.ipynb) and the [Multi-file virtual datasets with VirtualiZarr](./02_kerchunk_multi_file.ipynb) notebooks. A basic understanding of `Dask` will be helpful, but is not required. 
This notebook is not intended as a tutorial for using `Dask`, but will show how to use `Dask` to greatly speed up the generation of virtual datasets.\n",
    "\n",
    "\n",
    "## Prerequisites\n",
    "| Concepts | Importance | Notes |\n",
    "| --- | --- | --- |\n",
-    "| [Kerchunk Basics](../foundations/kerchunk_basics) | Required | Core |\n",
-    "| [Multiple Files and Kerchunk](../foundations/kerchunk_multi_file) | Required | Core |\n",
+    "| [Basics of virtual Zarr stores](./01_kerchunk_basics.ipynb) | Required | Core |\n",
+    "| [Multi-file virtual datasets with VirtualiZarr](./02_kerchunk_multi_file.ipynb) | Required | Core |\n",
    "| [Introduction to Xarray](https://foundations.projectpythia.org/core/xarray/xarray-intro.html) | Recommended | IO/Visualization |\n",
    "| [Intro to Dask](https://tutorial.dask.org/00_overview.html) | Recommended | Parallel Processing |\n",
    "\n",
@@ -51,13 +50,13 @@
    "source": [
     "## Dask and Parallel Processing\n",
     "\n",
-    "Dask is a `Python` library for parallel computing. It plays well with `Xarray`, but can be used in many ways across the `Python` ecosystem. This notebook is not intended to be a guide for how to use `Dask`, but just an example of how to use `Dask` to parallelize some `Kerchunk` functionality. \n",
+    "Dask is a `Python` library for parallel computing. It plays well with `Xarray`, but can be used in many ways across the `Python` ecosystem. This notebook is not intended to be a guide for how to use `Dask`, but just an example of how to use `Dask` to parallelize some `VirtualiZarr` and `Kerchunk` functionality. \n",
    "\n",
-    "In the previous notebook [Multiple Files and Kerchunk](../foundations/kerchunk_multi_file), we were looking at daily downscaled climate data over South-Eastern Alaska. In our function named `generate_json_reference`, we were looping over, one at a time, the input NetCDF4 files and using `Kerchunk's` `SingleHdf5ToZarr` method to create a `Kerchunk` index for each file.\n",
+    "In the previous notebook [Multi-file virtual datasets with VirtualiZarr](./02_kerchunk_multi_file.ipynb), we were looking at daily downscaled climate data over South-Eastern Alaska. We created a virtual dataset for each input file using list comprehension in Python.\n",
    "\n",
-    "With `Dask`, we can call `SingleHdf5ToZarr` in parallel, which allows us to create multiple `Kerchunk` reference files at the same time.\n",
+    "With `Dask`, we can call `open_virtual_dataset` in parallel, which allows us to create multiple virtual datasets at the same time.\n",
    "\n",
-    "Further on in this notebook, we will show how using `Dask` can greatly speed-up the process of creating a virtual dataset using `Kerchunk`\n",
+    "Further on in this notebook, we will show how using `Dask` can greatly speed up the process of creating virtual datasets.\n",
    "\n"
   ]
  },
@@ -118,7 +117,7 @@
   "metadata": {},
   "source": [
    "## Building off of our Previous Work\n",
-    "In the next section, we will re-use some of the code from [Multiple Files and Kerchunk](../foundations/kerchunk_multi_file) notebook. However, we will modify it slightly to make it compatible with `Dask`.\n",
+    "In the next section, we will re-use some of the code from the [multi-file virtual datasets with VirtualiZarr](./02_kerchunk_multi_file.ipynb) notebook. However, we will modify it slightly to make it compatible with `Dask`.\n",
    "\n",
    "The following two cells should look the same as before. 
As a reminder we are importing the required libraries, using `fsspec` to create a list of our input files and setting up some kwargs for `fsspec` to use. " ] @@ -129,12 +128,9 @@ "metadata": {}, "outputs": [], "source": [ - "from tempfile import TemporaryDirectory\n", - "\n", "import dask\n", "import fsspec\n", - "import ujson\n", - "from kerchunk.hdf import SingleHdf5ToZarr" + "from virtualizarr import open_virtual_dataset" ] }, { @@ -150,17 +146,7 @@ "files_paths = fs_read.glob(\"s3://wrf-se-ak-ar5/ccsm/rcp85/daily/2060/*\")\n", "\n", "# Here we prepend the prefix 's3://', which points to AWS.\n", - "file_pattern = sorted([\"s3://\" + f for f in files_paths])\n", - "\n", - "# Define kwargs for `fsspec`\n", - "so = dict(mode=\"rb\", anon=True, default_fill_cache=False, default_cache_type=\"first\")\n", - "\n", - "\n", - "# We are creating a temporary directory to store the .json reference files\n", - "# Alternately, you could write these to cloud storage.\n", - "td = TemporaryDirectory()\n", - "temp_dir = td.name\n", - "temp_dir" + "files_paths = sorted([\"s3://\" + f for f in files_paths])" ] }, { @@ -178,7 +164,11 @@ "metadata": {}, "outputs": [], "source": [ - "file_pattern = file_pattern[0:40]" + "# If the subset_flag == True (default), the list of input files will\n", + "# be subset to speed up the processing\n", + "subset_flag = True\n", + "if subset_flag:\n", + " files_paths = files_paths[0:4]" ] }, { @@ -188,9 +178,9 @@ "source": [ "## Dask Specific Changes\n", "\n", - "Here is the section of code that will change. Instead of iterating through each input file and using `generate_json_reference` to create the `Kerchunk` reference files, we are iterating through our input file list and creating `Dask Delayed Objects`. It is not super important to understand this, but a `Dask Delayed Object` is lazy, meaning it is not computed eagerly. Once we have iterated through all our input files, we end up with a list of `Dask Delayed Objects`. \n", + "Here is the section of code that will change. Instead of iterating through each input file and using `open_virtual_dataset` to create the virtual datasets, we are iterating through our input file list and creating `Dask Delayed Objects`. It is not super important to understand this, but a `Dask Delayed Object` is lazy, meaning it is not computed eagerly. Once we have iterated through all our input files, we end up with a list of `Dask Delayed Objects`. \n", "\n", - "When we are ready, we can call `dask.compute` on this list of delayed objects to create `Kerchunk` reference files in parallel. \n", + "When we are ready, we can call `dask.compute` on this list of delayed objects to create virtual datasets in parallel. 
\n", "\n" ] }, @@ -200,22 +190,24 @@ "metadata": {}, "outputs": [], "source": [ - "# Use Kerchunk's `SingleHdf5ToZarr` method to create a `Kerchunk` index\n", - "# from a NetCDF file.\n", - "\n", - "\n", - "def generate_json_reference(fil, temp_dir: str):\n", - " with fs_read.open(fil, **so) as infile:\n", - " h5chunks = SingleHdf5ToZarr(infile, fil, inline_threshold=300)\n", - " fname = fil.split(\"/\")[-1].strip(\".nc\")\n", - " outf = f\"{temp_dir}/{fname}.json\"\n", - " with open(outf, \"wb\") as f:\n", - " f.write(ujson.dumps(h5chunks.translate()).encode())\n", - " return outf\n", - "\n", - "\n", + "def generate_virtual_dataset(file, storage_options):\n", + " return open_virtual_dataset(\n", + " file, indexes={}, reader_options={\"storage_options\": storage_options}\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "storage_options = dict(anon=True, default_fill_cache=False, default_cache_type=\"none\")\n", "# Generate Dask Delayed objects\n", - "tasks = [dask.delayed(generate_json_reference)(fil, temp_dir) for fil in file_pattern]" + "tasks = [\n", + " dask.delayed(generate_virtual_dataset)(file, storage_options)\n", + " for file in files_paths\n", + "]" ] }, { @@ -273,8 +265,7 @@ "metadata": {}, "outputs": [], "source": [ - "# Start parallel computation of `SingleHDF5ToZarr`\n", - "dask.compute(tasks)" + "dask.compute(*tasks)" ] }, { @@ -300,28 +291,27 @@ "source": [ "## Timing\n", "\n", - "To demonstrate how Dask can speed-up your `Kerchunk` reference generation, the next section will show the timing of generating reference files with and without Dask.\n", - "For reference, the timing was run on a Large AWS Jupyter-Hub (managed by the fine folks at [2i2c](https://2i2c.org/)) with ~32 CPU and ~256G RAM. It is also important to note that the data is also hosted on AWS. \n", + "To demonstrate how Dask can speed-up your virtual dataset generation, the next section will show the timing of generating reference files with and without Dask.\n", + "For reference, the timing was run on a large AWS Jupyter-Hub (managed by the fine folks at [2i2c](https://2i2c.org/)) with ~16 CPU and ~64 GB RAM. It is also important to note that the data is also hosted on AWS. \n", "\n", "\n", "\n", - "| Serial Kerchunk | Parallel Kerchunk (Dask) |\n", + "| Serial Virtualization | Parallel Virtualization (Dask) |\n", "| -------------------- | ------------------------ |\n", - "| 3min 41s | 39.2 s |\n", + "| 7 min 22 s | 36 s |\n", "\n", "\n", "\n", - "Running our `Dask` version on a subset of 40 files took only ~39 seconds. In comparison, computing the `Kerchunk` indices one-by-one in took about 3 minutes and 41 seconds.\n", + "Running our `Dask` version on the year of data took only ~36 seconds. In comparison, creating the `VirtualiZarr` virtual datasets one-by-one took about 7 minutes and 22 seconds.\n", "\n", "\n", - "Just by changing a few lines of code and using `Dask`, we got our code to run almost **6x faster**. One other detail to note is that there is usually a bit of a delay as `Dask` builds its task graph before any of the tasks are started. All that to say, you may see even better performance when using `Dask` and `Kerchunk` on larger datasets.\n", + "Just by changing a few lines of code and using `Dask`, we got our code to run **11x faster**. One other detail to note is that there is usually a bit of a delay as `Dask` builds its task graph before any of the tasks are started. 
All that to say, you may see even better performance when using `Dask`, `VirtualiZarr`, and `Kerchunk` on larger datasets.\n", "\n", "Note: These timings may vary for you. There are many factors that may affect performance, such as:\n", "- Geographical location of your compute and the source data\n", "- Internet speed\n", "- Compute resources, IO limits and # of workers given to `Dask`\n", "- Location of where to write reference files to (cloud vs local)\n", - "- Type of reference files (`Parquet` vs `.JSON`)\n", "\n", "This is meant to be an example of how `Dask` can be used to speed-up `Kerchunk` not a detailed benchmark in `Kerchunk/Dask` performance. \n", "\n", @@ -336,21 +326,16 @@ "source": [ "## Next Steps\n", "\n", - "In this notebook we demonstrated how `Dask` can be used to parallelize the creation of `Kerchunk` reference files. In the following `Case Studies` section, we will walk though examples of using `Kerchunk` with `Dask` to create virtual cloud-optimized datasets. \n", + "In this notebook we demonstrated how `Dask` can be used to parallelize the creation of virtual datasets. In the following `Case Studies` section, we will walk though examples of using `VirtualiZarr` and `Kerchunk` with `Dask` to create virtual cloud-optimized datasets. \n", "\n", "Additionally, if you wish to explore more of `Kerchunk's` `Dask` integration, you can try checking out `Kerchunk's` [auto_dask](https://fsspec.github.io/kerchunk/reference.html?highlight=auto%20dask#kerchunk.combine.auto_dask) method, which combines many of the `Dask` setup steps into a single convenient function. \n", "\n" ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [] } ], "metadata": { "kernelspec": { - "display_name": "kerchunk-cookbook-dev", + "display_name": "kerchunk-cookbook", "language": "python", "name": "python3" }, @@ -364,14 +349,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.13" - }, - "vscode": { - "interpreter": { - "hash": "89095a95fbc59e1db286735bee0073a08e46abd63daa66f53634eb5c8cc2192a" - } + "version": "3.12.7" } }, "nbformat": 4, - "nbformat_minor": 2 + "nbformat_minor": 4 } diff --git a/notebooks/generating_references/GeoTIFF.ipynb b/notebooks/generating_references/GeoTIFF.ipynb index 05593a96..cb9fbb81 100644 --- a/notebooks/generating_references/GeoTIFF.ipynb +++ b/notebooks/generating_references/GeoTIFF.ipynb @@ -6,7 +6,7 @@ "metadata": {}, "source": [ "# GeoTIFF\n", - "Generating Kerchunk References from GeoTIFF files" + "Generating virutal datasets from GeoTiff files" ] }, { @@ -25,16 +25,16 @@ "\n", "In this tutorial we will cover:\n", "\n", - "1. How to generate `Kerchunk` references of GeoTIFFs.\n", - "1. Combining `Kerchunk` references into a virtual dataset.\n", + "1. How to generate virtual datasets from GeoTIFFs.\n", + "1. 
Combining virtual datasets.\n", "\n", "\n", "## Prerequisites\n", "| Concepts | Importance | Notes |\n", "| --- | --- | --- |\n", - "| [Kerchunk Basics](../foundations/kerchunk_basics) | Required | Core |\n", - "| [Multiple Files and Kerchunk](../foundations/kerchunk_multi_file) | Required | Core |\n", - "| [Kerchunk and Dask](../foundations/kerchunk_dask) | Required | Core |\n", + "| [Basics of virtual Zarr stores](../foundations/01_kerchunk_basics.ipynb) | Required | Core |\n", + "| [Multi-file virtual datasets with VirtualiZarr](../foundations/02_kerchunk_multi_file.ipynb) | Required | Core |\n", + "| [Parallel virtual dataset creation with VirtualiZarr, Kerchunk, and Dask](../foundations/03_kerchunk_dask) | Required | Core |\n", "| [Introduction to Xarray](https://foundations.projectpythia.org/core/xarray/xarray-intro.html) | Required | IO/Visualization |\n", "- **Time to learn**: 30 minutes\n", "---" @@ -59,19 +59,16 @@ "metadata": {}, "outputs": [], "source": [ - "import glob\n", "import logging\n", - "from tempfile import TemporaryDirectory\n", + "from datetime import datetime\n", "\n", "import dask\n", "import fsspec\n", - "import numpy as np\n", "import rioxarray\n", "import s3fs\n", - "import ujson\n", + "import xarray as xr\n", "from distributed import Client\n", - "from kerchunk.combine import MultiZarrToZarr\n", - "from kerchunk.tiff import tiff_to_zarr" + "from virtualizarr import open_virtual_dataset" ] }, { @@ -138,24 +135,7 @@ " \"s3://fmi-opendata-radar-geotiff/2023/01/01/FIN-ACRR-3067-1KM/*24H-3067-1KM.tif\"\n", ")\n", "# Here we prepend the prefix 's3://', which points to AWS.\n", - "file_pattern = sorted([\"s3://\" + f for f in files_paths])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# This dictionary will be passed as kwargs to `fsspec`. 
For more details,\n", - "# check out the `foundations/kerchunk_basics` notebook.\n", - "so = dict(mode=\"rb\", anon=True, default_fill_cache=False, default_cache_type=\"first\")\n", - "\n", - "# We are creating a temporary directory to store the .json reference files\n", - "# Alternately, you could write these to cloud storage.\n", - "td = TemporaryDirectory()\n", - "temp_dir = td.name\n", - "temp_dir" + "files_paths = sorted([\"s3://\" + f for f in files_paths])" ] }, { @@ -183,20 +163,36 @@ "metadata": {}, "outputs": [], "source": [ - "# Use Kerchunk's `tiff_to_zarr` method to create create reference files\n", - "\n", - "\n", - "def generate_json_reference(fil, output_dir: str):\n", - " tiff_chunks = tiff_to_zarr(fil, remote_options={\"protocol\": \"s3\", \"anon\": True})\n", - " fname = fil.split(\"/\")[-1].split(\"_\")[0]\n", - " outf = f\"{output_dir}/{fname}.json\"\n", - " with open(outf, \"wb\") as f:\n", - " f.write(ujson.dumps(tiff_chunks).encode())\n", - " return outf\n", - "\n", - "\n", + "def generate_virtual_dataset(file):\n", + " storage_options = dict(\n", + " anon=True, default_fill_cache=False, default_cache_type=\"none\"\n", + " )\n", + " vds = open_virtual_dataset(\n", + " file,\n", + " indexes={},\n", + " filetype=\"tiff\",\n", + " reader_options={\n", + " \"remote_options\": {\"anon\": True},\n", + " \"storage_options\": storage_options,\n", + " },\n", + " )\n", + " # Pre-process virtual datasets to extract time step information from the filename\n", + " subst = file.split(\"/\")[-1].split(\".json\")[0].split(\"_\")[0]\n", + " time_val = datetime.strptime(subst, \"%Y%m%d%H%M\")\n", + " vds = vds.expand_dims(dim={\"time\": [time_val]})\n", + " # Only include the raw data, not the overviews\n", + " vds = vds[[\"0\"]]\n", + " return vds" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ "# Generate Dask Delayed objects\n", - "tasks = [dask.delayed(generate_json_reference)(fil, temp_dir) for fil in file_pattern]" + "tasks = [dask.delayed(generate_virtual_dataset)(file) for file in files_paths]" ] }, { @@ -209,17 +205,14 @@ "import warnings\n", "\n", "warnings.filterwarnings(\"ignore\")\n", - "dask.compute(tasks)" + "virtual_datasets = dask.compute(*tasks)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Combine Reference Files into Multi-File Reference Dataset\n", - "\n", - "Now we will combine all the reference files generated into a single reference dataset. Since each TIFF file is a single timeslice and the only temporal information is stored in the filepath, we will have to specify the `coo_map` kwarg in `MultiZarrToZarr` to build a dimension from the filepath attributes. 
\n", - "\n" + "## Combine virtual datasets" ] }, { @@ -228,35 +221,8 @@ "metadata": {}, "outputs": [], "source": [ - "ref_files = sorted(glob.iglob(f\"{temp_dir}/*.json\"))\n", - "\n", - "\n", - "# Custom Kerchunk function from `coo_map` to create dimensions\n", - "def fn_to_time(index, fs, var, fn):\n", - " import datetime\n", - "\n", - " subst = fn.split(\"/\")[-1].split(\".json\")[0]\n", - " return datetime.datetime.strptime(subst, \"%Y%m%d%H%M\")\n", - "\n", - "\n", - "mzz = MultiZarrToZarr(\n", - " path=ref_files,\n", - " indicts=ref_files,\n", - " remote_protocol=\"s3\",\n", - " remote_options={\"anon\": True},\n", - " coo_map={\"time\": fn_to_time},\n", - " coo_dtypes={\"time\": np.dtype(\"M8[s]\")},\n", - " concat_dims=[\"time\"],\n", - " identical_dims=[\"X\", \"Y\"],\n", - ")\n", - "\n", - "# # save translate reference in memory for later visualization\n", - "multi_kerchunk = mzz.translate()\n", - "\n", - "# Write kerchunk .json record\n", - "output_fname = \"RADAR.json\"\n", - "with open(f\"{output_fname}\", \"wb\") as f:\n", - " f.write(ujson.dumps(multi_kerchunk).encode())" + "combined_vds = xr.concat(virtual_datasets, dim=\"time\")\n", + "combined_vds" ] }, { @@ -274,11 +240,18 @@ "source": [ "client.shutdown()" ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": { "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": "kerchunk-cookbook", "language": "python", "name": "python3" }, @@ -292,7 +265,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.11" + "version": "3.12.7" } }, "nbformat": 4, diff --git a/notebooks/generating_references/NetCDF.ipynb b/notebooks/generating_references/NetCDF.ipynb index 1ceef291..cd06a806 100644 --- a/notebooks/generating_references/NetCDF.ipynb +++ b/notebooks/generating_references/NetCDF.ipynb @@ -6,7 +6,7 @@ "metadata": {}, "source": [ "# NetCDF\n", - "Generating Kerchunk References from NetCDF files\n" + "Generating virtual datasets from NetCDF files\n" ] }, { @@ -26,18 +26,19 @@ " \n", "Within this notebook, we will cover:\n", "\n", - "1. How to access remote NetCDF data using `Kerchunk`\n", - "1. Combining multiple `Kerchunk` reference files using `MultiZarrToZarr`\n", + "1. How to access remote NetCDF data using `VirtualiZarr` and `Kerchunk`\n", + "1. Combining multiple virtual datasets\n", "\n", - "This notebook shares many similarities with the [Multi-File Datasets with Kerchunk](../foundations/kerchunk_multi_file.ipynb). If you are confused on the function of a block of code, please refer there for a more detailed breakdown of what each line is doing.\n", + "This notebook shares many similarities with the [multi-file virtual datasets with VirtualiZarr](./02_kerchunk_multi_file.ipynb) notebook. 
If you are confused on the function of a block of code, please refer there for a more detailed breakdown of what each line is doing.\n", "\n", "\n", "## Prerequisites\n", "| Concepts | Importance | Notes |\n", "| --- | --- | --- |\n", - "| [Kerchunk Basics](../foundations/kerchunk_basics) | Required | Core |\n", - "| [Multiple Files and Kerchunk](../foundations/kerchunk_multi_file) | Required | Core |\n", - "| [Kerchunk and Dask](../foundations/kerchunk_dask) | Required | Core |\n", + "| [Basics of virtual Zarr stores](../foundations/01_kerchunk_basics.ipynb) | Required | Core |\n", + "| [Multi-file virtual datasets with VirtualiZarr](../foundations/02_kerchunk_multi_file.ipynb) | Required | Core |\n", + "| [Parallel virtual dataset creation with VirtualiZarr, Kerchunk, and Dask](../foundations/03_kerchunk_dask) | Required | Core |\n", + "| [Introduction to Xarray](https://foundations.projectpythia.org/core/xarray/xarray-intro.html) | Required | IO/Visualization |\n", "- **Time to learn**: 45 minutes\n", "---" ] @@ -49,7 +50,7 @@ "source": [ "## Motivation\n", "\n", - "NetCDF4/HDF5 is one of the most universally adopted file formats in earth sciences, with support of much of the community as well as scientific agencies, data centers and university labs. A huge amount of legacy data has been generated in this format. Fortunately, using `Kerchunk`, we can read these datasets as if they were an Analysis-Read Cloud-Optimized (ARCO) format such as `Zarr`." + "NetCDF4/HDF5 is one of the most universally adopted file formats in earth sciences, with support of much of the community as well as scientific agencies, data centers and university labs. A huge amount of legacy data has been generated in this format. Fortunately, using `VirtualiZarr` and `Kerchunk`, we can read these datasets as if they were an Analysis-Read Cloud-Optimized (ARCO) format such as `Zarr`." ] }, { @@ -60,10 +61,10 @@ "## About the Dataset\n", "\n", "For this example, we will look at a weather dataset composed of multiple NetCDF files.The SMN-Arg is a WRF deterministic weather forecasting dataset created by the `Servicio Meteorológico Nacional de Argentina` that covers Argentina as well as many neighboring countries at a 4km spatial resolution. \n", + "\n", "The model is initialized twice daily at 00 & 12 UTC with hourly forecasts for variables such as temperature, relative humidity, precipitation, wind direction and magnitude etc. for multiple atmospheric levels.\n", "The data is output at hourly intervals with a maximum prediction lead time of 72 hours in NetCDF files.\n", "\n", - "\n", "More details on this dataset can be found [here](https://registry.opendata.aws/smn-ar-wrf-dataset/).\n" ] }, @@ -99,18 +100,14 @@ "metadata": {}, "outputs": [], "source": [ - "import glob\n", "import logging\n", - "from tempfile import TemporaryDirectory\n", "\n", "import dask\n", "import fsspec\n", "import s3fs\n", - "import ujson\n", "import xarray as xr\n", "from distributed import Client\n", - "from kerchunk.combine import MultiZarrToZarr\n", - "from kerchunk.hdf import SingleHdf5ToZarr" + "from virtualizarr import open_virtual_dataset" ] }, { @@ -120,7 +117,7 @@ "source": [ "### Examining a Single NetCDF File\n", "\n", - "Before we use `Kerchunk` to create indices for multiple files, we can load a single NetCDF file to examine it. \n", + "Before we use `VirtualiZarr` to create virtual datasets for multiple files, we can load a single NetCDF file to examine it. 
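It can also be instructive to open the same file as a *virtual* dataset, to see the chunk-manifest view that `VirtualiZarr` works with. The following is a sketch: the S3 path is the sample file opened in the next cell, and no array data is downloaded.

```python
from virtualizarr import open_virtual_dataset

# Sketch: the virtual view of a single file. Variables are backed by manifest
# arrays that record byte ranges in the original NetCDF rather than loaded data.
vds = open_virtual_dataset(
    "s3://smn-ar-wrf/DATA/WRF/DET/2022/12/31/12/WRFDETAR_01H_20221231_12_000.nc",
    indexes={},
    reader_options={"storage_options": {"anon": True}},
)
print(vds)
```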
\n", "\n" ] }, @@ -139,22 +136,13 @@ "ds = xr.open_dataset(fs.open(url), engine=\"h5netcdf\")" ] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ds" - ] - }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Here we see the `repr` from the `Xarray` Dataset of a single `NetCDF` file. From examining the output, we can tell that the Dataset dimensions are `['time','y','x']`, with time being only a single step.\n", - "Later, when we use `Kerchunk's` `MultiZarrToZarr` functionality, we will need to know on which dimensions to concatenate across. \n", + "Later, when we use `Xarray's` `combine_nested` functionality, we will need to know on which dimensions to concatenate across. \n", "\n" ] }, @@ -180,30 +168,13 @@ "files_paths = fs_read.glob(\"s3://smn-ar-wrf/DATA/WRF/DET/2022/12/31/12/*\")\n", "\n", "# Here we prepend the prefix 's3://', which points to AWS.\n", - "file_pattern = sorted([\"s3://\" + f for f in files_paths])\n", + "files_paths = sorted([\"s3://\" + f for f in files_paths])\n", "\n", "\n", "# If the subset_flag == True (default), the list of input files will be subset\n", "# to speed up the processing\n", "if subset_flag:\n", - " file_pattern = file_pattern[0:8]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# This dictionary will be passed as kwargs to `fsspec`. For more details, check out the\n", - "# `foundations/kerchunk_basics` notebook.\n", - "so = dict(mode=\"rb\", anon=True, default_fill_cache=False, default_cache_type=\"first\")\n", - "\n", - "# We are creating a temporary directory to store the .json reference files\n", - "# Alternately, you could write these to cloud storage.\n", - "td = TemporaryDirectory()\n", - "temp_dir = td.name\n", - "temp_dir" + " files_paths = files_paths[0:8]" ] }, { @@ -232,22 +203,18 @@ "metadata": {}, "outputs": [], "source": [ - "# Use Kerchunk's `SingleHdf5ToZarr` method to create a `Kerchunk` index from\n", - "# a NetCDF file.\n", - "\n", - "\n", - "def generate_json_reference(fil, output_dir: str):\n", - " with fs_read.open(fil, **so) as infile:\n", - " h5chunks = SingleHdf5ToZarr(infile, fil, inline_threshold=300)\n", - " fname = fil.split(\"/\")[-1].strip(\".nc\")\n", - " outf = f\"{output_dir}/{fname}.json\"\n", - " with open(outf, \"wb\") as f:\n", - " f.write(ujson.dumps(h5chunks.translate()).encode())\n", - " return outf\n", + "def generate_virtual_dataset(file, storage_options):\n", + " return open_virtual_dataset(\n", + " file, indexes={}, reader_options={\"storage_options\": storage_options}\n", + " )\n", "\n", "\n", + "storage_options = dict(anon=True, default_fill_cache=False, default_cache_type=\"none\")\n", "# Generate Dask Delayed objects\n", - "tasks = [dask.delayed(generate_json_reference)(fil, temp_dir) for fil in file_pattern]" + "tasks = [\n", + " dask.delayed(generate_virtual_dataset)(file, storage_options)\n", + " for file in files_paths\n", + "]" ] }, { @@ -260,7 +227,7 @@ "import warnings\n", "\n", "warnings.filterwarnings(\"ignore\")\n", - "dask.compute(tasks)" + "virtual_datasets = list(dask.compute(*tasks))" ] }, { @@ -268,9 +235,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Combine .json `Kerchunk` reference files and write a combined `Kerchunk` index\n", + "## Combine virtual datasets and write a Kerchunk reference JSON to store the virtual Zarr store\n", "\n", - "In the following cell, we are combining all the `.json` reference files that 
were generated above into a single reference file and writing that file to disk." + "In the following cell, we are combining all the `virtual datasets that were generated above into a single reference file and writing that file to disk." ] }, { @@ -279,23 +246,19 @@ "metadata": {}, "outputs": [], "source": [ - "# Create a list of reference json files\n", - "output_files = glob.glob(f\"{temp_dir}/*.json\")\n", - "\n", - "# combine individual references into single consolidated reference\n", - "mzz = MultiZarrToZarr(\n", - " output_files,\n", - " concat_dims=[\"time\"],\n", - " identical_dims=[\"y\", \"x\"],\n", - " remote_options={\"anon\": True},\n", + "combined_vds = xr.combine_nested(\n", + " virtual_datasets, concat_dim=[\"time\"], coords=\"minimal\", compat=\"override\"\n", ")\n", - "# save translate reference in memory for later visualization\n", - "multi_kerchunk = mzz.translate()\n", - "\n", - "# Write kerchunk .json record.\n", - "output_fname = \"ARG_combined.json\"\n", - "with open(f\"{output_fname}\", \"wb\") as f:\n", - " f.write(ujson.dumps(multi_kerchunk).encode())" + "combined_vds" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "combined_vds.virtualize.to_kerchunk(\"ARG_combined.json\", format=\"json\")" ] }, { @@ -317,7 +280,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": "kerchunk-cookbook", "language": "python", "name": "python3" }, @@ -331,12 +294,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.11" - }, - "vscode": { - "interpreter": { - "hash": "89095a95fbc59e1db286735bee0073a08e46abd63daa66f53634eb5c8cc2192a" - } + "version": "3.12.7" } }, "nbformat": 4,
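Finally, it is worth checking that the reference file written above round-trips: the JSON can be opened through `fsspec`'s reference filesystem and read lazily with `Xarray` and the `Zarr` engine. This is a sketch; the exact keyword arguments can vary with your `zarr` and `fsspec` versions.

```python
import fsspec
import xarray as xr

# Sketch: open the Kerchunk JSON written above as a virtual Zarr store.
# The references point back at the original (anonymous-access) S3 objects.
fs = fsspec.filesystem(
    "reference",
    fo="ARG_combined.json",
    remote_protocol="s3",
    remote_options={"anon": True},
)
ds = xr.open_dataset(
    fs.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False}
)
ds
```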