Update notebooks to use VirtualiZarr (#69)
maxrjones authored Dec 2, 2024
1 parent f70420e commit c890147
Showing 8 changed files with 328 additions and 503 deletions.
44 changes: 22 additions & 22 deletions README.md
@@ -1,33 +1,33 @@
<img src="thumbnail.png" alt="thumbnail" width="300"/>

# Kerchunk Cookbook
# Virtual Zarr Cookbook (Kerchunk and VirtualiZarr)

[![nightly-build](https://github.com/ProjectPythia/kerchunk-cookbook/actions/workflows/nightly-build.yaml/badge.svg)](https://github.com/ProjectPythia/kerchunk-cookbook/actions/workflows/nightly-build.yaml)
[![Binder](https://binder.projectpythia.org/badge_logo.svg)](https://binder.projectpythia.org/v2/gh/ProjectPythia/kerchunk-cookbook/main?labpath=notebooks)
[![DOI](https://zenodo.org/badge/588661659.svg)](https://zenodo.org/badge/latestdoi/588661659)

This Project Pythia Cookbook covers using the [Kerchunk](https://fsspec.github.io/kerchunk/)
library to access archival data formats as if they were
ARCO (Analysis-Ready-Cloud-Optimized) data.
This Project Pythia Cookbook covers using the [Kerchunk](https://fsspec.github.io/kerchunk/), [VirtualiZarr](https://virtualizarr.readthedocs.io/en/latest/index.html), and [Zarr-Python](https://zarr.readthedocs.io/en/stable/) libraries to access archival data formats as if they were ARCO (Analysis-Ready-Cloud-Optimized) data.

## Motivation

The `Kerchunk` library allows you to access chunked and compressed
The `Kerchunk` library pioneered access to chunked and compressed
data formats (such as NetCDF3, HDF5, GRIB2, TIFF & FITS), many of
which are the primary data formats for many data archives, as if
they were in ARCO formats such as Zarr, which allows for parallel,
chunk-specific access. Instead of creating a new copy of the dataset
in the Zarr spec/format, `Kerchunk` reads through the data archive
and extracts the byte range and compression information of each
chunk, then writes that information to a .json file (or alternate
backends in future releases). For more details on how this process
works please see this page on the
[Kerchunk docs](https://fsspec.github.io/kerchunk/detail.html)).
These summary files can then be combined to generated a `Kerchunk`
reference for that dataset, which can be read via
[Zarr](https://zarr.readthedocs.io) and
chunk, then writes that information to a "virtual Zarr store" using a
JSON or Parquet "reference file". The `VirtualiZarr`
library provides a simple way to create these \"virtual stores\" using familiar
`xarray` syntax. Lastly, the `icechunk` library offers a new way to store and re-use these references.

These virtual Zarr stores can be re-used and read via [Zarr](https://zarr.readthedocs.io) and
[Xarray](https://docs.xarray.dev/en/stable/).

For more details on how this process works, please see this page in the
[Kerchunk docs](https://fsspec.github.io/kerchunk/detail.html).
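
As a rough sketch of this workflow (a minimal, hypothetical example: the file names are placeholders, and the calls follow the VirtualiZarr and Kerchunk APIs used in the cookbook notebooks, which may change between releases):

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

# Open each archival file as a "virtual" dataset of chunk references
# (placeholder file names; no data is copied or rewritten)
virtual_datasets = [
    open_virtual_dataset(path, indexes={}) for path in ["day1.nc", "day2.nc"]
]

# Concatenate the per-file references along time without loading any chunks
combined = xr.combine_nested(
    virtual_datasets, concat_dim=["time"], coords="minimal", compat="override"
)

# Persist the combined references as a Kerchunk JSON reference file
combined.virtualize.to_kerchunk("combined.json", format="json")

# Later, read the virtual Zarr store lazily with Xarray
ds = xr.open_dataset("combined.json", engine="kerchunk", chunks={})
```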

## Authors

[Raphael Hagen](https://github.com/norlandrhagen)
@@ -48,24 +48,24 @@ the creator of `Kerchunk` and the
This cookbook is broken up into two sections,
Foundations and Example Notebooks.

### Section 1 Foundations
### Section 1 - Foundations

In the `Foundations` section we will demonstrate
how to use `Kerchunk` to create reference sets
how to use `Kerchunk` and `VirtualiZarr` to create reference files
from single file sources, as well as to create
multi-file virtual datasets from collections of files.
multi-file virtual Zarr stores from collections of files.

### Section 2 Generating Reference Files
### Section 2 - Generating Virtual Zarr Stores

The notebooks in the `Generating Reference Files` section
demonstrate how to use `Kerchunk` to create
The notebooks in the `Generating Virtual Zarr Stores` section
demonstrate how to use `Kerchunk` and `VirtualiZarr` to create
datasets for all the supported file formats.
`Kerchunk` currently supports NetCDF3,
NetCDF4/HDF5, GRIB2, TIFF (including CoG).
These libraries currently support virtualizing NetCDF3,
NetCDF4/HDF5, GRIB2, TIFF (including COG).

### Section 3 Using Pre-Generated References
### Section 3 - Using Virtual Zarr Stores

The `Pre-Generated References` section contains notebooks demonstrating how to load existing references into `Xarray` and `Xarray-Datatree`, generated coordinates for GeoTiffs using `xrefcoord` and plotting using `Hvplot Datashader`.
The `Using Virtual Zarr Stores` section contains notebooks demonstrating how to load existing references into `Xarray`, generate coordinates for GeoTIFFs using `xrefcoord`, and plot using `Hvplot` and `Datashader`.

## Running the Notebooks

4 changes: 2 additions & 2 deletions environment.yml
@@ -3,7 +3,7 @@ channels:
- conda-forge
dependencies:
- cfgrib
- dask
- dask>=2024.10.0
- dask-labextension
- datashader
- distributed
@@ -28,12 +28,12 @@ dependencies:
- pooch
- pre-commit
- pyOpenSSL
- python=3.10
- rioxarray
- s3fs
- scipy
- tifffile
- ujson
- virtualizarr
- xarray>=2024.10.0
- zarr
- sphinx-pythia-theme
148 changes: 70 additions & 78 deletions notebooks/advanced/Parquet_Reference_Storage.ipynb
@@ -14,7 +14,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Store Kerchunk Reference Files as Parquet"
"# Store virtual datasets as Kerchunk Parquet references"
]
},
{
@@ -24,18 +24,18 @@
"source": [
"## Overview\n",
" \n",
"In this notebook we will cover how to store Kerchunk references as Parquet files instead of json. For large reference datasets, using Parquet should have performance implications as the overall reference file size should be smaller and the memory overhead of combining the reference files should be lower. \n",
"In this notebook we will cover how to store virtual datasets as Kerchunk Parquet references instead of Kerchunk JSON references. For large virtual datasets, using Parquet should have performance implications as the overall reference file size should be smaller and the memory overhead of combining the reference files should be lower. \n",
"\n",
"\n",
"This notebook builds upon the [Kerchunk Basics](notebooks/foundations/01_kerchunk_basics.ipynb), [Multi-File Datasets with Kerchunk](notebooks/foundations/02_kerchunk_multi_file.ipynb) and the [Kerchunk and Dask](notebooks/foundations/03_kerchunk_dask.ipynb) notebooks. \n",
"\n",
"## Prerequisites\n",
"| Concepts | Importance | Notes |\n",
"| --- | --- | --- |\n",
"| [Kerchunk Basics](../foundations/kerchunk_basics) | Required | Core |\n",
"| [Multiple Files and Kerchunk](../foundations/kerchunk_multi_file) | Required | Core |\n",
"| [Introduction to Xarray](https://foundations.projectpythia.org/core/xarray/xarray-intro.html) | Recommended | IO/Visualization |\n",
"| [Intro to Dask](https://tutorial.dask.org/00_overview.html) | Required | Parallel Processing |\n",
"| [Basics of virtual Zarr stores](../foundations/01_kerchunk_basics.ipynb) | Required | Core |\n",
"| [Multi-file virtual datasets with VirtualiZarr](../foundations/02_kerchunk_multi_file.ipynb) | Required | Core |\n",
"| [Parallel virtual dataset creation with VirtualiZarr, Kerchunk, and Dask](../foundations/03_kerchunk_dask) | Required | Core |\n",
"| [Introduction to Xarray](https://foundations.projectpythia.org/core/xarray/xarray-intro.html) | Required | IO/Visualization |\n",
"\n",
"- **Time to learn**: 30 minutes\n",
"---"
@@ -47,8 +47,6 @@
"metadata": {},
"source": [
"## Imports\n",
"In addition to the previous imports we used throughout the tutorial, we are adding a few imports:\n",
"- `LazyReferenceMapper` and `ReferenceFileSystem` from `fsspec.implementations.reference` for lazy Parquet.\n",
"\n",
"\n"
]
@@ -60,15 +58,12 @@
"outputs": [],
"source": [
"import logging\n",
"import os\n",
"\n",
"import dask\n",
"import fsspec\n",
"import xarray as xr\n",
"from distributed import Client\n",
"from fsspec.implementations.reference import LazyReferenceMapper, ReferenceFileSystem\n",
"from kerchunk.combine import MultiZarrToZarr\n",
"from kerchunk.hdf import SingleHdf5ToZarr"
"from virtualizarr import open_virtual_dataset"
]
},
{
@@ -111,10 +106,28 @@
"files_paths = fs_read.glob(\"s3://smn-ar-wrf/DATA/WRF/DET/2022/12/31/12/*\")\n",
"\n",
"# Here we prepend the prefix 's3://', which points to AWS.\n",
"file_pattern = sorted([\"s3://\" + f for f in files_paths])\n",
"\n",
"# Grab the first seven files to speed up example.\n",
"file_pattern = file_pattern[0:2]"
"files_paths = sorted([\"s3://\" + f for f in files_paths])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Subset the Data\n",
"To speed up our example, lets take a subset of the year of data. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# If the subset_flag == True (default), the list of input files will\n",
"# be subset to speed up the processing\n",
"subset_flag = True\n",
"if subset_flag:\n",
" files_paths = files_paths[0:4]"
]
},
{
@@ -123,7 +136,8 @@
"metadata": {},
"source": [
"# Generate Lazy References\n",
"Below we will create a `fsspec` filesystem to read the references from `s3` and create a function to generate dask delayed tasks."
"\n",
"Here we create a function to generate a list of Dask delayed objects."
]
},
{
@@ -132,20 +146,18 @@
"metadata": {},
"outputs": [],
"source": [
"# Use Kerchunk's `SingleHdf5ToZarr` method to create a `Kerchunk`\n",
"# index from a NetCDF file.\n",
"fs_read = fsspec.filesystem(\"s3\", anon=True, skip_instance_cache=True)\n",
"so = dict(mode=\"rb\", anon=True, default_fill_cache=False, default_cache_type=\"first\")\n",
"\n",
"\n",
"def generate_json_reference(fil):\n",
" with fs_read.open(fil, **so) as infile:\n",
" h5chunks = SingleHdf5ToZarr(infile, fil, inline_threshold=300)\n",
" return h5chunks.translate() # outf\n",
"def generate_virtual_dataset(file, storage_options):\n",
" return open_virtual_dataset(\n",
" file, indexes={}, reader_options={\"storage_options\": storage_options}\n",
" )\n",
"\n",
"\n",
"storage_options = dict(anon=True, default_fill_cache=False, default_cache_type=\"first\")\n",
"# Generate Dask Delayed objects\n",
"tasks = [dask.delayed(generate_json_reference)(fil) for fil in file_pattern]"
"tasks = [\n",
" dask.delayed(generate_virtual_dataset)(file, storage_options)\n",
" for file in files_paths\n",
"]"
]
},
{
@@ -163,7 +175,15 @@
"metadata": {},
"outputs": [],
"source": [
"single_refs = dask.compute(tasks)[0]"
"virtual_datasets = list(dask.compute(*tasks))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Combine virtual datasets using VirtualiZarr"
]
},
{
@@ -172,23 +192,17 @@
"metadata": {},
"outputs": [],
"source": [
"len(single_refs)"
"combined_vds = xr.combine_nested(\n",
" virtual_datasets, concat_dim=[\"time\"], coords=\"minimal\", compat=\"override\"\n",
")\n",
"combined_vds"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Combine In-Memory References with MultiZarrToZarr\n",
"This section will look notably different than the previous examples that have written to `.json`.\n",
"\n",
"In the following code block we are:\n",
"- Creating an `fsspec` filesystem.\n",
"- Create a empty `parquet` file to write to. \n",
"- Creating an `fsspec` `LazyReferenceMapper` to pass into `MultiZarrToZarr`\n",
"- Building a `MultiZarrToZarr` object of the combined references.\n",
"- Calling `.flush()` on our LazyReferenceMapper, to write the combined reference to our `parquet` file."
"## Write the virtual dataset to a Kerchunk Parquet reference"
]
},
{
@@ -197,26 +211,7 @@
"metadata": {},
"outputs": [],
"source": [
"fs = fsspec.filesystem(\"file\")\n",
"\n",
"if os.path.exists(\"combined.parq\"):\n",
" import shutil\n",
"\n",
" shutil.rmtree(\"combined.parq\")\n",
"os.makedirs(\"combined.parq\")\n",
"\n",
"out = LazyReferenceMapper.create(root=\"combined.parq\", fs=fs, record_size=1000)\n",
"\n",
"mzz = MultiZarrToZarr(\n",
" single_refs,\n",
" remote_protocol=\"s3\",\n",
" concat_dims=[\"time\"],\n",
" identical_dims=[\"y\", \"x\"],\n",
" remote_options={\"anon\": True},\n",
" out=out,\n",
").translate()\n",
"\n",
"out.flush()"
"combined_vds.virtualize.to_kerchunk(\"combined.parq\", format=\"parquet\")"
]
},
{
@@ -255,31 +250,28 @@
"metadata": {},
"outputs": [],
"source": [
"fs = ReferenceFileSystem(\n",
"storage_options = {\n",
" \"remote_protocol\": \"s3\",\n",
" \"skip_instance_cache\": True,\n",
" \"remote_options\": {\"anon\": True},\n",
" \"target_protocol\": \"file\",\n",
" \"lazy\": True,\n",
"} # options passed to fsspec\n",
"open_dataset_options = {\"chunks\": {}} # opens passed to xarray\n",
"\n",
"ds = xr.open_dataset(\n",
" \"combined.parq\",\n",
" remote_protocol=\"s3\",\n",
" target_protocol=\"file\",\n",
" lazy=True,\n",
" remote_options={\"anon\": True},\n",
" engine=\"kerchunk\",\n",
" storage_options=storage_options,\n",
" open_dataset_options=open_dataset_options,\n",
")\n",
"ds = xr.open_dataset(\n",
" fs.get_mapper(), engine=\"zarr\", backend_kwargs={\"consolidated\": False}\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ds"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "kerchunk-cookbook-dev",
"display_name": "kerchunk-cookbook",
"language": "python",
"name": "python3"
},
@@ -293,7 +285,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
"version": "3.12.7"
}
},
"nbformat": 4,