From f70420ecb7ad2c7c30d9136145b89f0086274346 Mon Sep 17 00:00:00 2001
From: Max Jones <14077947+maxrjones@users.noreply.github.com>
Date: Mon, 11 Nov 2024 10:02:40 -0700
Subject: [PATCH] Remove Pangeo-Forge recipes notebook (#70)

---
 _toc.yml                                  |   1 -
 environment.yml                           |   5 +-
 notebooks/advanced/Pangeo_Forge.ipynb     | 230 ----------------------
 notebooks/using_references/Datatree.ipynb |   5 +-
 4 files changed, 3 insertions(+), 238 deletions(-)
 delete mode 100644 notebooks/advanced/Pangeo_Forge.ipynb

diff --git a/_toc.yml b/_toc.yml
index ea322130..f9b541d5 100644
--- a/_toc.yml
+++ b/_toc.yml
@@ -14,7 +14,6 @@ parts:
   - caption: Advanced
     chapters:
       - file: notebooks/advanced/Parquet_Reference_Storage
-      - file: notebooks/advanced/Pangeo_Forge
       - file: notebooks/advanced/appending
 
   - caption: Generating Reference Files
diff --git a/environment.yml b/environment.yml
index 57124f83..34e2d2b5 100644
--- a/environment.yml
+++ b/environment.yml
@@ -34,12 +34,9 @@ dependencies:
   - scipy
   - tifffile
   - ujson
-  - xarray
-  - xarray-datatree
+  - xarray>=2024.10.0
   - zarr
   - sphinx-pythia-theme
   - pip:
-      - "apache-beam[interactive, dataframe]"
-      - git+https://github.com/pangeo-forge/pangeo-forge-recipes
       - git+https://github.com/carbonplan/xrefcoord.git
       - git+https://github.com/fsspec/kerchunk
diff --git a/notebooks/advanced/Pangeo_Forge.ipynb b/notebooks/advanced/Pangeo_Forge.ipynb
deleted file mode 100644
index 4aafcbe7..00000000
--- a/notebooks/advanced/Pangeo_Forge.ipynb
+++ /dev/null
@@ -1,230 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# Kerchunk and Pangeo-Forge\n",
-    "\n"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Overview\n",
-    "\n",
-    "In this tutorial we are going to use the open-source ETL pipeline `pangeo-forge-recipes` to generate Kerchunk references.\n",
-    "\n",
-    "Pangeo-Forge is a community project to build reproducible, cloud-native, ARCO (Analysis-Ready, Cloud-Optimized) datasets. The Python library `pangeo-forge-recipes` is the ETL pipeline used to process these datasets, or \"recipes\". While most recipes convert a legacy format such as NetCDF to Zarr stores, `pangeo-forge-recipes` can also use Kerchunk under the hood to create reference recipes.\n",
-    "\n",
-    "It is important to note that `Kerchunk` can be used independently of `pangeo-forge-recipes`; in this example, `pangeo-forge-recipes` acts as the runner for `Kerchunk`.\n",
-    "\n",
-    "## Prerequisites\n",
-    "| Concepts | Importance | Notes |\n",
-    "| --- | --- | --- |\n",
-    "| [Kerchunk Basics](../foundations/kerchunk_basics) | Required | Core |\n",
-    "| [Multiple Files and Kerchunk](../foundations/kerchunk_multi_file) | Required | Core |\n",
-    "| [Multi-File Datasets with Kerchunk](../case_studies/ARG_Weather.ipynb) | Required | IO/Visualization |\n",
-    "\n",
-    "- **Time to learn**: 45 minutes\n",
-    "---"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Getting to Know The Data\n",
-    "\n",
-    "`gridMET` is a high-resolution daily meteorological dataset covering CONUS from 1979-2023. It is produced by the Climatology Lab at UC Merced. In this example, we are going to create a virtual Zarr dataset of a derived variable, Burn Index.\n"
" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Examine a Single File" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "import xarray as xr\n", - "\n", - "ds = xr.open_dataset(\n", - " \"http://thredds.northwestknowledge.net:8080/thredds/dodsC/MET/bi/bi_2021.nc\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Plot the Dataset" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "ds.sel(day=\"2021-08-01\").burning_index_g.plot()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Create a File Pattern\n", - "\n", - "To build our `pangeo-forge` pipeline, we need to create a `FilePattern` object, which is composed of all of our input urls. This dataset ranges from 1979 through 2023 and is composed of one year per file. \n", - " \n", - "To speed up our example, we will `prune` our recipe to select the first two entries in the `FilePattern`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "from pangeo_forge_recipes.patterns import ConcatDim, FilePattern\n", - "\n", - "years = list(range(1979, 2022 + 1))\n", - "\n", - "\n", - "time_dim = ConcatDim(\"time\", keys=years)\n", - "\n", - "\n", - "def format_function(time):\n", - " return f\"http://www.northwestknowledge.net/metdata/data/bi_{time}.nc\"\n", - "\n", - "\n", - "pattern = FilePattern(format_function, time_dim, file_type=\"netcdf4\")\n", - "\n", - "\n", - "pattern = pattern.prune()\n", - "\n", - "pattern" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Create a Location For Output\n", - "We write to local storage for this example, but the reference file could also be shared via cloud storage. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "target_root = \"references\"\n", - "store_name = \"Pangeo_Forge\"" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Build the Pangeo-Forge Beam Pipeline\n", - "\n", - "Next, we will chain together a bunch of methods to create a Pangeo-Forge - Apache Beam pipeline. \n", - "Processing steps are chained together with the pipe operator (`|`). Once the pipeline is built, it can be ran in the following cell. \n", - "\n", - "The steps are as follows:\n", - "1. Creates a starting collection of our input file patterns.\n", - "2. Passes those file_patterns to `OpenWithKerchunk`, which creates references of each file.\n", - "3. Combines the references files into a single reference file and write them with `WriteCombineReferences`\n", - "\n", - "Just like Kerchunk, you can specify the reference file type as either `.json` or `.parquet`.\n", - "\n", - "Note: You can add additional processing steps in this pipeline. 
\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "import apache_beam as beam\n", - "from pangeo_forge_recipes.transforms import OpenWithKerchunk, WriteCombinedReference\n", - "\n", - "transforms = (\n", - " # Create a beam PCollection from our input file pattern\n", - " beam.Create(pattern.items())\n", - " # Open with Kerchunk and create references for each file\n", - " | OpenWithKerchunk(file_type=pattern.file_type)\n", - " # Use Kerchunk's `MultiZarrToZarr` functionality to combine and\n", - " # then write references. Note: Setting the correct contact_dims\n", - " # and identical_dims is important.\n", - " | WriteCombinedReference(\n", - " target_root=target_root,\n", - " store_name=store_name,\n", - " output_file_name=\"reference.json\",\n", - " concat_dims=[\"day\"],\n", - " identical_dims=[\"lat\", \"lon\", \"crs\"],\n", - " )\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "%%time\n", - "\n", - "with beam.Pipeline() as p:\n", - " p | transforms" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.11" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/notebooks/using_references/Datatree.ipynb b/notebooks/using_references/Datatree.ipynb index ded31caa..67cec16b 100644 --- a/notebooks/using_references/Datatree.ipynb +++ b/notebooks/using_references/Datatree.ipynb @@ -15,8 +15,7 @@ "source": [ "## Overview\n", "\n", - "In this tutorial we are going to use a large collection of pre-generated `Kerchunk` reference files and open them with [xarray-datatree](https://xarray-datatree.readthedocs.io/en/latest/). This chapter is heavily inspired by [this blog post](https://medium.com/pangeo/easy-ipcc-part-1-multi-model-datatree-469b87cf9114).\n", - "\n", + "In this tutorial we are going to use a large collection of pre-generated `Kerchunk` reference files and open them with Xarray's new [DataTree](https://docs.xarray.dev/en/stable/generated/xarray.DataTree.html) functionality. This chapter is heavily inspired by [this blog post](https://medium.com/pangeo/easy-ipcc-part-1-multi-model-datatree-469b87cf9114).\n", "\n", "\n", "### About the Dataset\n", @@ -58,7 +57,7 @@ "import hvplot.xarray # noqa\n", "import pandas as pd\n", "import xarray as xr\n", - "from datatree import DataTree\n", + "from xarray import DataTree\n", "from distributed import Client\n", "from fsspec.implementations.reference import ReferenceFileSystem" ]