Added a new documentation page for faster GRIB aggregations #495

Merged Sep 9, 2024

Changes: ``docs/source/reference_aggregation.rst``

Aggregation special cases
=========================

As we have already seen on this `page <https://fsspec.github.io/kerchunk/test_example.html#multi-file-jsons>`_,
the main purpose of ``kerchunk`` is to generate references that let us view whole
archives of files such as GRIB2 and NetCDF, allowing direct access to the data. In
this part of the documentation, we will look at some other efficient ways of
combining references.

GRIB Aggregations
-----------------

This reference aggregation method for GRIB files, developed by **Camus Energy**, works when
accompanying ``.idx`` files are present.

**But this procedure has certain restrictions:**

- GRIB files must be paired with their ``.idx`` files.
- The ``.idx`` file must be of *text* type.
- The method is specialised for time-series data, where the GRIB files
  have an *identical* structure.
- Aggregation only works on files of a specific **forecast horizon**.
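
The first restriction above can be checked cheaply before attempting an aggregation. A minimal sketch (the file paths are hypothetical):

```python
def missing_idx(paths: list[str]) -> list[str]:
    """Return the GRIB paths that lack an accompanying ``.idx`` sidecar file."""
    present = set(paths)
    return [
        p for p in paths
        if p.endswith(".grib2") and f"{p}.idx" not in present
    ]

# Hypothetical listing, e.g. from fsspec's ls() on an S3 bucket.
files = [
    "gefs/gep01.t00z.pgrb2a.0p50.f000.grib2",
    "gefs/gep01.t00z.pgrb2a.0p50.f000.grib2.idx",
    "gefs/gep01.t00z.pgrb2a.0p50.f003.grib2",
]
print(missing_idx(files))  # the f003 file has no sidecar
```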

**@emfdavid** (Contributor), Aug 28, 2024:

    The reference index can be combined across many horizons, but each horizon must be indexed separately.
    Looking forward to seeing what you make of the reinflate API... there you can see that all of the FMRC slices are supported against a collection of indexed data from many horizons, runtimes and valid times.
Member:

    > the reinflate api

    ooh, what is this?

Contributor:

    The method to turn the k_index and the metadata back into a ref_spec you can use in zarr/xarray:
    https://github.com/asascience-open/nextgen-dmac/blob/main/grib_index_aggregation/dynamic_zarr_store.py#L198
    I think @Anu-Ra-g is already working on adding it into kerchunk?

Member:

    Oh, in that case I suspect it works already, right @Anu-Ra-g: but you can only work on one set of horizons OR one set of timepoints, not both at once? Something like that.

Contributor:

    I think you can return an array with multiple dimensions.
    I didn't have a strong use for this, so I struggled to do something general and practical.
    For instance, if you request by horizon, you can provide multiple horizon axes, and your dimensions should include 'horizon' and 'valid_time'. Similarly, you can request multiple runtimes, and then your dimensions should include 'runtime' and 'step'.
    Honestly not sure if this is helpful or overcomplicated.

Contributor (Author):

    @martindurant I tried it out with one set of horizons with the original code. Actually, I'm still figuring out the reinflating part of the code, the aggregation types and the new indexes.

    I noticed that reinflating can also work with a grib_tree model made from a single GRIB file.
    @emfdavid, can you confirm this in this notebook that I made?

Utilizing this method can significantly reduce the time required to combine
references, cutting it down to a fraction of the previous duration. The original
idea was showcased in this `talk <https://discourse.pangeo.io/t/pangeo-showcase-optimizations-for-kerchunk-aggregation-and-zarr-i-o-at-scale-for-machine-learning/4074>`_.

*How is it faster?*

Every GRIB file stored on cloud platforms such as **AWS** and **GCP** is accompanied by a
corresponding ``.idx`` file. This file, otherwise known as an *index* file, contains the key
metadata of the messages in the GRIB file. These metadata include the `index`, `offset`, `datetime`,
`variable` and `forecast time` for each message stored in the file.
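
Because the ``.idx`` file is plain text with colon-separated fields, it can be parsed with pandas directly, and the byte range of each message follows from the gap between consecutive offsets. A minimal sketch (the inventory rows below are illustrative, not from a real file):

```python
import io

import pandas as pd

# Illustrative sample of the text .idx inventory format that accompanies
# NOAA GRIB2 files: message number, byte offset, reference date, variable,
# level, forecast step, trailing empty field.
sample_idx = """\
1:0:d=2024010100:PRMSL:mean sea level:anl:
2:990855:d=2024010100:TMP:surface:anl:
3:1234567:d=2024010100:UGRD:10 m above ground:anl:
"""

def parse_idx(text: str) -> pd.DataFrame:
    df = pd.read_csv(
        io.StringIO(text),
        sep=":",
        header=None,
        names=["msg", "offset", "date", "variable", "level", "forecast", "extra"],
    )
    # The byte length of each message is the gap to the next offset; the last
    # message runs to the end of the GRIB file, so its length is unknown here.
    df["length"] = df["offset"].shift(-1) - df["offset"]
    return df.drop(columns=["extra"])

df = parse_idx(sample_idx)
print(df[["msg", "variable", "offset", "length"]])
```

This is how the byte ranges for references can be recovered without ever opening the (much larger) GRIB file itself.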

**It follows a three-step approach:**

1. Extract and persist metadata directly from a few arbitrary GRIB
   files for a given product, such as HRRR SUBH, GEFS, GFS, etc.
2. Use the metadata mapping to build an index table of every GRIB
   message from the ``.idx`` files.
3. Combine the index data with the metadata to build any FMRC
   slice (Horizon, RunTime, ValidTime, BestAvailable).

.. tip::
To confirm the indexing of messages, see this `notebook <https://gist.github.com/Anu-Ra-g/efa01ad1c274c1bd1c14ee01666caa77>`_.

These metadata will be used to build a ``k_index`` for every GRIB message that we will be
indexing. The indexing process primarily involves the `pandas <https://pandas.pydata.org/>`_ library.
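
The join at the heart of step 3 can be sketched with pandas on toy data. The column names and values here are illustrative, not kerchunk's exact schema:

```python
import pandas as pd

# Step 1 (done once): metadata extracted from one arbitrary GRIB file of the
# product, keyed by the attribute string that also appears in .idx files.
mapping = pd.DataFrame(
    {
        "attrs": ["PRMSL:mean sea level", "TMP:surface"],
        "varname": ["prmsl", "t2m"],  # hypothetical variable names
        "level": ["mean sea level", "surface"],
    }
)

# Step 2 (done per file): byte ranges read from the cheap text .idx files of
# every other GRIB file in the series.
idx = pd.DataFrame(
    {
        "attrs": ["PRMSL:mean sea level", "TMP:surface"],
        "offset": [0, 990855],
        "length": [990855, 243712],
        "uri": ["s3://bucket/gefs.f000.grib2"] * 2,  # hypothetical path
    }
)

# Step 3: join on the shared attribute key to get a chunk index (k_index)
# without reading any GRIB payload.
k_index = idx.merge(mapping, on="attrs")
print(k_index[["varname", "uri", "offset", "length"]])
```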

.. note::
    The index in the ``.idx`` file indexes the GRIB messages, whereas the ``k_index`` (kerchunk index)
    we build as part of this workflow indexes the variables in those messages.

.. list-table:: k_index for a single GRIB file
:header-rows: 1

     - None



*What now?*

After creating the ``k_index`` for the desired duration, we will use the ``DataTree`` model
GRIB files produced from the **GEFS** model hosted in an AWS S3 bucket.
Attributes:
typeOfLevel: nominalTop


.. tip::
    For a full tutorial on this workflow, refer to this `kerchunk cookbook <https://projectpythia.org/kerchunk-cookbook/README.html>`_
    in `Project Pythia <https://projectpythia.org/>`_.