Replies: 5 comments 2 replies
-
The three most important points from my experience so far:
Memory usage is for me the main one, as unstable memory behaviour is really unacceptable when creating "production-ready" applications with HydroMT. If we are stricter with chunking and apply some metadata reading to the data-catalog items, we can ensure that the memory used for loading datasets stays under a certain threshold. This does mean that we cannot support "tiny" machines running HydroMT, as we do not want to duplicate datasets at different levels of chunking. It would be good to solve these issues and to specify hardware requirements matching the recommended chunk sizes of common datasets; due to its nature, HydroMT will always be memory intensive.
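As a minimal sketch of what stricter chunking could look like in practice (the file name, variable, and chunk sizes are illustrative, not HydroMT defaults), opening data with explicit chunks bounds the memory needed per block:

```python
import xarray as xr

# A minimal sketch of memory-bounded loading (file name, variable, and
# chunk sizes are illustrative, not HydroMT defaults). Opening with
# explicit chunks keeps each in-memory block at a predictable size, so
# peak memory scales with the chunk size rather than the dataset size.
ds = xr.open_dataset("precip.nc", chunks={"time": 100, "lat": 512, "lon": 512})

# Operations stay lazy until compute() is called, so this reduction
# streams through the data chunk by chunk.
monthly_mean = ds["precip"].resample(time="1MS").mean()
result = monthly_mean.compute()
```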
There is a lot of code that assumes files. The fsspec "file-like" API is often imperfect: different methods give different results, or are not implemented at all. Treating data as streams is indeed the way to go, since a file can be streamed efficiently anyway.
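For illustration, a sketch of stream-based reading with fsspec (the URL and block size are assumptions, not real endpoints):

```python
import fsspec

# A sketch of treating data as a stream (URL and block size are
# illustrative): the object is opened lazily and consumed in fixed-size
# blocks, so the full file is never materialised on disk or in memory.
with fsspec.open("https://example.com/data/forcing.nc", mode="rb") as f:
    while True:
        block = f.read(16 * 1024 * 1024)  # 16 MiB per read
        if not block:
            break
        # ... hand each block to an incremental consumer here ...
```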
-
@JoostBuitink @dalmijn @kathrynroscoe @Leynse @roeldegoede @xldeltares @sibrenloos @MPWeeber Please see the discussion above on "What do we need to move to V1 for HydroMT core". I would appreciate it if you could take a look and provide your comments and suggestions. Thank you in advance!
-
I think for me, what matters most with V1 is what @deltamarnix also suggests:
**Command line users**

*Command line interface*

*Configuration and data catalog file*

For the data_catalog file, we have made a lot of changes (like the variants) that should be implemented and well documented. With v1, we should aim for the format and the expected data properties in the catalog file to be more or less fixed. See below for the issues that would help achieve this. For the CLI (build/update) configuration file, we had a recent suggestion from @deltamarnix in #669 to use only one configuration file per model and, for updating, to change that original file and have hydromt detect the changes compared to the previous build. If we decide to implement this, I think it should be in v1, as it would slightly change how hydromt is used.

*Releasing and installation*

*Documentation*

*Memory usage and performance*

Not breaking while hydromt is running is more important. To my knowledge, there are two main points where memory errors do happen: catchment delineation based on DEM data, where we do not really have a choice, and preparing PET data (#618 and #32). It would be great if we could fix the latter for v1; I would leave the rest for after.

**Python users and plugin developers**

*Model API and generic workflows*

For the generic workflows, I think it would be nice to have some examples to check that everything we want works correctly. We already have workflows for grid and mesh. I wonder if we could work on an extra one for forcing / states objects for which the schematization is not fixed (e.g. add_grid_forcing_from_rasterdataset or add_vector_forcing_from_geodataset), so that it is clear in the future what part of the function will end up in the generic workflow (see the hypothetical sketch at the end of this section).

*Data Catalog*

Another thing I would like for the data catalog is to finalize functionalities like data dtype support, geometry type support for GeoDataFrame and GeoDataset, better slicing, and nodata handling (see #665, #204 and #97). A further issue is a better version of data name and unit harmonization (#45), so that when users prepare a data catalog, the expected names and units are clearer. And finally, finalize the naming of the data catalog properties (#180).

*Workflows and methods*

*Plugin entry points*

*Link with the plugins*
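To make the generic forcing workflow idea above more concrete, here is a minimal sketch. The function name comes from the suggestion in this comment; it does not exist in HydroMT, and the split between the generic workflow and the model method is only an assumption:

```python
import xarray as xr

# Hypothetical generic workflow: the generic part only regrids a raster
# time series onto a grid; writing the result into the model's forcing
# object would remain the job of the (plugin) model method.
def grid_forcing_from_rasterdataset(
    da: xr.DataArray, grid_like: xr.Dataset
) -> xr.DataArray:
    """Resample a forcing time series onto the model grid."""
    return da.interp_like(grid_like, method="linear")
```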
**Maintainability and testing of hydromt**

*Deprecations*

*Better CI Testing*

I agree, but I think this is more of a nice-to-have for v1. For me, stability is important, as is having tests that ensure (new) functionality keeps working the way it should. It is not easy to write good and meaningful tests, but we are getting better. We could still have a last go at improving testing coverage for v1.
-
**Stable API**

I agree with @hboisgon and @deltamarnix that the top priority for v1 is a stable API for the CLI, the public Python API, and the plugin entry points. While the CLI is relatively small and stable, the Python API is large and changes still occur. The first task should be to define what should be part of the public vs private API, where we guarantee stability for the public API (see the sketch at the end of this comment for how top-level exports could make this split explicit). The public API should include (these classes should also be available at the highest level when importing the package):
Other submodules like the …

**Documentation**

We should put the Model class at the center of the documentation. This will help users better understand what HydroMT is about.

**Data**

The data components are currently the most complex part of HydroMT. This could be simplified by differentiating between the … and a …

**Model**

The main purpose of the …

**Testing**

It would be really good to increase coverage to 95% and to make sure that the public API in particular is carefully tested for different use cases.

**Releases and ownership**

@savente93 made some really good comments about this. It would be good to document our release cycle and what we consider to be core vs plugin responsibilities.
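As a sketch of how that public/private split could be made explicit (the module paths and the selection of classes are illustrative, based only on names discussed in this thread), the top-level `__init__.py` could re-export exactly the public classes:

```python
# hydromt/__init__.py (sketch): everything re-exported here and listed
# in __all__ is public and covered by stability guarantees; all other
# submodules are considered private implementation details.
from hydromt.data_catalog import DataCatalog
from hydromt.models import Model

__all__ = ["DataCatalog", "Model"]
```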
-
**Needed for v1**

Goal: the top priority for v1 is a stable API for the CLI, the public Python API, and the plugin entry points.

*API*
*Documentation*

*Model*

*Data catalog and adapters*

*Robustness*

*Maintenance*

*Workflows*

*Releases and ownership*
**Nice to have for v1**

*Model*
*Data catalog*

*Robustness*

*Maintenance*
**After v1**

*Maintenance*
*Workflows*

*Memory usage and performance*
-
**Documentation**
Feedback from a recent user survey identified the documentation as HydroMT's weakest aspect. We've also added more functionality to the CLI that doesn't fit into the build/update/clip structure of the current documentation (namely the check and export commands). Therefore, the documentation should at least be restructured to give these proper attention.
I also think the documentation lacks segmentation for users of different skill levels. It would benefit from a restructuring into something like: lay person, basic user, advanced user, developer (not necessarily in that order or grouping).
There are also some technical considerations that would be good to address before moving to V1 to make maintenance easier for the developers. For example, the documentation is currently the most significant contributor to our CI time, checkout time, and repo/history size, and it is (in my opinion) one of the more cumbersome parts of the repository to work with. While I haven't articulated concrete recommendations for the documentation just yet, I believe some work here would be beneficial.
**Testing, Releases & Ownership**
HydroMT's functionality is closely tied to its plugins, which raises questions about compatibility and responsibility for updates. Key considerations include:
Past releases have encountered issues, such as problems with pip and Conda releases. To avoid this, a release testing system should be implemented to catch problems before a release is (partially) completed. HydroMT currently has several release avenues: PyPI, Conda-forge, Deltares-forge, and Docker. All of these should ideally be tested before release. Conda-forge and Deltares-forge currently lack a good testing mechanism, although tools like grayskull could catch major issues for Conda-forge. It is currently not clear what forms of automated testing and releasing are available for Deltares-forge.
HydroMT has good test coverage, but test isolation, performance, and modularity could be improved. The current release cycle is a minor version every quarter, with additional releases upon request. A streamlined approach to releasing could facilitate an increased frequency. The proposed future release cycle includes a minor release every quarter, a new patch version every month, and major releases only when absolutely necessary. For support, it's proposed to offer two quarters of active support and one year of security support for each minor release.
**Memory & performance characteristics**
HydroMT's current code base focuses primarily on functionality, with less emphasis on optimisation; this results in acceptable performance in most cases but instability in others. For instance, preparing forcing can increase memory usage significantly, with De Bruin forcing known to spike up to 12 GB regardless of chunk size. To be considered production-ready, HydroMT's memory usage needs to be stable and configurable, even if that costs some time efficiency.
We would need to provide a baseline benchmark to profile system runtime and resource usage, and to run regression tests against. Ideally, we'd create benchmark suites for both local and cloud-based data for performance comparison. However, the potential egress costs associated with the cloud may make the latter unfeasible.
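A memory regression test could then assert a peak-allocation budget. A minimal sketch, where the 2 GiB budget and the `build_forcing` workload are placeholders rather than HydroMT APIs, and noting that tracemalloc only sees Python-level allocations:

```python
import tracemalloc

def build_forcing():
    # placeholder workload; a real suite would call a HydroMT setup step
    return [bytes(1024) for _ in range(1000)]

def test_forcing_memory_budget():
    tracemalloc.start()
    build_forcing()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    # fail if peak traced memory exceeds the agreed budget
    assert peak < 2 * 1024**3, f"peak memory {peak / 1e6:.1f} MB over budget"
```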
I can think of several options for improving performance consistency:
- Making better use of preprocessing and cloud-optimised formats.
- Providing more user-accessible, performance-minded options to ensure HydroMT is suitable for production use.
- More explicit use of the lazy data loading and delayed computation available in Dask. This could improve scalability and performance and allow for internal optimisations like better caching (sketched below).
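For illustration, a runnable toy sketch of delayed computation (the functions are stand-ins, not HydroMT internals): work is recorded lazily and only executed on `dask.compute`, which lets the scheduler stream tasks instead of holding every intermediate in memory.

```python
import dask

@dask.delayed
def load_chunk(i):
    # stand-in for lazily reading one chunk of a dataset
    return list(range(i * 1000, (i + 1) * 1000))

@dask.delayed
def chunk_mean(chunk):
    return sum(chunk) / len(chunk)

# nothing has executed yet; the task graph runs only on compute()
means = [chunk_mean(load_chunk(i)) for i in range(4)]
print(dask.compute(*means))
```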
While performance improvements are likely to positively affect both runtime and resource usage, I think we should give priority to memory consumption to prevent program crashes. Performance and memory characteristics will probably be the most complex work needed for moving to v1, so I think it would be best to leave this for last.
**FS & Cloud interactions**
HydroMT currently assumes all data is file-based (e.g. the DataAdapter requires a path), leading to several issues. Differences in handling POSIX and Windows paths have caused confusion, and the lack of consistent file operations for cloud storage has created friction in the ongoing cloud pilot. Given these challenges, a centralised approach is necessary to handle cross-platform requirements and cloud-native operations.
Currently, HydroMT makes minimal use of the lazy loading and read-ahead metadata available through Dask and xarray. To address this, I think a centralised solution is necessary (a sketch follows below). HydroMT's inability to consume data from streaming APIs also needs to be addressed, given the potential data volume requirements.
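As a sketch of such a centralised solution (the `resolve` helper is hypothetical, not an existing HydroMT function), a single entry point could map any local path, Windows path, or URI to an fsspec filesystem plus a normalised path, so no other code needs to branch on path style:

```python
from fsspec.core import url_to_fs

def resolve(uri: str):
    """Hypothetical central resolver: local paths, s3:// or https:// URIs
    all come back as a (filesystem, normalised path) pair."""
    fs, path = url_to_fs(uri)
    return fs, path

# local and remote locations then share one code path downstream
fs, path = resolve("data/dem.tif")
print(type(fs).__name__, path)
```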
Implementing these changes will involve significant work and affect most of the code base. I propose the following steps:
**External API**
The external API, which will be crucial to the design of V1, needs to be flexible enough to meet users' needs for a considerable time. The proposed components of the external API include:
We should consider anything not included in these categories to be core-internal. By limiting the external surface area, the core team can maintain flexibility for internal operations while providing a concise API for users. This will also be necessary for the core team to structure the code understandably.
An excellent discussion about how to refactor the Model class is already taking place on GitHub, so I have not discussed it here.
**Conclusion**
To summarise the main points discussed:
- The code base should be restructured to expose a more succinct API, and to centralise and make better use of things like Dask and IO.
- The documentation should be restructured to better serve the different types of users and to reflect recent developments.
- More attention should be paid to ensuring consistent performance.
- More concrete agreements are needed regarding the maintenance of plugins.
- Collaboration between the plugins could be made easier with additional testing and release automation.