Replies: 5 comments 2 replies
-
The three most important points from my experience so far:
Memory usage is for me the main one, as unstable memory behaviour is really unacceptable when creating "production-ready" applications with HydroMT. If we are stricter with chunking and apply some metadata reading to the data-catalog items, we can ensure that the memory used for loading datasets stays under a certain threshold. This does mean that we cannot support "tiny" machines running HydroMT, as we do not want to duplicate datasets at different levels of chunking. It would be good to solve these issues and to specify hardware requirements matching the recommended chunk sizes of common datasets; due to its nature, HydroMT will always be memory intensive.
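As a minimal sketch of what stricter chunking could look like in practice (the file name, variable, and chunk sizes are illustrative, not HydroMT defaults), opening data with explicit chunks bounds the memory needed per block:

```python
import xarray as xr

# A minimal sketch of memory-bounded loading (file name, variable, and
# chunk sizes are illustrative, not HydroMT defaults). Opening with
# explicit chunks keeps each in-memory block at a predictable size, so
# peak memory scales with the chunk size rather than the dataset size.
ds = xr.open_dataset("precip.nc", chunks={"time": 100, "lat": 512, "lon": 512})

# Operations stay lazy until compute() is called, so this reduction
# streams through the data chunk by chunk.
monthly_mean = ds["precip"].resample(time="1MS").mean()
result = monthly_mean.compute()
```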
There is a lot of code that assumes files. The fsspec "file-like" API is often imperfect: different methods give different results, or are not implemented at all. Treating data as streams is indeed the way to go, since a file can be streamed efficiently anyway.
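For illustration, a sketch of stream-based reading with fsspec (the URL and block size are assumptions, not real endpoints):

```python
import fsspec

# A sketch of treating data as a stream (URL and block size are
# illustrative): the object is opened lazily and consumed in fixed-size
# blocks, so the full file is never materialised on disk or in memory.
with fsspec.open("https://example.com/data/forcing.nc", mode="rb") as f:
    while True:
        block = f.read(16 * 1024 * 1024)  # 16 MiB per read
        if not block:
            break
        # ... hand each block to an incremental consumer here ...
```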
-
@JoostBuitink @dalmijn @kathrynroscoe @Leynse @roeldegoede @xldeltares @sibrenloos @MPWeeber Please see the discussion above on "What do we need to move to V1 for HydroMT core". I would appreciate it if you could take a look and provide your comments and suggestions. Thank you in advance!
-
I think for me, what matters most with V1 is what @deltamarnix also suggests:
**Command line users**

*Command line interface*

*Configuration and data catalog file*

For the data_catalog file, we have made a lot of changes (like the variants) that should be implemented and well documented. With v1, we should aim for the format and the expected data properties in the catalog file to be more or less fixed. See below for the issues that would help achieve this. For the CLI (build/update) configuration file, we had a recent suggestion from @deltamarnix in #669 to use only one configuration file per model and, for updating, to change that original file and have hydromt detect the changes compared to the previous build. If we decide to implement this, I think it should be in v1, as it would slightly change how hydromt is used.

*Releasing and installation*

*Documentation*

*Memory usage and performance*

Not breaking while hydromt is running is more important. To my knowledge, there are two main points where memory errors do happen: catchment delineation based on DEM data, where we do not really have a choice, and preparing PET data (#618 and #32). It would be great if we could fix the latter for v1; I would leave the rest for after.

**Python users and plugin developers**

*Model API and generic workflows*

For the generic workflows, I think it would be nice to have some examples to check that everything we want works correctly. We already have workflows for grid and mesh. I wonder if we could work on an extra one for forcing / states objects for which the schematization is not fixed (e.g. add_grid_forcing_from_rasterdataset or add_vector_forcing_from_geodataset), so that it is clear in the future what part of the function will end up in the generic workflow (see the hypothetical sketch at the end of this section).

*Data Catalog*

Another thing I would like for the data catalog is to finalize functionalities like data dtype support, geometry type support for GeoDataFrame and GeoDataset, better slicing, and nodata handling (see #665, #204 and #97). A further issue is a better version of data name and unit harmonization (#45), so that when users prepare a data catalog, the expected names and units are clearer. And finally, finalize the naming of the data catalog properties (#180).

*Workflows and methods*

*Plugin entry points*

*Link with the plugins*
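To make the generic forcing workflow idea above more concrete, here is a minimal sketch. The function name comes from the suggestion in this comment; it does not exist in HydroMT, and the split between the generic workflow and the model method is only an assumption:

```python
import xarray as xr

# Hypothetical generic workflow: the generic part only regrids a raster
# time series onto a grid; writing the result into the model's forcing
# object would remain the job of the (plugin) model method.
def grid_forcing_from_rasterdataset(
    da: xr.DataArray, grid_like: xr.Dataset
) -> xr.DataArray:
    """Resample a forcing time series onto the model grid."""
    return da.interp_like(grid_like, method="linear")
```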
**Maintainability and testing of hydromt**

*Deprecations*

*Better CI Testing*

I agree, but I think this is more of a nice-to-have for v1. For me, stability is important, as is having tests that ensure (new) functionality keeps working the way it should. It is not easy to write good and meaningful tests, but we are getting better. We could still have a last go at improving testing coverage for v1.
-
**Stable API**

I agree with @hboisgon and @deltamarnix that the top priority for v1 is a stable API for the CLI, the public Python API, and the plugin entry points. While the CLI is relatively small and stable, the Python API is large and changes still occur. The first task should be to define what should be part of the public vs private API, where we guarantee stability for the public API (see the sketch at the end of this comment for how top-level exports could make this split explicit). The public API should include (these classes should also be available at the highest level when importing the package):
Other submodules like the …

**Documentation**

We should put the Model class at the center of the documentation. This will help users better understand what HydroMT is about.

**Data**

The data components are currently the most complex part of HydroMT. This could be simplified by differentiating between the … and a …

**Model**

The main purpose of the …

**Testing**

It would be really good to increase coverage to 95% and to make sure that the public API in particular is carefully tested for different use cases.

**Releases and ownership**

@savente93 made some really good comments about this. It would be good to document our release cycle and what we consider to be core vs plugin responsibilities.
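As a sketch of how that public/private split could be made explicit (the module paths and the selection of classes are illustrative, based only on names discussed in this thread), the top-level `__init__.py` could re-export exactly the public classes:

```python
# hydromt/__init__.py (sketch): everything re-exported here and listed
# in __all__ is public and covered by stability guarantees; all other
# submodules are considered private implementation details.
from hydromt.data_catalog import DataCatalog
from hydromt.models import Model

__all__ = ["DataCatalog", "Model"]
```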
-
**Needed for v1**

Goal: the top priority for v1 is a stable API for the CLI, the public Python API, and the plugin entry points.

*API*
*Documentation*

*Model*

*Data catalog and adapters*

*Robustness*

*Maintenance*

*Workflows*

*Releases and ownership*
**Nice to have for v1**

*Model*
*Data catalog*

*Robustness*

*Maintenance*
**After v1**

*Maintenance*
*Workflows*

*Memory usage and performance*
-
**Documentation**
Feedback from a recent user survey identified the documentation as HydroMT's weakest aspect. We've also added more functionality to the CLI that doesn't fit into the build/update/clip structure of the current documentation (namely the check and export commands). Therefore, the documentation should at least be restructured to give these proper attention.
I also think the documentation lacks segmentation for users of different skill levels. It would benefit from a restructuring into something like: lay person, basic user, advanced user, developer (not necessarily in that order or grouping).
There are also some technical considerations that would be good to address before moving to V1 to make maintenance easier for the developers. For example, the documentation is currently the most significant contributor to our CI time, checkout time, and repo/history size, and it is (in my opinion) one of the more cumbersome parts of the repository to work with. While I haven't articulated concrete recommendations for the documentation just yet, I believe some work here would be beneficial.
**Testing, Releases & Ownership**
HydroMT's functionality is closely tied to its plugins, which raises questions about compatibility and responsibility for updates. Key considerations include:
Past releases have encountered issues, such as problems with pip and Conda releases. To avoid this, a release testing system should be implemented to catch problems before a release is (partially) completed. HydroMT currently has several release avenues: PyPI, Conda-forge, Deltares-forge, and Docker. All of these should ideally be tested before release. Conda-forge and Deltares-forge currently lack a good testing mechanism, although tools like grayskull could catch major issues for Conda-forge. It is currently not clear what forms of automated testing and releasing are available for Deltares-forge.
HydroMT has good test coverage, but test isolation, performance, and modularity could be improved. The current release cycle is a minor version every quarter, with additional releases upon request. A streamlined approach to releasing could facilitate an increased frequency. The proposed future release cycle includes a minor release every quarter, a new patch version every month, and major releases only when absolutely necessary. For support, it's proposed to offer two quarters of active support and one year of security support for each minor release.
**Memory & performance characteristics**
HydroMT's current code base focuses primarily on functionality, with less emphasis on optimisation; this results in acceptable performance in most cases but instability in others. For instance, preparing forcing can increase memory usage significantly, with De Bruin forcing known to spike up to 12 GB regardless of chunk size. To be considered production-ready, HydroMT's memory usage needs to be stable and configurable, even if that costs some time efficiency.
We would need to provide a baseline benchmark to profile system runtime and resource usage, and to run regression tests against. Ideally, we'd create benchmark suites for both local and cloud-based data for performance comparison. However, the potential egress costs associated with the cloud may make the latter unfeasible.
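A memory regression test could then assert a peak-allocation budget. A minimal sketch, where the 2 GiB budget and the `build_forcing` workload are placeholders rather than HydroMT APIs, and noting that tracemalloc only sees Python-level allocations:

```python
import tracemalloc

def build_forcing():
    # placeholder workload; a real suite would call a HydroMT setup step
    return [bytes(1024) for _ in range(1000)]

def test_forcing_memory_budget():
    tracemalloc.start()
    build_forcing()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    # fail if peak traced memory exceeds the agreed budget
    assert peak < 2 * 1024**3, f"peak memory {peak / 1e6:.1f} MB over budget"
```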
I can think of several options for improving performance consistency:
- Making better use of preprocessing and cloud-optimised formats.
- Providing more user-accessible, performance-minded options to ensure HydroMT is suitable for production use.
- More explicit use of the lazy data loading and delayed computation available in Dask. This could improve scalability and performance and allow for internal optimisations like better caching (sketched below).
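For illustration, a runnable toy sketch of delayed computation (the functions are stand-ins, not HydroMT internals): work is recorded lazily and only executed on `dask.compute`, which lets the scheduler stream tasks instead of holding every intermediate in memory.

```python
import dask

@dask.delayed
def load_chunk(i):
    # stand-in for lazily reading one chunk of a dataset
    return list(range(i * 1000, (i + 1) * 1000))

@dask.delayed
def chunk_mean(chunk):
    return sum(chunk) / len(chunk)

# nothing has executed yet; the task graph runs only on compute()
means = [chunk_mean(load_chunk(i)) for i in range(4)]
print(dask.compute(*means))
```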
While performance improvements are likely to positively affect both runtime and resource usage, I think we should give priority to memory consumption to prevent program crashes. Performance and memory characteristics will probably be the most complex work needed for moving to v1, so I think it would be best to leave this for last.
**FS & Cloud interactions**
HydroMT currently assumes all data is file-based (e.g. the DataAdapter requires a path), leading to several issues. Differences in handling POSIX and Windows paths have caused confusion, and the lack of consistent file operations for cloud storage has created friction in the ongoing cloud pilot. Given these challenges, a centralised approach is necessary to handle cross-platform requirements and cloud-native operations.
Currently, HydroMT makes minimal use of the lazy loading and read-ahead metadata available through Dask and xarray. To address this, I think a centralised solution is necessary (a sketch follows below). HydroMT's inability to consume data from streaming APIs also needs to be addressed, given the potential data volume requirements.
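As a sketch of such a centralised solution (the `resolve` helper is hypothetical, not an existing HydroMT function), a single entry point could map any local path, Windows path, or URI to an fsspec filesystem plus a normalised path, so no other code needs to branch on path style:

```python
from fsspec.core import url_to_fs

def resolve(uri: str):
    """Hypothetical central resolver: local paths, s3:// or https:// URIs
    all come back as a (filesystem, normalised path) pair."""
    fs, path = url_to_fs(uri)
    return fs, path

# local and remote locations then share one code path downstream
fs, path = resolve("data/dem.tif")
print(type(fs).__name__, path)
```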
Implementing these changes will involve significant work and affect most of the code base. I propose the following steps:
**External API**
The external API, which will be crucial to the design of V1, needs to be flexible enough to meet users' needs for a considerable time. The proposed components of the external API include:
We should consider anything not included in these categories to be core-internal. By limiting the external surface area, the core team can maintain flexibility for internal operations while providing a concise API for users. This will also be necessary for the core team to structure the code understandably.
An excellent discussion about how to refactor the Model class is already taking place on GitHub, so I have not discussed it here.
**Conclusion**
To summarise the main points discussed:
- The code base should be restructured to expose a more succinct API, and to centralise and make better use of things like Dask and IO.
- The documentation should be restructured to better serve the different types of users and to reflect recent developments.
- More attention should be paid to ensuring consistent performance.
- More concrete agreements are needed regarding the maintenance of plugins.
- Collaboration between the plugins could be made easier with additional testing and release automation.