Run SCREAM processes in standalone mode for science research #2845

PeterCaldwell · 2024-05-23T17:04:39Z

PeterCaldwell
May 23, 2024
Maintainer

The ability to run a single SCREAM process with user-defined inputs and use its outputs for evaluation would be very useful.

As an example, @hassanbeydoun currently wants to run RRTMGP in standalone mode with and without our moisture profile corrected to get rid of our dry bias at mid-levels and moist bias at high levels. I've also benefited greatly from running microphysics iteratively within a bespoke/idealized column driver (as advocated by Adrian Hill's KiD driver). This allows us to deeply investigate whether microphysics is working correctly. One could imagine a similar functionality for SHOC (or any other process).

One (easy?) pathway to this would be to just run an existing process test with the input hacked. Does anyone have a good sense of what would be involved in this?

A more awesome solution would be to develop a strategy for calling a process from python since that's the language everyone will be using to evaluate this output. @mahf708 envisions creating a python package that links to pre-compiled C++. EAMxx staff would create this package monthly or so so folks could just pip-install it rather than having to muck about themselves.

As a 3rd solution, Aaron suggested a uniform driver (in C++?) where a user could specify which process they want to run (at run time?). A challenge is that passing input vars to the proc might get confusing since they would be different for each proc and there could be a lot of them.

Open questions:

Which strategy is most workable?
Will we just expose the main processes, or also low-level functions? It would be nice to be able to use our saturation mixing ratio function for analysis, for example. More generally, it would be nice if a user could (for example) run the P3 process, find an issue with a piece (say, accretion), then run accretion by itself, figure out the error was in a sub-function of accretion, and run that by itself. But exposing everything might be a lot of work.

bartgol · 2024-05-23T19:04:53Z

bartgol
May 23, 2024
Maintainer

I'm definitely against providing wrappers for each individual function. If you go there, you open a can of worms that nobody (at least, not me) wants to develop/maintain.

I am not sure I understand the "uniform driver" solution.

As for the other approaches, using existing single-process tests should probably be the way to go. Users can still call the standalone executables from python. They just need to provide the input.yaml file, but that can be done by reading in a sample one, modifying the few params they want to modify, and dumping it again. Loading a yaml file produces a python dict, so navigating the yaml tree is quick, once one knows the main structure.

I would try to stay away from calling processes from python via c-py bindings. EAMxx has a non-trivial initialization sequence, as well as a non-trivial infrastructure for running a process. I am a bit wary of giving the user the ability to call functions at will. My fear is that things won't work, and we'll spend lots of time debugging these wrappers. Since I think performance is not key here, I think running standalone execs via subprocess pipes is plenty good.
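For readers who want to try the workflow described in this comment, here is a minimal sketch of it: load a sample input.yaml, override a few parameters, write the file back, and launch the standalone executable with subprocess. It assumes PyYAML is installed. The helper names (`set_nested`, `run_with_overrides`), the dotted-key convention, and the command line are all hypothetical; the actual executable name and the way it is told where to find its input file have to come from the test you are reusing.

```python
import subprocess
from pathlib import Path


def set_nested(tree, dotted_key, value):
    """Set tree['a']['b']['c'] = value for dotted_key 'a.b.c' (parent keys must exist)."""
    *parents, leaf = dotted_key.split(".")
    node = tree
    for key in parents:
        node = node[key]  # a KeyError here means the sample file is structured differently
    node[leaf] = value
    return tree


def run_with_overrides(cmd, sample_yaml, out_yaml, overrides):
    """Write a modified copy of sample_yaml to out_yaml, run cmd, and return its stdout.

    cmd is the full command as a list of strings; it is supplied by the caller
    because the executable name and its arguments are not specified here.
    """
    import yaml  # PyYAML (third-party), only needed when files are actually read/written

    params = yaml.safe_load(Path(sample_yaml).read_text())
    for dotted_key, value in overrides.items():
        set_nested(params, dotted_key, value)
    Path(out_yaml).write_text(yaml.safe_dump(params, sort_keys=False))

    # check=True raises CalledProcessError if the executable exits with an error
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

A call would look like `run_with_overrides(["./some_standalone_test"], "input_sample.yaml", "input.yaml", {"some_section.some_param": 2.0})`, where every name in that call is a placeholder.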

0 replies

mahf708 · 2024-05-27T18:28:47Z

mahf708
May 27, 2024
Collaborator

I did an initial exploration of the python package idea. I will first describe what I tried doing, and then I will go over the challenges and potential solutions.

What I tried:
The idea of using a python package hinges on creating an importable executable in a format python understands such that import some_exe will work in python. To that end, I wanted to start very small.

1. I used nanobind and pybind11 to create a very simple function and exposed it as a python package in an isolated environment. Both worked. Fwiw, pybind11 is the more established of the two, but nanobind is supposed to be much slimmer, faster, more modern, etc. This step was done in complete isolation from SCREAM (i.e., in a separate, completely independent setup; this is important, as will become evident below). In practice, I wrote a cpp file and a CMakeLists.txt file.
2. I took the simple package above and inserted it in the diagnostics subdir under an "ext" subdir, keeping the cpp file as is, but taking the content of CMakeLists.txt and putting it in the nearby SCREAM CMakeLists.txt. When I did this, nanobind no longer worked for me, but pybind11 still worked. The catch with nanobind appears to be related to very strict cmake logic, and the error was the build system's inability to find the nanobind headers. Anyway, pybind11 worked completely fine. Mind you, the cpp file wasn't changed; this is important, as will become clear in the next step.
3. I started adding more tools (meaning headers, etc.) from SCREAM into the cpp file to expose them and use them in the package (but without changing the simple function). This was done for testing before actually using exposed headers for anything substantial (my guess would be that, if they compile, they will "just" work eventually).
   - I quickly realized that pybind11 required dynamic linking, and so I had to start manually turning on dynamic linking for each subcomponent I was adding. (In SCREAM, we strictly use static linking as far as I understand.) No big deal, but things started becoming invasive (so a future isolated, and potentially redundant, ext subdir may become pretty annoying to maintain).
   - Things seemed to work until they stopped: dynamically linking peripheral components (like diagnostics, scream-share, etc.) seemed fine, but I hit stubborn linking errors with deeper parts of ekat and/or kokkoscore, at which point I knew I couldn't continue debugging without involving either @bartgol or @jgfouca or both.
   - Of note, the error with the deeper parts wasn't in the compilation (which worked fine) but rather in the import xyz stage, with a cryptic missing symbol. I resolved some of these errors, but then I hit another roadblock with how to deal with MPI (MPI is almost always involved in some capacity wherever one looks in SCREAM, and so we need to carefully init this stuff to avoid errors like void ekat::Comm::check_mpi_inited() const: Assertion 'flag!=0' failed).

What I concluded (for now):
While this seems doable with some extra effort, the effort may be such that we cannot justify it. It will take diverting precious resources to this effort. I think we can come up with a strategy whereby we remake the SCREAM build system from scratch to accommodate dynamic linking and make the process easier for a potential python package, but again, this may become unwieldy very quickly. I think Luca's assessment above is more pessimistic than mine, but I now generally agree about the effort needed and potential maintenance.

Looking ahead:
I think potentially having some (not necessarily all) SCREAM internals "python packageable" will unlock a lot of interesting experimentation. For one, it will enable the point in the main post here by @PeterCaldwell. For another, and maybe for our "next pivot," this can potentially maximize SCREAM usability and interoperability --- the latter is especially important for applications like machine learning and making use of interesting/innovative interfaces (say a process written in rust or jax-compile-able or numba-optimized or triton-compile-able or whatever). Anyway, before we commit resources to this, we will want to have a longer discussion. I think hearing @mt5555's opinion will be super insightful as well :)

Something I'd like to look into soon-ish:
pykokkos

0 replies

PeterCaldwell · 2024-05-28T15:18:40Z

PeterCaldwell
May 28, 2024
Maintainer Author

This is awesome, thanks Naser! If I understand correctly, my "python is written in C and our code is C++ so it should be easy to bridge from one to the other" thinking is wrong because of the complexities our code needs to run on a supercomputer (e.g. MPI, possibly Kokkos, initialization, etc).

My feeling is that if we need to modify EAMxx source code in order to get this functionality working, we should abandon the effort... maintaining 2 similar but not identical code bases is not an efficient use of our time for enabling a "bonus" capability. I'm curious whether any of you disagree or have a more nuanced view.

Luckily, I asked @hassanbeydoun and @AaronDonahue to explore the much simpler approach of hijacking an existing standalone test to run a process with user-defined input.

1 reply

mahf708 May 28, 2024
Collaborator

Yes 100% on all fronts. I think that's what we should do for the time being.

However, I wouldn't give up completely on the idea of packaging some of our c++ code into python. I will keep working on this every now and then on my own to see how far I can get. I personally intuit that the overhead costs of running our model from python (as opposed to running an executable directly) aren't actually that huge. So, we may improve user portability (much like we have robust machine portability) while maintaining perf down the line. It will take some work though... This is something I am quite passionate about, so I am happy to spend more time on it on my own exploring :)

bartgol · 2024-05-28T18:21:31Z

bartgol
May 28, 2024
Maintainer

@mahf708 , some quick thoughts:

EAMxx does not assume static linking. It's just the CMake default. If you want to build shared libs, you need to configure with -DBUILD_SHARED_LIBS:BOOL=ON. That said, I've never tried to do it in EAMxx (even though I've done it plenty with other libs), so I can't tell if there are any gotchas in any of the TPLs. I'm not sure if that's what you did or if you hacked the compiler flags directly.
I had the impression that the goal was to call EAMxx from python, while from what you wrote it seems you are trying to call python from EAMxx. The latter is already done in the ML physics package, so it does work. As for the former, I do believe we can make it work, but it would require writing a py library for each eamxx library we want to use, with all the wrappers for initializing the needed stuff. It would be relatively simple to wrap the physics packages, but if you want python to access internals (like fields, grids, ...), as I assume you do, then there is a lot of python wrapping to do.
pykokkos is well supported, as I understand, so using it should be fine. But I am not sure you need it. Imho, the memory management and parallelization can still be kept in cpp, with python only used to access data for ease of use.

0 replies

mahf708 · 2024-05-28T19:27:49Z

mahf708
May 28, 2024
Collaborator

Oh I thought it was some SCREAM policy at higher-up dirs... so we could potentially try dynamic linking later to see how things turn out. Btw, I did this for dynamic linking:

  set_target_properties(diagnostics PROPERTIES POSITION_INDEPENDENT_CODE TRUE)

No, this was calling cpp from python:
First, I added pybind11_add_module(exn_ext exn_ext.cpp) to CMakeLists.txt
If I have a very plain cpp file with exn_ext in it, I compile scream standalone stuff as-is, and in the directory of the edited CMakeLists.txt, I will find something like exn_ext.so<BLAHBLAH> where BLAHBLAH is for the python linking and LIBEXT, etc. Then, in python, in that directory, I can do:

import exn_ext
exn_ext.add(1,3) # prints 7 if add is defined as add(i,j) --> i+j+j
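As a side note on the file name mentioned above: the ABI-tagged suffix a given interpreter expects for compiled modules can be queried from the standard library, which helps when checking that the build produced a file the running Python can import. This is only a sketch; `expected_extension_filename` is a hypothetical helper name, and `exn_ext` is simply the module name used in this comment.

```python
import importlib.machinery
import sysconfig


def expected_extension_filename(module_name):
    """Return the ABI-tagged file name this interpreter uses for a compiled module."""
    return module_name + sysconfig.get_config_var("EXT_SUFFIX")


# File name the build should produce for the running interpreter
print(expected_extension_filename("exn_ext"))

# All suffixes this interpreter accepts when importing extension modules
print(importlib.machinery.EXTENSION_SUFFIXES)
```

To import the module from somewhere other than the build directory, that directory has to be on `sys.path` (for example via `sys.path.insert(0, build_dir)` or the `PYTHONPATH` environment variable).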

5 replies

mahf708 May 28, 2024
Collaborator

^ that's the essence of packaging c++ code in python fwiw

Maybe it is inaccurate to describe this as dynamic linking; we just need the code to be position-independent, so the above alone is fine. Python extensions expect that.

mahf708 May 28, 2024
Collaborator

If the above works, we can formalize completely in setup.py or whatever. I can handle those details when we are there.

bartgol May 28, 2024
Maintainer

Setting BUILD_SHARED_LIBS=ON should achieve roughly the same thing, except that it is a global property, so it gets set on all targets. I don't know if the exn_ext lib was linking against other targets, but if so, then all of its deps must be built with -fPIC.

mahf708 May 28, 2024
Collaborator

See, that's why I knew I shouldn't go down that path on my own ;)

In fact, the -fPIC error is the one that got me to do the set_target_properties(diagnostics PROPERTIES POSITION_INDEPENDENT_CODE TRUE) but it needed to be done for everything for it to work.

I left this in the following state

[ac.ngmahfouz@chrlogin1 scream]$ cd /home/ac.ngmahfouz/e3sm/scream/components/eamxx/src/diagnostics/exts
[ac.ngmahfouz@chrlogin1 exts]$ cat *
# NOTE: tests inside this if statement won't be built in a baselines-only build


function (createDiagTest test_name test_srcs)
  CreateUnitTest(${test_name} "${test_srcs}"
    LIBS diagnostics physics_share
    LABELS diagnostics
    ${ARGN})
endfunction ()

if (NOT SCREAM_ONLY_GENERATE_BASELINES)
  include(ScreamUtils)

  set(PYLIB_SRCS
  exn_ext.cpp
  )

  include(ScreamUtils)
    if(${CMAKE_VERSION} VERSION_GREATER_EQUAL "3.11.0")
    message(STATUS "Downloading Pybind11")
    include(FetchContent)

    FetchContent_Declare(pybind11 GIT_REPOSITORY https://github.com/pybind/pybind11.git GIT_TAG v2.10.4)
    FetchContent_MakeAvailable(pybind11)
  else()
    message(FATAL_ERROR "pybind11 is missing. Use CMake >= 3.11 or download it")
  endif()
  find_package(Python REQUIRED COMPONENTS Interpreter Development)
  # find_package(pybind11 REQUIRED)
  add_library(py_lib ${PYLIB_SRCS})
  target_compile_definitions(py_lib PUBLIC EAMXX_HAS_PY_LIB)
  target_compile_definitions(py_lib PRIVATE -DPY_LIB_CUSTOM_PATH="${CMAKE_CURRENT_SOURCE_DIR}")
  target_include_directories(py_lib SYSTEM PUBLIC ${PYTHON_INCLUDE_DIRS})
  target_link_libraries(py_lib PUBLIC diagnostics pybind11::pybind11 Python::Python)
  # target_link_libraries(diagnostics PUBLIC scream_share)

  pybind11_add_module(exn_ext exn_ext.cpp)
  target_link_libraries(exn_ext PUBLIC diagnostics scream_share ekat kokkoscore pybind11::pybind11 Python::Python)
  set_target_properties(diagnostics PROPERTIES POSITION_INDEPENDENT_CODE TRUE)
  set_target_properties(scream_share PROPERTIES POSITION_INDEPENDENT_CODE TRUE)
  set_target_properties(ekat PROPERTIES POSITION_INDEPENDENT_CODE TRUE)
  set_target_properties(kokkoscore PROPERTIES POSITION_INDEPENDENT_CODE TRUE)

endif()

#include <pybind11/pybind11.h>

#include "diagnostics/register_diagnostics.hpp"
#include "share/grid/mesh_free_grids_manager.hpp"

namespace scream {
// std::shared_ptr<GridsManager> create_gm(const ekat::Comm &comm, const int ncols,
//                                         const int nlevs) {
//   const int num_global_cols = ncols * comm.size();

//   using vos_t = std::vector<std::string>;
//   ekat::ParameterList gm_params;
//   gm_params.set("grids_names", vos_t{"Point Grid"});
//   auto &pl = gm_params.sublist("Point Grid");
//   pl.set<std::string>("type", "point_grid");
//   pl.set("aliases", vos_t{"Physics"});
//   pl.set<int>("number_of_global_columns", num_global_cols);
//   pl.set<int>("number_of_vertical_levels", nlevs);

//   auto gm = create_mesh_free_grids_manager(comm, gm_params);
//   gm->build_grids();

//   return gm;
// }

ekat::Comm comm(MPI_COMM_WORLD);

// Create a grids manager - single column for these tests
constexpr int nlevs = 9;
const int ngcols    = 1; //* comm.size();

// auto gm   = create_gm(comm, ngcols, nlevs);
// auto grid = gm->get_grid("Physics");

// ekat::ParameterList params;
// auto &diag_factory = AtmosphereDiagnosticFactory::instance();
// auto diag          = diag_factory.create("Exner", comm, params);

  using namespace ShortFieldTagsNames;
  using namespace ekat::units;
// FieldLayout scalar2d_layout{{COL, LEV}, {ngcols, nlevs}};
// FieldIdentifier pm_fid("p_mid", scalar2d_layout, Pa, grid->name());
// Field pm(pm_fid);
// pm.allocate_view();
// pm.get_header().get_tracking().update_time_stamp(t0);

int calc(int i, int j) { return i + j + j; }

PYBIND11_MODULE(exn_ext, m) {
  m.doc() = "pybind11 example plugin";  // optional module docstring
  m.def("calc", &calc, "A function that calcs");
}

[ac.ngmahfouz@chrlogin1 exts]$

mahf708 May 28, 2024
Collaborator

^ be warned, not everything is needed. This is just me playing around to see what I could get to work in terms of linking, etc.

bartgol · 2024-05-28T23:54:41Z

bartgol
May 28, 2024
Maintainer

So, I got convinced that we can easily-ish achieve something that allows this kind of usage from python:

import pyscream as ps
import pyp3 as P3
import numpy as np

ncols = 100
nlevs = 128

v1 = np.zeros(shape=(ncols,nlevs))
v2 = np.zeros(shape=(ncols,nlevs))

for i in range(0,ncols):
    for j in range(0,nlevs):
        v1[i,j] = i*nlevs + j + 1
        v2[i,j] = j*ncols + i + 1

f1 = ps.Field("T_mid",v1)
f2 = ps.Field("qv",v2)
...

params = {}
p3 = P3.P3(params)
p3.set_fields(f1,f2,...)

t0 = ps.TimeStamp(...)
dt = 100
for val in range(blah1, blah2):
   p3.set_param('blah',val)
   p3.set_time(t0)
   p3.run(dt)
   
   # analyze f1,f2,...

# Avoid errors from kokkos
f1.cleanup()
f2.cleanup()
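As an aside, the nested loops that fill v1 and v2 in the example above can be replaced by NumPy reshapes. This is a sketch, not part of the proposed pyscream interface; `make_test_fields` is a hypothetical helper name, and the float64 dtype matches what `np.zeros` produces by default.

```python
import numpy as np


def make_test_fields(ncols, nlevs):
    """Build the same v1 and v2 test arrays as the nested loops, without Python-level loops."""
    # v1[i, j] = i*nlevs + j + 1: count from 1 in row-major order
    v1 = np.arange(1, ncols * nlevs + 1, dtype=np.float64).reshape(ncols, nlevs)
    # v2[i, j] = j*ncols + i + 1: the same count laid out column-major, then transposed
    v2 = np.arange(1, ncols * nlevs + 1, dtype=np.float64).reshape(nlevs, ncols).T
    # .T is a non-contiguous view; copy it so both arrays are C-contiguous
    return v1, np.ascontiguousarray(v2)


v1, v2 = make_test_fields(ncols=100, nlevs=128)
print(v1.shape, v2.shape)
```

The explicit `np.ascontiguousarray` matters if the arrays are later handed to compiled code by pointer, since a transposed view does not own a contiguous buffer in the expected order.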

2 replies

bartgol May 28, 2024
Maintainer

I already tinkered a bit with the part about creating numpy arrays and passing the pointers to pyscream to create Fields, and it works fine (pyscream can modify entries, and python will see the change upon return). It didn't take a lot of work. So long as this kind of "basic" usage is what we need, the rest can follow suit in a matter of a few days of work (perhaps a few weeks if we want to make it robust and add a bit more flexibility).
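The pointer-sharing behavior described here can be illustrated without any compiled code. In the sketch below, a plain Python function stands in for the C++ side: it receives only a raw address and a length, writes through the pointer with ctypes, and the NumPy array sees the change because no copy is made. `scale_in_place` is a hypothetical stand-in and not part of pyscream; the double-precision element type is also an assumption.

```python
import ctypes

import numpy as np


def scale_in_place(address, n, factor):
    """Stand-in for compiled code: write through a raw pointer to n doubles."""
    buf = (ctypes.c_double * n).from_address(address)
    for k in range(n):
        buf[k] *= factor


v = np.arange(6, dtype=np.float64).reshape(2, 3)

# The raw pointer is only meaningful for a contiguous buffer of the expected type
assert v.flags["C_CONTIGUOUS"] and v.dtype == np.float64

scale_in_place(v.ctypes.data, v.size, 2.0)
print(v)  # every entry has been doubled in place
```

Because the array and the "compiled" side share one buffer, the Python array must stay alive for as long as the other side holds the pointer.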

PeterCaldwell May 29, 2024
Maintainer Author

Your example looks great to me, Luca. My feeling is that we should expose the EAMxx processes and expect users to be savvy enough to look in the C++ code to figure out what the input/output variables are and what their types are, then with a little bit of user-guide-ing figure out how to get python variables into the data types needed to pass to the process. My thinking is that this would be easier for us than having to write/maintain wrapper functions and will never go out of date (since the user is interacting directly with the typical code). It is also consistent with my rant this morning that modelers shouldn't have to do all the work for analysts writing papers. But maybe I'm underestimating the difficulty of connecting directly to the C++ code.

crterai · 2024-05-30T16:44:09Z

crterai
May 30, 2024
Collaborator

This is a neat capability! I wanted to up-vote Hassan's use case of running standalone RRTMGP that @PeterCaldwell brought up in the original post. This is analysis that we'd want to include in the Cess overview paper.

8 replies

mahf708 May 30, 2024
Collaborator

This is something I also very much want personally (but I didn't want to suggest it because I thought it would be much harder). I can articulate the use cases in more detail soon. But, Luca, for RRTMGP, we may want full-resolution support; essentially, think of it as extracting the main radiation call in the unit test.

bartgol May 30, 2024
Maintainer

"full-resolution" as opposed to ?

mahf708 May 30, 2024
Collaborator

For the case of p3, it will be okay to just do single-column. For RRTMGP, we essentially want an actual model rerun of a specific IC/config. Like the full grid.

crterai May 30, 2024
Collaborator

For the Cess analysis case, running RRTMGP on a single-column is sufficient.

mahf708 May 31, 2024
Collaborator

Okay, so I take back my statement. My use case is sufficiently different. In case there's a difference in terms of impl, I would just prioritize what's needed for Cess for now. I won't get around to doing anything with radiation until the fall (that far ahead is impossible to plan for, not least because it isn't clear whether I will still be using this account).

mahf708 · 2024-05-30T16:49:42Z

mahf708
May 30, 2024
Collaborator

For reference, the PR with the initial impl is here: #2851. For more technical comments, like code, etc., please also consider adding them there if you have opinions...

1 reply

bartgol May 31, 2024
Maintainer

That PR is getting quite mature. I think it will get integrated early next week. If you want to take a look, now is a good time to suggest drastic changes of direction.

bartgol · 2024-05-31T04:48:10Z

bartgol
May 31, 2024
Maintainer

@mahf708 I think we could integrate that PR sooner rather than later, but if you want to wait so we can reorganize the pyXYZ files, so that they all lie in src/python and we create a single module, that's fine with me.

1 reply

mahf708 May 31, 2024
Collaborator

Sounds good. Let's talk soon to discuss. Let's hold off merging until after the weekend though (I want to see if I can pull off the packaging and set up public ci while at it)
