Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Staged cable improve resubmission and restart handling #537

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

Whyborn
Copy link
Contributor

@Whyborn Whyborn commented Nov 12, 2024

Improvements to staged_cable driver

I made some improvements to the staged_cable driver to hopefully make it more easily understandable, and more easily shareable. Summary of changes:

  • Changed the way older restarts are handled. Previously, when setting up the current work directory, we would also run through older restart directories to retrieve any "unique" restarts (any with file name not matching a current restart). I've changed it so that at the end of a stage, we grab any "unique" restart from the previous restarts and copy it into the current restart work directory. Quick example: we've finished stage 1, which created restarts climate_rst.nc and cable_rst.nc, which live in {archive}/restart000. We run stage 2, which created restarts casa_rst.nc and cable_nc.nc. We look back at {archive}/restart000, see that there's no new version of climate_rst.nc, so copy that to the work restart directory. The contents of the work restart directory are then moved to {archive}/restart001. This makes restarts much more easily shareable.
  • Changed the way the remaining number of runs are set. Previously, on initialisation of the model, the configuration log was read and the environment variable dictating the number of runs was modified. I realised I can set expt.n_runs instead of the environment variable, which is persistent. Now the configuration log work is only done where necessary in setup() and archive```, rather than every initialisation.

Coming with this is the removal of a fair bit of now-dead code.

Edit: One thing I forgot to mention is the handling of potentially new namelists. I would like it to be possible for people adding new science (or perhaps cable configurations I don't know about yet) to be able to use the tool. Unfortunately, I can't know a priori which namelists will be required in these instances. To get around this, I could simply set

self.optional_config_files = glob.glob('*/nml')

rather than the predefined list of namelists, but it feels like a bit of an abuse of process. Thoughts?

Now only build the config log during setup, as I learned that expt.n_runs can be set to control resubmission rather than setting the environment variable directly. During archiving, any restarts from the previous stage not updated by the current stage are copied to the next set of restarts, so all that is required for portable restarting is a single restart directory.
Original implementation searched for older restarts on setup, now we copy any non-duplicate restarts into the next stage restarts. This allows a single restart to begin any configuration.
The tests required some changes to the configuration and stage config paths. Some minor changes to the tests to account for slightly changed behaviours.
@pep8speaks
Copy link

pep8speaks commented Nov 12, 2024

Hello @Whyborn! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2024-12-04 04:17:35 UTC

@coveralls
Copy link

coveralls commented Nov 12, 2024

Coverage Status

coverage: 58.784% (+0.4%) from 58.364%
when pulling 7a399f5 on staged-cable-improve-resubmission-and-restart-handling
into 27aac37 on master.

@aidanheerdegen
Copy link
Collaborator

I wanted to add Claire to the reviewers, but she doesn't appear in the list of reviewers.

I've made a team for ACCESS-NRI and this is rectified

Copy link
Collaborator

@aidanheerdegen aidanheerdegen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me, but I've not looked at the logic deeply, just had a skim and commented on things that stood out to me.

payu/models/staged_cable.py Outdated Show resolved Hide resolved
Comment on lines +246 to +248
shutil.copy(os.path.join(prior_restart_path, f),
self.work_restart_path)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be possible to just make links, which will resolve to their original location IIRC.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the scenario where we want to share a restart directory with someone else, or perhaps even publicly somewhere, so that they can use it as a start point for simulations? I think it's likely enough that the original run would be somewhere that others won't have read permissions to, or even potentially on other machines.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked into the code changes but I don't see how making a copy solves problems of read permissions? You need read access to be able to copy files. So links would just be as well.

Copy link
Contributor Author

@Whyborn Whyborn Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking of a situation where someone (e.g. ACCESS-NRI) has run an expensive spin-up configuration and want to provide the restarts as a starting point for others in the community. The configuration may have 20+ stages, but the full suite of restarts may contain a restart from the 1st stage, one from the 5th stage and the rest from the most recent stage. A supported payu workflow is to take a restart from elsewhere and effectively plug it in, so payu transparently restarts from this restart (I think this is true- I have memory of this being demonstrated on the training day).

A scenario where I run a configuration to create a starting point for community science. I'll likely save the data to my non-permanent scratch space. I want to take the most recent payu set of restarts and store it someone accessible on gdata. If the necessary older restarts are only known by symlink, then they'll disappear relatively soon and whichever user takes this payu restart may not have access to my scratch space where the symlinked files live.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use cp with the -L option to "dereference" links, so the destination will contain all the files referenced by links

https://superuser.com/a/216920

ccarouge
ccarouge previously approved these changes Nov 18, 2024
Copy link
Contributor

@ccarouge ccarouge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks ok to me.

I had to go read the whole file to remind myself of what was going on here, so I have a comment on something you haven't changed. Feel free to ignore.

In this block of code:

for namelist in namelists:
write_target = os.path.join(self.work_input_path, namelist)
stage_nml = os.path.join(self.control_path, stage_name, namelist)
if os.path.isfile(os.path.join(self.control_path, namelist)):
# Instance where there is a master and stage namelist
with open(stage_nml) as stage_nml_f:
stage_namelist = f90nml.read(stage_nml_f)
master_nml = os.path.join(self.control_path, namelist)
f90nml.patch(master_nml, stage_namelist, write_target)

I think it would be more readable if you define master_nml at the same place as stage_nml and use it in the IF condition on file existence (l.182):

if os.path.isfile(master_nml): 

test/models/test_staged_cable.py Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants