Staged cable improve resubmission and restart handling #537

Whyborn · 2024-11-12T01:19:21Z

Improvements to `staged_cable` driver

I made some improvements to the staged_cable driver to hopefully make it more easily understandable, and more easily shareable. Summary of changes:

Changed the way older restarts are handled. Previously, when setting up the current work directory, we would also run through older restart directories to retrieve any "unique" restarts (any with file name not matching a current restart). I've changed it so that at the end of a stage, we grab any "unique" restart from the previous restarts and copy it into the current restart work directory. Quick example: we've finished stage 1, which created restarts climate_rst.nc and cable_rst.nc, which live in {archive}/restart000. We run stage 2, which created restarts casa_rst.nc and cable_nc.nc. We look back at {archive}/restart000, see that there's no new version of climate_rst.nc, so copy that to the work restart directory. The contents of the work restart directory are then moved to {archive}/restart001. This makes restarts much more easily shareable.
Changed the way the remaining number of runs are set. Previously, on initialisation of the model, the configuration log was read and the environment variable dictating the number of runs was modified. I realised I can set expt.n_runs instead of the environment variable, which is persistent. Now the configuration log work is only done where necessary in setup() and archive```, rather than every initialisation.

Coming with this is the removal of a fair bit of now-dead code.

Edit: One thing I forgot to mention is the handling of potentially new namelists. I would like it to be possible for people adding new science (or perhaps cable configurations I don't know about yet) to be able to use the tool. Unfortunately, I can't know a priori which namelists will be required in these instances. To get around this, I could simply set

self.optional_config_files = glob.glob('*/nml')

rather than the predefined list of namelists, but it feels like a bit of an abuse of process. Thoughts?

Now only build the config log during setup, as I learned that expt.n_runs can be set to control resubmission rather than setting the environment variable directly. During archiving, any restarts from the previous stage not updated by the current stage are copied to the next set of restarts, so all that is required for portable restarting is a single restart directory.

Original implementation searched for older restarts on setup, now we copy any non-duplicate restarts into the next stage restarts. This allows a single restart to begin any configuration.

The tests required some changes to the configuration and stage config paths. Some minor changes to the tests to account for slightly changed behaviours.

pep8speaks · 2024-11-12T01:19:27Z

Hello @Whyborn! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2024-12-04 04:17:35 UTC

coveralls · 2024-11-12T01:22:32Z

coverage: 58.784% (+0.4%) from 58.364%
when pulling 7a399f5 on staged-cable-improve-resubmission-and-restart-handling
into 27aac37 on master.

aidanheerdegen · 2024-11-12T05:42:36Z

I wanted to add Claire to the reviewers, but she doesn't appear in the list of reviewers.

I've made a team for ACCESS-NRI and this is rectified

aidanheerdegen

Looks fine to me, but I've not looked at the logic deeply, just had a skim and commented on things that stood out to me.

payu/models/staged_cable.py

aidanheerdegen · 2024-11-15T01:44:09Z

payu/models/staged_cable.py

+                    shutil.copy(os.path.join(prior_restart_path, f),
+                                self.work_restart_path)
+


I think it should be possible to just make links, which will resolve to their original location IIRC.

What about the scenario where we want to share a restart directory with someone else, or perhaps even publicly somewhere, so that they can use it as a start point for simulations? I think it's likely enough that the original run would be somewhere that others won't have read permissions to, or even potentially on other machines.

I haven't looked into the code changes but I don't see how making a copy solves problems of read permissions? You need read access to be able to copy files. So links would just be as well.

I'm thinking of a situation where someone (e.g. ACCESS-NRI) has run an expensive spin-up configuration and want to provide the restarts as a starting point for others in the community. The configuration may have 20+ stages, but the full suite of restarts may contain a restart from the 1st stage, one from the 5th stage and the rest from the most recent stage. A supported payu workflow is to take a restart from elsewhere and effectively plug it in, so payu transparently restarts from this restart (I think this is true- I have memory of this being demonstrated on the training day).

A scenario where I run a configuration to create a starting point for community science. I'll likely save the data to my non-permanent scratch space. I want to take the most recent payu set of restarts and store it someone accessible on gdata. If the necessary older restarts are only known by symlink, then they'll disappear relatively soon and whichever user takes this payu restart may not have access to my scratch space where the symlinked files live.

You can use cp with the -L option to "dereference" links, so the destination will contain all the files referenced by links

https://superuser.com/a/216920

ccarouge

It looks ok to me.

I had to go read the whole file to remind myself of what was going on here, so I have a comment on something you haven't changed. Feel free to ignore.

In this block of code:

payu/payu/models/staged_cable.py

Lines 178 to 188 in b69b26d

    
           for namelist in namelists: 
        
               write_target = os.path.join(self.work_input_path, namelist) 
        
               stage_nml = os.path.join(self.control_path, stage_name, namelist) 
        
               if os.path.isfile(os.path.join(self.control_path, namelist)): 
        
                   # Instance where there is a master and stage namelist 
        
                   with open(stage_nml) as stage_nml_f: 
        
                       stage_namelist = f90nml.read(stage_nml_f) 
        
                   master_nml = os.path.join(self.control_path, namelist) 
        
                   f90nml.patch(master_nml, stage_namelist, write_target)

I think it would be more readable if you define master_nml at the same place as stage_nml and use it in the IF condition on file existence (l.182):

if os.path.isfile(master_nml):

test/models/test_staged_cable.py

Whyborn added 3 commits November 6, 2024 15:33

Add older necessary restarts to most recent restart dir

4cf0e1b

Original implementation searched for older restarts on setup, now we copy any non-duplicate restarts into the next stage restarts. This allows a single restart to begin any configuration.

Some robustness changes picked up to pass testing

1218e2a

The tests required some changes to the configuration and stage config paths. Some minor changes to the tests to account for slightly changed behaviours.

Whyborn requested a review from aidanheerdegen November 12, 2024 01:19

Pep8 compliance

b69b26d

aidanheerdegen requested a review from ccarouge November 12, 2024 05:42

aidanheerdegen reviewed Nov 15, 2024

View reviewed changes

ccarouge previously approved these changes Nov 18, 2024

View reviewed changes

test/models/test_staged_cable.py Show resolved Hide resolved

Add all namelists to optional_config_files and improved clarity

7a399f5

Whyborn dismissed ccarouge’s stale review via 7a399f5 December 4, 2024 04:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Staged cable improve resubmission and restart handling #537

Staged cable improve resubmission and restart handling #537

Whyborn commented Nov 12, 2024 •

edited

Loading

pep8speaks commented Nov 12, 2024 •

edited

Loading

coveralls commented Nov 12, 2024 •

edited

Loading

aidanheerdegen commented Nov 12, 2024

aidanheerdegen left a comment

aidanheerdegen Nov 15, 2024

Whyborn Nov 18, 2024

ccarouge Nov 18, 2024

Whyborn Nov 18, 2024 •

edited

Loading

aidanheerdegen Nov 20, 2024

ccarouge left a comment

		shutil.copy(os.path.join(prior_restart_path, f),
		self.work_restart_path)

	for namelist in namelists:
	write_target = os.path.join(self.work_input_path, namelist)
	stage_nml = os.path.join(self.control_path, stage_name, namelist)

	if os.path.isfile(os.path.join(self.control_path, namelist)):
	# Instance where there is a master and stage namelist
	with open(stage_nml) as stage_nml_f:
	stage_namelist = f90nml.read(stage_nml_f)

	master_nml = os.path.join(self.control_path, namelist)
	f90nml.patch(master_nml, stage_namelist, write_target)

Staged cable improve resubmission and restart handling #537

Are you sure you want to change the base?

Staged cable improve resubmission and restart handling #537

Conversation

Whyborn commented Nov 12, 2024 • edited Loading

Improvements to staged_cable driver

pep8speaks commented Nov 12, 2024 • edited Loading

Comment last updated at 2024-12-04 04:17:35 UTC

coveralls commented Nov 12, 2024 • edited Loading

aidanheerdegen commented Nov 12, 2024

aidanheerdegen left a comment

Choose a reason for hiding this comment

aidanheerdegen Nov 15, 2024

Choose a reason for hiding this comment

Whyborn Nov 18, 2024

Choose a reason for hiding this comment

ccarouge Nov 18, 2024

Choose a reason for hiding this comment

Whyborn Nov 18, 2024 • edited Loading

Choose a reason for hiding this comment

aidanheerdegen Nov 20, 2024

Choose a reason for hiding this comment

ccarouge left a comment

Choose a reason for hiding this comment

Whyborn commented Nov 12, 2024 •

edited

Loading

Improvements to `staged_cable` driver

pep8speaks commented Nov 12, 2024 •

edited

Loading

coveralls commented Nov 12, 2024 •

edited

Loading

Whyborn Nov 18, 2024 •

edited

Loading