ACCESS-OM3 Deployment on Gadi Killed during esmf build Phase #8

Closed
CodeGat opened this issue Apr 25, 2024 · 6 comments · Fixed by ACCESS-NRI/build-cd#61

@CodeGat
Contributor

CodeGat commented Apr 25, 2024

Pull request in question: #5

Within the above pull request, during the deployment of ACCESS-OM3, our jobs seemingly get killed by Gadi during esmf's build phase. Examples of these runs are at the bottom of the PR.

It may be that Gadi is killing the job because of an out-of-memory (OOM) error during that phase.
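One way to probe the out-of-memory theory is to look at how the killed process ended. On Linux, a process killed by the kernel's OOM killer receives SIGKILL, and Python's `subprocess` reports death by signal as a negative return code. Below is a minimal sketch; `killed_by_signal` is a hypothetical helper, not part of the deployment workflow. It only applies to the process that was directly killed, and SIGKILL alone does not prove an OOM condition, since other limits can end a process the same way.

```python
import signal
import subprocess
import sys


def killed_by_signal(returncode):
    """Return the signal name if the process died from a signal, else None."""
    # On POSIX, subprocess reports death-by-signal as a negative return code.
    if returncode < 0:
        return signal.Signals(-returncode).name
    return None


# Demo: a child process that sends SIGKILL to itself, standing in for a
# build step that was killed by the system.
child = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)
print(killed_by_signal(child.returncode))
```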

Potential Solutions

  • Find a way to parallelize the esmf build phase?

Pinging @micaeljtoliveira - have you had any experience with an issue like this when building ACCESS-OM3 on Gadi?

@micaeljtoliveira

@CodeGat Where is the compilation happening? On a login or a compute node?

> Find a way to parallelize esmf build phase?

By default, spack does parallelize builds. What it does not do by default is parallelize over builds, but that's a bit more tricky and not relevant here.

@CodeGat
Contributor Author

CodeGat commented Apr 26, 2024

The compilation is happening on a login node.

@micaeljtoliveira

There are several limits enforced on the login nodes regarding memory usage and CPU time. I strongly recommend you move the builds to a compute node. That will require submitting a job though...
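To see some of the limits that apply to a session, the standard `resource` module can report the soft and hard limits for CPU time and address space. This is only a sketch: `login_limits` is a hypothetical helper, and it assumes the limits are exposed through the usual per-process resource limits. Limits enforced by other mechanisms on Gadi (cgroups, for example) would not show up here.

```python
import resource


def login_limits():
    """Collect the soft/hard limits for CPU time and address space."""
    names = {
        "cpu_seconds": resource.RLIMIT_CPU,
        "address_space_bytes": resource.RLIMIT_AS,
    }
    limits = {}
    for label, rlimit in names.items():
        soft, hard = resource.getrlimit(rlimit)
        # RLIM_INFINITY means no limit is set through this mechanism.
        limits[label] = tuple(
            None if value == resource.RLIM_INFINITY else value
            for value in (soft, hard)
        )
    return limits


for label, (soft, hard) in login_limits().items():
    print(f"{label}: soft={soft} hard={hard}")
```

Running this on both a login node and inside a compute job would show whether the two environments differ in these particular limits.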

@aidanheerdegen
Member

> That will require submitting a job though...

And the attendant wait in the queue for every build job. Would really like to avoid that if possible.

@CodeGat
Contributor Author

CodeGat commented Apr 30, 2024

Have confirmed that, for install of access-om3 with spack -d install --fresh on Gadi:

  • --jobs 8 fails on esmf build stage
  • --jobs 4 succeeds

Will need to test how long it takes to install something like om2 with --jobs 4 vs not specifying --jobs.
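A rough way to run that comparison is to time the same install command with and without `--jobs`. The sketch below is an assumption about how one might do it, not what the deployment workflow does: `install_command` and `time_command` are hypothetical helpers, and the demo times a harmless command rather than spack itself. Each real run would need to start from the same state (for example, nothing already installed), otherwise the timings are not comparable.

```python
import subprocess
import sys
import time


def install_command(jobs=None):
    """Build the spack command from this thread, optionally with --jobs."""
    cmd = ["spack", "-d", "install", "--fresh"]
    if jobs is not None:
        cmd += ["--jobs", str(jobs)]
    return cmd


def time_command(cmd):
    """Run cmd and return (return code, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = subprocess.run(cmd)
    return result.returncode, time.perf_counter() - start


# The two variants being compared.
print(" ".join(install_command()))
print(" ".join(install_command(jobs=4)))

# Demo of the timer on a harmless command instead of a real install.
code, seconds = time_command([sys.executable, "-c", "pass"])
print(code, f"{seconds:.2f}s")
```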

@CodeGat
Contributor Author

CodeGat commented May 1, 2024

Just to finish up on this conversation thread: the job count is now set per model via vars.SPACK_INSTALL_PARALLEL_JOBS rather than an explicit, global --jobs 4.
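As an illustration of the per-model idea, the sketch below turns such a variable into arguments for `spack install`. It assumes the value is exposed to the build step as an environment variable of the same name and that leaving it unset means "use spack's default"; both are assumptions, and `jobs_args` is a hypothetical helper rather than the code merged in build-cd.

```python
import os


def jobs_args(environ=os.environ):
    """Return ['--jobs', N] if SPACK_INSTALL_PARALLEL_JOBS is set, else []."""
    value = environ.get("SPACK_INSTALL_PARALLEL_JOBS", "").strip()
    if not value:
        # Unset or empty: pass no flag and let spack pick its default.
        return []
    jobs = int(value)  # raises ValueError for non-numeric input
    if jobs < 1:
        raise ValueError("SPACK_INSTALL_PARALLEL_JOBS must be a positive integer")
    return ["--jobs", str(jobs)]


print(jobs_args({"SPACK_INSTALL_PARALLEL_JOBS": "4"}))  # ['--jobs', '4']
print(jobs_args({}))  # []
```

A model that builds fine at full parallelism simply leaves the variable unset, while one like ACCESS-OM3 can be capped at 4.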
