C768 analysis tasks Fail on Hera #2498
Comments
@spanNOAA Have you compiled UFS with the unstructured wave grids (option …)?
No, I compiled the global workflow only using the '-g' option.
@spanNOAA Are you using the top of the develop branch?
Yes, I'm using the develop branch.
FYI, this problem was only observed with C768. I have no issue with C384.
@spanNOAA Can you please point me to your g-w develop branch path on RDHPCS Hera?
The local repo is at: /scratch2/BMC/wrfruc/Sijie.Pan/ufs-ar/arfs.
@spanNOAA Thank you. Can you please check out and/or update your current develop branch?
That will ensure that the executable is both up-to-date and able to use the unstructured wave grids. Can you then rerun your C768 experiment to see if the same exceptions are raised?
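For reference, a minimal sketch of what that update and rebuild might look like, assuming a submodule-based global-workflow checkout and the standard sorc/ build scripts; only the '-g' flag is confirmed elsewhere in this thread, the clone path is taken from the comment above, and older checkouts used sorc/checkout.sh instead of submodules.
# sketch only -- adjust paths and options to your setup
cd /scratch2/BMC/wrfruc/Sijie.Pan/ufs-ar/arfs   # path from the comment above
git checkout develop
git pull
git submodule update --init --recursive          # assumes a submodule-based checkout
cd sorc
./build_all.sh -g                                # '-g' was used for the original build; add the
                                                 # wave-grid option referenced above if needed
./link_workflow.sh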
Certainly. But before doing so, may I ask two questions:
These are analysis jobs and have nothing to do with the UFS build. C768 is not a resolution we test regularly, and we tend to discourage people from running C768 on Hera anyway because the machine is small. How much larger did you try making the wallclock? Have you tried increasing the number of cores instead, or as well?
When you mention checking out and/or updating my current develop branch, are you indicating that the entire global workflow needs updating, or is it solely the ufs model that requires updating?
Additionally, when you tried increasing the wallclock, did you regenerate your rocoto XML afterwards?
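For context, a sketch of what regenerating the XML involves after a wallclock change, assuming the standard workflow tooling (setup_xml.py); the paths below are placeholders.
# sketch only -- after editing the wallclock (e.g. in the experiment's config.resources
# under $EXPDIR), the rocoto XML must be regenerated so the new limit is picked up
cd /path/to/global-workflow/workflow
./setup_xml.py /path/to/EXPDIR/C768_6hourly_0210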
These failures are in the analysis job. It is unlikely anything with UFS or its build is the problem here.
I attempted wallclock settings ranging from 10 to 40 minutes, but none of them worked. When the wallclock was set to 20 minutes or more, the program consistently stalled at the same point.
Okay, I'm going to check your full log and see if I can find anything; otherwise I might need to get a specialist to look at it.
I really appreciate it.
Looking at sfcanl, the problem seems to be in
Since the ranks are tiles, they should all have similar run times. I think this points back to a memory issue. Try changing the resource request to:
That should be overkill, but if it works we can try dialing it back.
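The resource request suggested here did not survive extraction, so the following is only a hypothetical illustration of the kind of change meant, written in config.resources style; the variable names and values are assumptions, not the ones actually recommended in this comment.
# hypothetical illustration only -- variable names differ between workflow versions
step="sfcanl"
case ${step} in
  "sfcanl")
    export wtime_sfcanl="00:30:00"   # wallclock limit
    export npe_sfcanl=6              # one MPI task per tile
    export nth_sfcanl=1              # threads per task
    export npe_node_sfcanl=1         # one task per node -> six nodes, maximizing memory per task
    ;;
esac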
The problem remains despite increasing the nodes to 6.
@GeorgeGayno-NOAA I'm out of simple ideas, can you take a look at this issue?
Sure. The
@spanNOAA - Is it always the same tiles/MPI tasks that hang or is it random? Are you using a recent version of 'develop', which works on Rocky 8?
It's not random. Every time, the tasks for tiles 3-5 stall. While I'm not using the latest version of the 'develop' branch, it does support Rocky 8. The hash of the global workflow I'm using is d6be3b5.
Let me try to run the cycle step myself. Don't delete your working directories.
I was able to run your test case using my own stand-alone script: /scratch1/NCEPDEV/da/George.Gayno/cycle.broke. If I just run tile 1, there is a bottleneck in the interpolation of the GLDAS soil moisture to the tile:
The interpolation for month=1 takes 6:30 minutes. And there are many uninterpolated points:
The UFS_UTILS C768 regression test, which uses a non-fractional grid, runs very quickly. And there are very few uninterpolated points:
The C48 regression test uses a fractional grid. It runs quickly, but there is a very high percentage of uninterpolated points:
Maybe there is a problem with how the interpolation mask is being set up for fractional grids?
Could you provide guidance on setting up the interpolation mask correctly for fractional grids? Also, as we're going to run 3-week analysis-forecast cycles, I'm curious about the potential impact of using non-fractional grids instead of fractional grids.
I think the mask problem is a bug in the global_cycle code. I will need to run some tests.
@spanNOAA - I found the problem and have a fix. What hash of ccpp-physics are you using?
I checked the CMakeLists.txt file located at sorc/ufs_utils.fd/ccpp-physics, and it shows that the ccpp version is 5.0.0.
For sorc/ufs_utils.fd/ccpp-physics:
For sorc/ufs_model.fd/FV3/ccpp/physics:
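A simple way to report the git hashes being asked for (rather than the CMake project version), assuming both paths are git checkouts or submodules; run from the top of the global-workflow clone.
# print the short hashes of the two ccpp-physics copies named above
git -C sorc/ufs_utils.fd/ccpp-physics rev-parse --short HEAD
git -C sorc/ufs_model.fd/FV3/ccpp/physics rev-parse --short HEAD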
I have a fix. Replace the version of sfcsub.F in
It should now run with only six MPI tasks, one task per tile.
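A sketch of applying such a fix; the location of the corrected sfcsub.F was not preserved above, so the source path is a placeholder, and the destination path and rebuild script are assumptions based on where UFS_UTILS keeps its ccpp-physics copy.
# sketch only -- run from the top of the global-workflow clone
FIXED_SFCSUB=/path/to/corrected/sfcsub.F                        # placeholder
cp "${FIXED_SFCSUB}" sorc/ufs_utils.fd/ccpp-physics/physics/sfcsub.F
cd sorc && ./build_ufs_utils.sh                                 # rebuild global_cycle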
The fix successfully resolves the issues for both gdassfcanl and gfssfcanl. Both tasks now complete without any problems.
The C768 gdasanalcalc failure on Hera was examined, with the following findings. Job gdasanalcalc copies
Able to reproduce this behavior in a stand-alone shell script which executes
Script
whereas
Both scripts execute
The parallel xml specifies
for the gfs and gdas analcalc jobs. The analcalc job runs several executables. I do not have a solution for the Hera hang in gdasanalcalc at C768. I am simply sharing what the tests reveal.
Hi @RussTreadon-NOAA, just following up on the Hera hang issue in gdasanalcalc at C768 that we discussed about a month ago. You mentioned that there wasn't a solution available at that time and shared some test results. I wanted to check in to see if there have been any updates or progress on resolving this issue since then.
@spanNOAA, no updates from me. I am not actively working on this issue.
@SamuelTrahanNOAA Could you take a look at this issue? Thanks!
I am looking into this. Presently, I am not able to cycle past the first half-cycle due to OOM errors, so that will need to be resolved first.
I do not have a solution for this yet, either, but I do have some additional details. The hang occurs at line 390 of driver.F90.
The issue appears to be the size of the buffer that is sent via
@DavidHuber-NOAA
@guoqing-noaa I have opened PR #2819. The branch has other C768 fixes in it that will be helpful for testing. I had another problem with the analysis UPP job, so this is still a work in progress.
Thanks, @DavidHuber-NOAA
@DavidHuber-NOAA I have no issues with the C768 gdasanalcalc task after applying this fix.
What is wrong?
The gdassfcanl, gfssfcanl, and gdasanalcalc tasks fail starting with the second cycle. Regardless of the wallclock limit set for the job, the tasks consistently exceed the time limit.
I am attempting to run the simulations starting from 2023021018 and ending 2023022618.
Brief snippet of error from the gdassfcanl.log and gfssfcanl.log files for the 2023021100 forecast cycle:
0: update OUTPUT SFC DATA TO: ./fnbgso.001
0:
0: CYCLE PROGRAM COMPLETED NORMALLY ON RANK: 0
0: slurmstepd: error: *** STEP 58349057.0 ON h34m13 CANCELLED AT 2024-04-16T21:54:15 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 58349057 ON h34m13 CANCELLED AT 2024-04-16T21:54:15 DUE TO TIME LIMIT ***
Start Epilog on node h34m13 for job 58349057 :: Tue Apr 16 21:54:17 UTC 2024
Job 58349057 finished for user Sijie.Pan in partition hera with exit code 0:0
End Epilogue Tue Apr 16 21:54:17 UTC 2024
Brief snippet of error from the gdasanalcalc.log file for the 2023021100 forecast cycle:
PROGRAM INTERP_INC HAS BEGUN. COMPILED 2019100.00 ORG: EMC
STARTING DATE-TIME APR 15,2024 17:16:27.299 106 MON 2460416
srun: Complete StepId=58250207.0 received
slurmstepd: error: *** STEP 58250207.0 ON h1m01 CANCELLED AT 2024-04-15T17:36:15 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 58250207 ON h1m01 CANCELLED AT 2024-04-15T17:36:15 DUE TO TIME LIMIT ***
Start Epilog on node h1m01 for job 58250207 :: Mon Apr 15 17:36:18 UTC 2024
Job 58250207 finished for user Sijie.Pan in partition bigmem with exit code 0:0
End Epilogue Mon Apr 15 17:36:18 UTC 2024
What should have happened?
The tasks 'gdassfcanl', 'gfssfcanl', and 'gdasanalcalc' generate the respective files required for the remainder of the workflow to use.
What machines are impacted?
Hera
Steps to reproduce
./setup_expt.py gfs cycled --app ATM --pslot C768_6hourly_0210 --nens 80 --idate 2023021018 --edate 2023022618 --start cold --gfs_cyc 4 --resdetatmos 768 --resensatmos 384 --configdir /scratch2/BMC/wrfruc/Sijie.Pan/ufs-ar/arfs/parm/config/gfs --comroot ${COMROOT} --expdir ${EXPDIR} --icsdir /scratch2/BMC/wrfruc/Guoqing.Ge/ufs-ar/ICS/2023021018C768C384L128/output
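For completeness, a sketch of the remaining steps after setup_expt.py, assuming the standard workflow tooling; the XML and database names are placeholders that normally follow the --pslot value.
# sketch only -- generate the rocoto XML and start cycling
cd /scratch2/BMC/wrfruc/Sijie.Pan/ufs-ar/arfs/workflow
./setup_xml.py ${EXPDIR}/C768_6hourly_0210
rocotorun -w ${EXPDIR}/C768_6hourly_0210/C768_6hourly_0210.xml \
          -d ${EXPDIR}/C768_6hourly_0210/C768_6hourly_0210.db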
Additional information
You can find gdassfcanl.log, gfssfcanl.log and gdasanalcalc.log in the following directory:
/scratch2/BMC/wrfruc/Sijie.Pan/ufs-ar/comroot/C768_6hourly_0210/logs/2023021100
Do you have a proposed solution?
No response