
Add fixes to products for when REPLAY IC's are used #2755

Merged

Conversation

EricSinsky-NOAA
Contributor

@EricSinsky-NOAA EricSinsky-NOAA commented Jul 10, 2024

Description

This PR fixes a couple of issues that arise when replay initial conditions are used. These issues only occur when REPLAY_ICS is set to YES and OFFSET_START_HOUR is greater than 0. The following items are addressed in this PR.

  1. Fix an issue that causes ocean_prod tasks not to be triggered (issue #2725). A new diag_table (called diag_table_replay) was added that is used only when REPLAY_ICS is set to YES. This diag_table accounts for the offset that occurs when using replay ICs.
  2. Fix an issue that causes atmos_prod tasks not to be triggered for the first lead time (e.g. f003) (issue #2754). When OFFSET_START_HOUR is greater than 0, the first fhr is ${OFFSET_START_HOUR}+(${DELTIM}/3600), which is defined in forecast_predet.sh and allows data for the first lead time to be generated. The filename for this lead time is still labelled with OFFSET_START_HOUR (a minimal sketch of this computation follows the list below).
  3. Minor modifications were made to the extractvars task so that atmos data from replay cases can be processed.
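
For illustration, here is a minimal shell sketch of the first-fhr computation described in item 2. The variable names (REPLAY_ICS, OFFSET_START_HOUR, DELTIM) follow the PR description, but this is not the actual forecast_predet.sh code, and the default values below are assumed examples:

```bash
#!/bin/bash
# Sketch only: compute the first forecast hour when replay ICs are used with a
# positive start-hour offset. Default values are assumptions for illustration.
REPLAY_ICS=${REPLAY_ICS:-"YES"}
OFFSET_START_HOUR=${OFFSET_START_HOUR:-3}   # hours
DELTIM=${DELTIM:-450}                       # model time step in seconds (assumed)

if [[ "${REPLAY_ICS}" == "YES" && ${OFFSET_START_HOUR} -gt 0 ]]; then
  # First fhr = OFFSET_START_HOUR + DELTIM/3600, a fractional hour just past the
  # offset, so that output for the first lead time is actually generated.
  first_fhr=$(echo "scale=3; ${OFFSET_START_HOUR} + ${DELTIM}/3600" | bc)
else
  first_fhr=0
fi

# The output filename is still labelled with the integer offset hour, e.g. f003.
printf "first fhr = %s, labelled as f%03d\n" "${first_fhr}" "${OFFSET_START_HOUR}"
```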

This PR was split from PR #2680.

Refs #2725, #2754

Type of change

  • Bug fix (fixes something broken)

Change characteristics

  • Is this a breaking change (a change in existing functionality)? NO
  • Does this change require a documentation update? NO

How has this been tested?

These changes were cloned, built, and run on WCOSS2, and were tested by running GEFS.

Checklist

  • Any dependent changes have been merged and published
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • I have made corresponding changes to the documentation if necessary

Contributor

@aerorahul aerorahul left a comment


See some suggestions.

  • NDATE has been removed and replaced with date.
  • The determination of whether or not the replay_diag_table should be used has been moved to config.fcst.
  • The dest_file variable has been moved outside of the if-statement in forecast_postdet.
  • A bugfix has been added in exglobal_stage_ic to ensure ${MEMDIR:3} is not interpreted in base 8 (see the sketch after this list).
  • A fix has been added for cases where the fhr is a decimal number.
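
As an aside, a small hypothetical sketch of the base-8 pitfall that bugfix addresses (the "memNNN" form of MEMDIR is an assumption here, not code taken from exglobal_stage_ic):

```bash
# Hypothetical illustration: if MEMDIR looks like "mem008", then ${MEMDIR:3}
# yields "008", which bash arithmetic treats as an octal literal and rejects
# ("value too great for base"). Forcing base 10 avoids the problem.
MEMDIR="mem008"              # assumed example value
member="${MEMDIR:3}"         # -> "008"
imem=$(( 10#${member} ))     # -> 8, explicitly interpreted in base 10
echo "member index: ${imem}"
```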
@christopherwharrop-noaa
Contributor

Rocoto has very little control over hanging batch system commands. If a batch system is being hammered, and qstat starts taking a long time to run, all Rocoto can do is time out the commands and try again later.

I will say that PBSPro is highly prone to this type of behavior, as it has had scaling/threading problems in the past, which is one reason I was very surprised to see PBSPro selected as the batch system for WCOSS. It is also very easy to overload PBSPro with naive user behaviors. Rocoto has already been highly tuned to put as little load as possible on PBSPro, since these problems were first seen and addressed on Cheyenne.

@christopherwharrop-noaa
Contributor

If you (or others) have other processes that run qstat commands on WCOSS, you may want to audit those to find out what they are doing. PBSPro is very fragile, and someone running qstat or, even worse, pbsnodes, in the wrong way in automation could very easily cause problems.

@TerrenceMcGuinness-NOAA
Collaborator

TerrenceMcGuinness-NOAA commented Aug 12, 2024

The enkfgdascleanup job is PENDING in the queue for the C96C48_hybatmDA case on Hera.
Not sure what "launch failed requeued held" means.
It is not flagged as STALLED because the job is still reported as QUEUED while PENDING.

Terry.McGuinness (hfe03) C96C48_hybatmDA_57107f48 $ squeue -u $USER
     JOBID PARTITION  NAME                     USER             STATE        TIME TIME_LIMIT NODES NODELIST(REASON)
  64649768 hera       C96C48_hybatmDA_57107f48 Terry.McGuinness PENDING      0:00      15:00     1 (launch failed requeued held)
  64649463 hera       C96_atmaerosnowDA_57107f Terry.McGuinness PENDING      0:00      30:00    10 (launch failed requeued held)
  
Terry.McGuinness (hfe03) C96C48_hybatmDA_57107f48 $ rocotostat -w C96C48_hybatmDA_57107f48.xml -d C96C48_hybatmDA_57107f48.db | grep QUE
202112210000         enkfgdascleanup                    64649768              QUEUED                   -         0           0.0

Terry.McGuinness (hfe03) C96C48_hybatmDA_57107f48 $ rocotocheck -w C96C48_hybatmDA_57107f48.xml -d C96C48_hybatmDA_57107f48.db -c 202112210000 -t enkfgdascleanup

Task: enkfgdascleanup
  account: nems
  command: /scratch1/NCEPDEV/global/CI/2755/gfs/jobs/rocoto/cleanup.sh
  cores: 1
  cycledefs: gdas
  final: false
  jobname: C96C48_hybatmDA_57107f48_enkfgdascleanup_00
  join: /scratch1/NCEPDEV/global/CI/2755/RUNTESTS/COMROOT/C96C48_hybatmDA_57107f48/logs/2021122100/enkfgdascleanup.log
  maxtries: 2
  memory: 4096M
  name: enkfgdascleanup
  nodes: 1:ppn=1:tpp=1
  partition: hera
  queue: batch
  throttle: 9999999
  walltime: 00:15:00
  environment
    CDATE ==> 2021122100
    COMROOT ==> /scratch1/NCEPDEV/global/CI/2755/RUNTESTS/COMROOT
    DATAROOT ==> /scratch1/NCEPDEV/stmp2/Terry.McGuinness/RUNDIRS/C96C48_hybatmDA_57107f48/enkfgdas.2021122100
    EXPDIR ==> /scratch1/NCEPDEV/global/CI/2755/RUNTESTS/EXPDIR/C96C48_hybatmDA_57107f48
    HOMEgfs ==> /scratch1/NCEPDEV/global/CI/2755/gfs
    NET ==> gfs
    PDY ==> 20211221
    RUN ==> enkfgdas
    RUN_ENVIR ==> emc
    cyc ==> 00
  dependencies
    AND is satisfied
      SOME is satisfied
        enkfgdasearc00 of cycle 202112210000 is SUCCEEDED
        enkfgdasearc01 of cycle 202112210000 is SUCCEEDED

Cycle: 202112210000
  Valid for this task: YES
  State: active
  Activated: 2024-08-08 00:59:14 UTC
  Completed: -
  Expired: -

Job: 64649768
  State:  QUEUED (PENDING)
  Exit Status:  -
  Tries:  0
  Unknown count:  0
  Duration:  0.0
Terry.McGuinness (hfe03) C96C48_hybatmDA_57107f48 $ 

@DavidHuber-NOAA
Contributor

DavidHuber-NOAA commented Aug 12, 2024

@TerrenceMcGuinness-NOAA This looks like an issue with Slurm based on this conversation. Here is an excerpt from Slurm support:

this usually means the launch failed on the compute nodes for one reason or another, and instead of relaunching the job on the node and draining the queue, it holds it to gain further review from the user or the admin.

What I would do is look at the slurmd log from one of the nodes where the job ran and see why the job failed.

A common mistake is the spool dir isn't owned by user root (or the slurmd isn't run by user root).  But I am guessing this is a new thing and most other jobs have run with no issue.

If you could look at the log and see if you can find something let me know.

If you can't easily see anything, please attach the slurmd and slurmctld logs and I can look and see if I can find something.

If you can also attach your slurm.conf file that would be helpful as well.

I'd suggest you report it to RDHPCS.

@DavidHuber-NOAA I will return to this shortly. As far as I can tell (no log and no retries), this job didn't run. I still need to run Slurm queries on the Slurm job number.
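
For reference, a few standard Slurm queries that could be run against the held job (the job ID below is taken from the squeue output above; these are generic Slurm commands, not part of this PR):

```bash
# Show the scheduler's full view of the job, including the hold/requeue reason
scontrol show job 64649768

# Show just the job ID, state, and reason string
squeue -j 64649768 -o "%i %T %r"

# Accounting record (state, exit code, nodes) for the job and its steps
sacct -j 64649768 --format=JobID,JobName,State,ExitCode,Start,End,NodeList

# If the underlying node problem is resolved, release the hold so the job can run
scontrol release 64649768
```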

@TerrenceMcGuinness-NOAA TerrenceMcGuinness-NOAA added CI-Wcoss2-Running **Bot use only** CI testing on WCOSS for this PR is in-progress and removed CI-Wcoss2-Failed **Bot use only** CI testing on WCOSS for this PR has failed labels Aug 12, 2024
@TerrenceMcGuinness-NOAA
Collaborator

Made the suggested updates to the Rocoto configuration for timeouts and restarted all the CI cases for this PR on WCOSS2:

terry.mcguinness (clogin03) 1.3.5 $ cat rocotorc 
---
:DatabaseType: SQLite3
:WorkflowDocType: XML
:DatabaseServer: true
:BatchQueueServer: true
:WorkflowIOServer: true
:MaxUnknowns: 3
:MaxLogDays: 7
:AutoVacuum: true
:VacuumPurgeDays: 30
:SubmitThreads: 8
:JobQueueTimeout: 120
:JobAcctTimeout: 120

terry.mcguinness (clogin03) 1.3.5 $ crontab -l
MAILTO="terry.mcguinness@noaa.gov>"
SHELL=/bin/bash -l
ci_dir=/lfs/h2/emc/global/noscrub/terry.mcguinness/GW/global-workflow_ci/ci/scripts
d_cmd=date +%Y-%m-%d-%H:%M
*/4 * * * * d=$($d_cmd); ${ci_dir}/driver.sh >& ~/ci_logs/bash_driver_${d}.log || echo "ERROR in driver"
*/7 * * * * d=$($d_cmd); ${ci_dir}/run_ci.sh >& ~/ci_logs/run_ci_${d}.log || echo "ERROR in run_ci"
*/6 * * * * d=$($d_cmd); ${ci_dir}/check_ci.sh >& ~/ci_logs/check_ci_${d}.log || echo "ERROR in check_ci"

terry.mcguinness (clogin03) 1.3.5 $ q

cbqs01: 
                                                                 Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
148101461.cbqs01     terry.m* dev      C48_S2SW_* 129237   1 128    --  03:00 R 00:04
148101541.cbqs01     terry.m* dev      C96_atmae* 172194   1 128    --  00:20 R 00:02
148101616.cbqs01     terry.m* dev      C96C48_hy* 117488   4 500    --  01:20 R 00:02
148101617.cbqs01     terry.m* dev      C96C48_hy* 187580   1  80    --  00:15 R 00:02
148101618.cbqs01     terry.m* dev      C96C48_hy*  70962   4 500    --  01:00 R 00:02
148101695.cbqs01     terry.m* dev      C96C48_uf*  67337   1 128    --  00:20 R 00:01
148101706.cbqs01     terry.m* dev      C96C48_uf*  73148   1 128    --  00:20 R 00:01
148101707.cbqs01     terry.m* dev      C96C48_uf* 212313   1 128    --  00:20 R 00:01

terry.mcguinness (clogin03) 1.3.5 $ displaydb
2755 Open Running 0 ci_repo
terry.mcguinness (clogin03) 1.3.5 $ 

@emcbot emcbot added CI-Wcoss2-Failed **Bot use only** CI testing on WCOSS for this PR has failed and removed CI-Wcoss2-Running **Bot use only** CI testing on WCOSS for this PR is in-progress labels Aug 12, 2024
@emcbot

emcbot commented Aug 12, 2024

Experiment C96_atm3DVar_extended_d443bf9c STALLED on Wcoss2 at 08/12/24 02:30:21 PM

@TerrenceMcGuinness-NOAA
Collaborator

TerrenceMcGuinness-NOAA commented Aug 12, 2024

Confirmed: the STALLED condition on WCOSS2 is a false negative. No UNAVAILABLE states were observed, only UNKNOWNs. Added tasking to make the STALLED flag more robust.

@TerrenceMcGuinness-NOAA
Collaborator

Got past the PBS/Rocoto anomalies that were leading to false negatives for STALLED because of UNKNOWN/UNAVAILABLE states, and arrived at gsi.x failures. Starting CI one more time to capture results.

@TerrenceMcGuinness-NOAA TerrenceMcGuinness-NOAA added CI-Wcoss2-Running **Bot use only** CI testing on WCOSS for this PR is in-progress and removed CI-Wcoss2-Failed **Bot use only** CI testing on WCOSS for this PR has failed labels Aug 12, 2024
@emcbot emcbot added CI-Wcoss2-Failed **Bot use only** CI testing on WCOSS for this PR has failed and removed CI-Wcoss2-Running **Bot use only** CI testing on WCOSS for this PR is in-progress labels Aug 12, 2024
@emcbot

emcbot commented Aug 12, 2024

Experiment C48_S2SW_d443bf9c FAIL on Wcoss2 at 08/12/24 03:54:24 PM

Error logs:

/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2755/RUNTESTS/COMROOT/C48_S2SW_d443bf9c/logs/2021032312/gfsfcst.log

Follow link here to view the contents of the above file(s): (link)

@KateFriedman-NOAA
Member

@EricSinsky-NOAA FYI, I have a PR in CI testing that refactors the staging job: #2651

I ended up removing the DTG_PREFIX that is built using cycle and OFFSET_START_HOUR and replaced it with model_start_date_current_cycle and associated logic to set that variable. I think my PR may conflict with yours if it goes in first. If you have a moment, please review the staging job PR and let me know if we'll need to add the usage of OFFSET_START_HOUR back in and update the REPLAY_ICS blocks in the new yaml (see links below). I was not able to test with REPLAY_ICS=YES for my branch testing.

https://github.com/KateFriedman-NOAA/global-workflow/blob/feature/issue_2475/parm/stage/stage.yaml.j2#L104
https://github.com/KateFriedman-NOAA/global-workflow/blob/feature/issue_2475/parm/stage/stage.yaml.j2#L166

If my PR goes in first, I can work with you to make any needed updates in your branch to accommodate the staging job refactor. Let me know!

@emcbot emcbot added CI-Hera-Passed **Bot use only** CI testing on Hera for this PR has completed successfully and removed CI-Hera-Running **Bot use only** CI testing on Hera for this PR is in-progress labels Aug 12, 2024
@emcbot

emcbot commented Aug 12, 2024

CI Passed Hera at
Built and ran in directory /scratch1/NCEPDEV/global/CI/2755


Experiment C48_ATM_57107f48 Completed 1 Cycles: *SUCCESS* at Thu Aug  8 02:19:58 UTC 2024
Experiment C48mx500_3DVarAOWCDA_57107f48 Completed 2 Cycles: *SUCCESS* at Thu Aug  8 02:27:37 UTC 2024
Experiment C96_atm3DVar_57107f48 Completed 3 Cycles: *SUCCESS* at Thu Aug  8 03:41:58 UTC 2024
Experiment C48_S2SWA_gefs_57107f48 Completed 1 Cycles: *SUCCESS* at Thu Aug  8 03:48:04 UTC 2024
Experiment C48_S2SW_57107f48 Completed 1 Cycles: *SUCCESS* at Thu Aug  8 04:06:26 UTC 2024
Experiment C96C48_hybatmDA_57107f48 Completed 3 Cycles: *SUCCESS* at Mon Aug 12 17:14:12 UTC 2024
Experiment C96_atmaerosnowDA_57107f48 Completed 3 Cycles: *SUCCESS* at Mon Aug 12 18:57:35 UTC 2024

@WalterKolczynski-NOAA WalterKolczynski-NOAA removed the CI-Wcoss2-Failed **Bot use only** CI testing on WCOSS for this PR has failed label Aug 12, 2024
@WalterKolczynski-NOAA
Contributor

Since (a) the previous failures on WCOSS were unrelated to this PR and (b) every change is covered by tests run on other machines, I'm going to go ahead and merge this.

FYI: @KateFriedman-NOAA

@WalterKolczynski-NOAA WalterKolczynski-NOAA merged commit 5699167 into NOAA-EMC:develop Aug 13, 2024
5 checks passed
DavidHuber-NOAA added a commit to DavidHuber-NOAA/global-workflow that referenced this pull request Aug 13, 2024
…e_rocoto

* origin/develop:
  Jenkins Pipeline Updates (NOAA-EMC#2815)
  Add Gaea C5 to CI (NOAA-EMC#2814)
  Add support for forecast-only runs on AWS (NOAA-EMC#2711)
  Add fixes to products for when REPLAY IC's are used  (NOAA-EMC#2755)
  Add capability to run forecast in segments (NOAA-EMC#2795)
@EricSinsky-NOAA EricSinsky-NOAA deleted the feature/offsetfixes branch August 14, 2024 12:27
@EricSinsky-NOAA
Contributor Author

> @EricSinsky-NOAA FYI, I have a PR in CI testing that refactors the staging job: #2651 […]

@KateFriedman-NOAA Thank you for letting us know about these conflicts with the replay-related variables. We can continue working to resolve these conflicts in @NeilBarton-NOAA's PR #2788.
