irreproducable results in variable resolution #631

Closed
jedwards4b opened this issue Aug 3, 2022 · 96 comments

@jedwards4b

What happened?

We are seeing intermittent failures of the compset FHIST at resolution ne0np4.NATL.ne30x8_t13

I tried twice at NTASKS=3600: one run failed on startup and one ran successfully.
Isla tried at NTASKS=3600 and had two successful runs and one failure.

Isla tried at NTASKS=5400 and had a similar failure. I tried at that resolution and had a successful run.

All this to say that I suspect there may be a race condition type problem here and it seems that this compset should be tested more.

What are the steps to reproduce the bug?

./create_newcase --compset FHIST --res ne0np4.NATL.ne30x8_t13 --case $CASENAME --mach cheyenne --run-unsupported
cd $CASENAME
./xmlchange NTASKS=3600
./case.setup
./case.build
./case.submit

(maybe that'll work, maybe it won't)

What CAM tag were you using?

cam6_3_052 (cesm2_3_beta08)

What machine were you running CAM on?

CISL machine (e.g. cheyenne)

What compiler were you using?

Intel

Path to a case directory, if applicable

/glade/scratch/jedwards/testRR_jul2022.001

Will you be addressing this bug yourself?

No

Extra info

No response

@jedwards4b jedwards4b added the bug Something isn't working correctly label Aug 3, 2022
@adamrher

adamrher commented Aug 3, 2022

@islasimpson

@gold2718
Collaborator

gold2718 commented Aug 4, 2022

This is not a supported grid (that I know of). What is this grid?

@adamrher

adamrher commented Aug 4, 2022

it's not a supported grid. it's an experimental grid / cutting edge science

@PeterHjortLauritzen
Collaborator

PeterHjortLauritzen commented Aug 4, 2022

Do you have a log file (atm) from one of the runs we can look at? (are there no changes to the namelist?) Thanks

@jedwards4b
Author

/glade/scratch/jedwards/testRR_jul2022.001/run/cesm.log.5286452.chadmin1.ib0.cheyenne.ucar.edu.220802-080311
run/cesm.log.5286452.chadmin1.ib0.cheyenne.ucar.edu.220802-080311:74:
run/cesm.log.5286452.chadmin1.ib0.cheyenne.ucar.edu.220802-080311:74:SHR_REPROSUM_CALC: Input contains 0.00000E+00 NaNs and 0.40000E+01 INFs on process 74
run/cesm.log.5286452.chadmin1.ib0.cheyenne.ucar.edu.220802-080311:74: ERROR: shr_reprosum_calc ERROR: NaNs or INFs in input
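For anyone triaging similar logs, here is a minimal Python sketch that pulls the NaN/INF counts out of lines like the one above. The regular expression is inferred only from the single message quoted here, so the pattern and the function name `reprosum_errors` are assumptions, not a stable log format.

```python
import re

# Pattern inferred from the one SHR_REPROSUM_CALC line quoted in this thread;
# other CESM versions may word the message differently.
_REPROSUM = re.compile(
    r"SHR_REPROSUM_CALC: Input contains\s+(\S+)\s+NaNs and\s+(\S+)\s+INFs "
    r"on process\s+(\d+)"
)

def reprosum_errors(log_text):
    """Return (process, n_nans, n_infs) for each matching log line."""
    hits = []
    for line in log_text.splitlines():
        m = _REPROSUM.search(line)
        if m:
            # Counts are printed as Fortran reals (e.g. 0.40000E+01),
            # so convert through float before truncating to int.
            n_nans = int(float(m.group(1)))
            n_infs = int(float(m.group(2)))
            hits.append((int(m.group(3)), n_nans, n_infs))
    return hits

sample = (
    "74:SHR_REPROSUM_CALC: Input contains 0.00000E+00 NaNs "
    "and 0.40000E+01 INFs on process 74"
)
print(reprosum_errors(sample))  # [(74, 0, 4)]
```

Running this over the logs of a passing and a failing run of the same case would show whether the affected process changes from run to run.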

@PeterHjortLauritzen
Collaborator

PeterHjortLauritzen commented Aug 4, 2022

Unsupported variable resolution setups are not stable out-of-the-box. You can see that if you search for "dt" in the atm.log file where theoretical estimates for stable time-steps are given. Hence we need to set the se_*split variables.

@jedwards4b
Author

I would expect that in this case it would fail every time. But it doesn't. In that same directory you will see a run that succeeds at the same ntasks count with no changes in the model or model inputs.

@adamrher

adamrher commented Aug 4, 2022

@PeterHjortLauritzen I don't think this is a stability issue because for some tasks it runs (in fact, Isla has this case running over the last few days).

Yes, this is an unsupported grid. But I think this issue needs to be looked at because it may be a system issue that impacts all variable-resolution configurations.

@islasimpson

In case it's useful, this is my case which is currently running...

casedir:/glade/work/islas/cesm2_3_beta08/runs/f.e23.FAMIPfosi.ne0np4.NATL.ne30x8_t13.001
rundir:/glade/scratch/islas/f.e23.FAMIPfosi.ne0np4.NATL.ne30x8_t13.001/run

@PeterHjortLauritzen
Collaborator

Oh OK ... (I would still recommend decreasing the dynamics and tracer time-steps by increasing se_rsplit; you are the experts here, but I would expect a model that may be unstable and somehow manages to keep running to do weird things)

@islasimpson

se_rsplit is currently set to 3. What would you recommend we go to? I assume decreasing the dynamics time-step is going to make the model run a lot slower? Robb has been running experiments with this grid for a while and I don't think anything too peculiar happened.

@adamrher

adamrher commented Aug 4, 2022

(Peter - w/ var-res I try to run w/ the same dt's as in an equivalent global uniform res run. the atm.log dt metrics are never happy with my approach, but so far this has yielded stable runs for everyone I've advised on var-res time-steps.)

Let's not get distracted from the main issue!

@PeterHjortLauritzen
Collaborator

OK. Apologies for derailing the detective work ...

@islasimpson

So, just to clarify, there is no need to change the se_rsplit? I'm restarting anyway because there was an output issue...

@adamrher

adamrher commented Aug 4, 2022

I would not recommend changing se_rsplit, or any of the time-stepping. Robb and I have tested these settings extensively.

@islasimpson

Ok, sounds good.

@cacraigucar cacraigucar moved this to To Do in CAM Development Aug 8, 2022
@cacraigucar cacraigucar added this to the CAM6.5 milestone Aug 9, 2022
@islasimpson

Here is one of my cases that has failed /glade/work/islas/cesm2_3_beta08/runs/testRR_jul2022.001 although I think this is identical to the one that Jim posted above.

@brianpm
Collaborator

brianpm commented Aug 9, 2022

I have seen an issue that might be the same. I've been using the same tag as @islasimpson, but with a different grid (refined tropical belt). The run was crashing on SHR_REPROSUM_CALC just like above. The "solution" seemed to be to start from analytic initial conditions, which allowed the run to get started and completed my 1-day test.

Here is the case directory: /glade/work/brianpm/my_cases/test_cases/c2p3b8.f2000climo.trbelta.001

In the current state, this case is using the analytic ic.

This is the same grid that @jtruesdal has been testing, and might have seen the same issue.

@jedwards4b
Author

@brianpm - was your failure without the analytic initial condition repeatable or intermittent?

@brianpm
Collaborator

brianpm commented Aug 9, 2022

I don't know. With analytic initial conditions the run successfully started. With initial conditions derived from regridding with Patrick's VR tools, I was seeing a failure, but I don't know if it was actually repeatable. I saw it on several attempts, as I was trying to work through the case and get it running (with input from @adamrher).

@adamrher

adamrher commented Aug 9, 2022

@brianpm - was your failure without the analytic initial condition repeatable or intermittent?

I'm fairly confident that Brian's issue with the TRBLT grid is repeatable. I think it was an unstable initial condition, that was resolved by running w/ analytic initial conditions. So my guess is it is not related to this issue, which is characterized by intermittent failures for an identical set of settings.

@jtruesdal mentioned that he may have gotten intermittent errors with the TRBLT grid, though.

But so far only Isla's NATL grid can reproduce this result.

@patcal suggested he may have had a similar issue with various var-res configurations.

@jtruesdal
Collaborator

My TRBELT case is /glade/p/cgd/amp/jet/cases/F2000climo.ne0np4.trbelta.ne30x8_g17.intel.1080pes.chey.nuopc.cesm23alpha09d.001.dbg

My errors look to be a bad read of the initial conditions file. Right after calling read_inidat and doing a boundary exchange, the prognostic fields contain some bad values. For SE the min/max of the initial state is printed and shows the bad values. This is from the atm.log:

 STATE DIAGNOSTICS

                                MIN                    MAX              AVE (hPa)      REL. MASS. CHANGE
  U          -0.932687431112143+170  0.125020681873722E+03
  V          -0.310895810370714+170  0.132994696270308E+03
  T           0.139085716626218E+03  0.306223262802823E+03
  OMEGA      -0.646747652113979+170  0.674889617481827+170
  OMEGA CN    0.000000000000000E+00  0.000000000000000E+00

I have debug print in the cesm.log file showing the locations of the bad values.

@adamrher

@jtruesdal this looks an awful lot like the errors @renerwijn was getting in our new dual-polar var-res grid. The first print out of these stats, during the initialization phase nstep=0:

 nstep=           0  time=  0.000000000000000E+000  [day]

 STATE DIAGNOSTICS

                                MIN                    MAX              AVE (hPa)      REL. MASS. CHANGE
  U          -0.204248117308741+233  0.160408366612220+281
  V          -0.204248117308741+233  0.160408366612220+281
  T          -0.932687431112143+170  0.308818178603708E+03
  OMEGA      -0.452352825352631+280  0.298789765442738+281

When I first saw this I was like, what could possibly be causing the state to go berserk? These don't resemble the values in the ncdata file. However, Rene can correct me, but the ncdata file turned out to be the problem ... or at least, it motivated us to run the US standard atmosphere analytical inic for 4 weeks and spit out a new cam.i file. That cam.i file ended up being stable and not giving the egregious winds at nstep=0. So that anecdote makes me wonder whether it's just an unstable inic that yields this crazy state at nstep=0? The dycore had to have done something to the state at this point, because the ncdata file is on the dynamics grid, right? Is it doing more than just reading in the data at nstep=0?

@jtruesdal
Collaborator

@adamrher This is printed out after the initial file is read and before dynamics runs. There is some initialization of derived quantities and mucking with edge buffers but I don't think the state is modified before the print. I will try the analytic init as suggested and create a new initial condition. I guess there could still be some corruption or incompatibility in the NetCDF initial file I'm using.

@jtruesdal
Collaborator

@adamrher The analytic IC worked, as did a restart from that run. Unfortunately, using the IC produced by the analytic run exhibited the same behavior as before, sometimes working but most of the time reading an assortment of bad values. The failures show up under the STATE DIAGNOSTICS print in the atm log file and are garbage values, not NaNs or INFs. Jim's test also has a bad state. The fields are read via infld and the errors seem to be confined to the 3d fields. The garbage values are intermingled with reads of good values on numerous processors. Maybe @gold2718 was right when thinking that the variable resolution data is exposing an issue in infld.

@adamrher

Jim's test also has a bad state.

indeed, from Jim's intermittent failure run at the top of this thread:

 nstep=           0  time=  0.000000000000000E+000  [day]

 STATE DIAGNOSTICS

                                MIN                    MAX              AVE (hPa)      REL. MASS. CHANGE
  U          -0.406066707886934+302  0.245322431892797+301
  V          -0.112625708802609E+03  0.829671412241384E+02
  T           0.177824544971257E+03  0.306593499694223E+03
  OMEGA      -0.517987814127956+290  0.401590425697583+290

So we have three separate var-res configurations that are able to reproduce this error. At least we're converging on the issue...
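Since these bad values are finite garbage rather than NaN or INF, a NaN/INF check will not flag them at nstep=0. Below is a minimal Python sketch of scanning a STATE DIAGNOSTICS block for implausible magnitudes. It relies on the fact that Fortran drops the `E` when an exponent needs three digits (hence tokens like `-0.406066707886934+302`). The function names and the plausibility limit of 1.0e4 are assumptions made for illustration, not anything CAM itself uses.

```python
import re

# A real number as it appears in the atm.log excerpts: either a normal
# exponent (E+03) or a three-digit exponent with the letter omitted (+302).
_REAL = re.compile(r"[+-]?\d+\.\d+(?:[EeDd][+-]?\d+|[+-]\d{3})")

def parse_fortran_real(token):
    """Convert a Fortran-formatted real, with or without the exponent letter."""
    token = token.strip()
    m = re.fullmatch(r"([+-]?\d+\.\d+)([+-]\d{3})", token)
    if m:
        # Re-insert the missing 'E' before the three-digit exponent.
        return float(m.group(1) + "E" + m.group(2))
    return float(token.replace("D", "E").replace("d", "e"))

def suspect_fields(diagnostics, limit=1.0e4):
    """Names of fields whose MIN or MAX magnitude exceeds `limit`.

    `limit` is a hypothetical sanity bound; winds (m/s), temperature (K)
    and omega (Pa/s) should all sit far below it.
    """
    suspects = []
    for line in diagnostics.splitlines():
        matches = list(_REAL.finditer(line))
        if len(matches) < 2:
            continue  # header, blank, or nstep line
        name = line[: matches[0].start()].strip()
        values = [parse_fortran_real(m.group(0)) for m in matches[:2]]
        if name and any(abs(v) > limit for v in values):
            suspects.append(name)
    return suspects

sample = """
  U          -0.406066707886934+302  0.245322431892797+301
  V          -0.112625708802609E+03  0.829671412241384E+02
  T           0.177824544971257E+03  0.306593499694223E+03
  OMEGA      -0.517987814127956+290  0.401590425697583+290
"""
print(suspect_fields(sample))  # ['U', 'OMEGA']
```

A check like this right after the initial file is read would separate "bad read" failures from genuine instabilities that only develop once the dynamics runs.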

@jtruesdal
Collaborator

I'll test my case this afternoon and report back.

@cacraigucar
Collaborator

@adamrher - Can you give us the information that we need for the pio and ccs_config changes? What tags did you use for these, so I can add them to PR #659? Those are the only changes required to fix this, correct?

@cacraigucar
Collaborator

@adamrher - Also, if you can give us the details on the three tests you want to add to the regression tests, we can work on including them. We can work out the details offline if needed.

@adamrher

@cacraigucar regarding the three tests, these seem reasonable to me:

<test compset="FHIST" grid="ne0ARCTICne30x4_ne0ARCTICne30x4_mt12" name="ERP_Ln9_Vnuopc" testmods="cam/outfrq9s">
<test compset="FHIST" grid="ne0ARCTICGRISne30x8_ne0ARCTICGRISne30x8_mt12" name="ERP_Ln9_Vnuopc" testmods="cam/outfrq9s">
<test compset="FHIST" grid="ne0CONUSne30x8_ne0CONUSne30x8_mt12" name="ERP_Ln9_Vnuopc" testmods="cam/outfrq9s">

I would set the walltime to 30 minutes, to make sure it will still run when we double our vertical resolution in FHIST runs to L58. I defer to the se's on whether ERP is the best test (if we only get to choose one) ... I'm just more familiar with it.

The only var-res tests we have now are FW (WACCM) tests, so I think it will be good to have these less complex, but arguably more common, compsets working for all three grids. I will note that currently CONUS will not run out of the box w/ FHIST because at least one emission file does not have year 1979 data in it (I suspect this is because the ACOM folks like to run CONUS with short nudged runs in a more recent year, and didn't bother to make the emissions work for 1979).

Note that for arctic and greenland var-res grids, the emissions files are not on the native grids, which means they are interpolated on the fly from probably f09 files. ACOM likes to have emissions on the native grids for hi-res (I'm less picky). So I think to resolve this issue we should just ask ACOM to extend their CONUS emissions files to include year 1979.

@adamrher

[people are asking me to explicitly state the pio and ccs_config versions needed for this fix: update externals to pio2_5_9 and the current head of ccs_config (ccs_config_cesm0.0.44, I believe)]

@jtruesdal
Collaborator

The TRBELT tests worked. I updated pio and the ccs_config manually and have finished a few runs to completion. I also verified a restart run using the global integrals from the log. Everything completes and matches. This looks good from my end.

@cacraigucar
Collaborator

cacraigucar commented Oct 3, 2022

Regression tests on cheyenne indicate baseline answer changes (which are not expected). @jedwards4b has the following summary:

I can confirm that there is an answer change when I use the new tags.
If I try to update just ESMF I get a runtime failure in the test.
If I try to update just pio all tests pass.

I'm still looking for something in between.

@jedwards4b
Author

Updating to esmf-8.3.0-ncdfio-mpt-O also causes an answer change.

@jedwards4b
Author

Updating to esmf-8.3.0b13-ncdfio-mpt-O also fails baseline compare.

@jedwards4b
Author

Using esmf-8.3.0b07 passes baseline (also using pio2.5.9)

@adamrher

adamrher commented Oct 4, 2022

Should we reach out to the ESMF team to ask if there are any expected answer changes? If I recall from your test failures they were all either cam-se or mpas -- all unstructured grids. I recall maybe Bob Oehmke saying that a fix was made to mapping algo for unstructured grids a while back and to switch to a more recent library. Or maybe it was something else ...

@jedwards4b
Author

@adamrher Yes I am working with the ESMF team.

@cacraigucar
Collaborator

To document this here, answer changes were seen in the following CAM regression tests:
ERP_Ln9_Vnuopc.ne30_ne30_mg17.FCnudged.cheyenne_intel.cam-outfrq9s (Overall: DIFF) details:
ERS_Ln9_P288x1_Vnuopc.mpasa120_mpasa120.F2000climo.cheyenne_intel.cam-outfrq9s_mpasa120 (Overall: DIFF) details:
ERS_Ln9_P36x1_Vnuopc.mpasa480_mpasa480.F2000climo.cheyenne_intel.cam-outfrq9s_mpasa480 (Overall: DIFF) details:
SMS_D_Ln9_Vnuopc.ne0CONUSne30x8_ne0CONUSne30x8_mt12.FCHIST.cheyenne_intel.cam-outfrq9s_refined_camchem (Overall: DIFF) details:
SMS_D_Ln9_Vnuopc.ne16_ne16_mg17.FX2000.cheyenne_intel.cam-outfrq9s (Overall: DIFF) details:

The ESMF team is working to identify the cause of the answer changes
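As a quick cross-check of the earlier recollection that the failures were all cam-se or mpas, here is a minimal Python sketch that splits the five test names listed above into their dot-separated parts and classifies the atmosphere grid by its prefix. The helper names and the prefix rule ("ne..." for SE, "mpasa..." for MPAS) are assumptions made for illustration; this is not how CIME itself parses test names.

```python
# The five tests listed above, without the "(Overall: DIFF) details:" suffix.
DIFF_TESTS = [
    "ERP_Ln9_Vnuopc.ne30_ne30_mg17.FCnudged.cheyenne_intel.cam-outfrq9s",
    "ERS_Ln9_P288x1_Vnuopc.mpasa120_mpasa120.F2000climo.cheyenne_intel.cam-outfrq9s_mpasa120",
    "ERS_Ln9_P36x1_Vnuopc.mpasa480_mpasa480.F2000climo.cheyenne_intel.cam-outfrq9s_mpasa480",
    "SMS_D_Ln9_Vnuopc.ne0CONUSne30x8_ne0CONUSne30x8_mt12.FCHIST.cheyenne_intel.cam-outfrq9s_refined_camchem",
    "SMS_D_Ln9_Vnuopc.ne16_ne16_mg17.FX2000.cheyenne_intel.cam-outfrq9s",
]

def split_test_name(name):
    """Split 'TEST.GRID.COMPSET.MACHINE_COMPILER.TESTMODS' into a dict."""
    testcase, grid, compset, mach_comp, testmods = name.split(".")
    machine, compiler = mach_comp.split("_")
    return {
        "testcase": testcase,
        "grid": grid,
        "compset": compset,
        "machine": machine,
        "compiler": compiler,
        "testmods": testmods,
    }

def dycore_family(grid):
    """Guess the dycore from the atmosphere part of the grid alias."""
    atm = grid.split("_")[0]
    if atm.startswith("mpasa"):
        return "MPAS"
    if atm.startswith("ne"):
        return "SE"
    return "other"

families = [dycore_family(split_test_name(t)["grid"]) for t in DIFF_TESTS]
print(families)  # ['SE', 'MPAS', 'MPAS', 'SE', 'SE']
```

Under this prefix rule, every test in the list lands on an unstructured grid.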

@cacraigucar
Collaborator

@jedwards4b - I have a test which is also flat out failing. After thinking it might be a cheyenne hiccup, it keeps failing in the exact same way.

The bottom of the cesm log file is:

275: imp_sol: time step    1800.000     failed to converge @ (lchnk,vctrpos,nstep) =     2868     102       0
341: imp_sol: time step    1800.000     failed to converge @ (lchnk,vctrpos,nstep) =     3462     100       0
215: imp_sol: time step    1800.000     failed to converge @ (lchnk,vctrpos,nstep) =     2328     102       0
335: imp_sol: time step    1800.000     failed to converge @ (lchnk,vctrpos,nstep) =     3408     102       0
1: Opened file
1: SMS_Ln9_Vmct.f09_f09_mg17.1850_CAM60%WCTS_CLM50%SP_CICE5%PRES_DOCN%DOM_MOSART_S
1: GLC_SWAV.cheyenne_intel.cam-reduced_hist3s.GC.aux_cam_20221005141143.cam.h0.000
1: 1-01-01-00000.nc to write         376
1: Opened file
1: SMS_Ln9_Vmct.f09_f09_mg17.1850_CAM60%WCTS_CLM50%SP_CICE5%PRES_DOCN%DOM_MOSART_S
1: GLC_SWAV.cheyenne_intel.cam-reduced_hist3s.GC.aux_cam_20221005141143.cam.h7.000
1: 1-01-01-00000.nc to write         377
234:MPT ERROR: Assertion failed at reg_cache.c:302: "i == num_used"
234:MPT ERROR: Rank 234(g:234) is aborting with error code 0.
234:    Process ID: 20354, Host: r6i5n27, Program: /glade/scratch/cacraig/aux_cam_20221005141143/SMS_Ln9_Vmct.f09_f09_mg17.1850_CAM60%WCTS_CLM50(null)P_CICE5%PRES_DOCN%DOM_MOSART_SGLC_SWAV.cheyenne_intel.cam-reduced_hist3s.GC.aux_cam_20221005141143/bld/cesm.exe

The latest job can be seen at:
/glade/scratch/cacraig/aux_cam_20221005141143/SMS_Ln9_Vmct.f09_f09_mg17.1850_CAM60%WCTS_CLM50%SP_CICE5%PRES_DOCN%DOM_MOSART_SGLC_SWAV.cheyenne_intel.cam-reduced_hist3s.GC.aux_cam_20221005141143

Note that the ONLY changes are the ccs_config and pio external updates to ccs_config_cesm0.0.45 and pio2_5_9

@fischer-ncar - have you encountered this as well?

@fischer-ncar
Collaborator

Nope, I haven't seen this error. I'll try to reproduce your error with my latest alpha10a sandbox.

@jedwards4b
Author

@cacraigucar I'm pretty sure that the problem here is your pelayout of 384x3 since 384 is not an even multiple of 36. Change the test to 360x3 and try again.

@fvitt

fvitt commented Oct 6, 2022

@cacraigucar I'm pretty sure that the problem here is your pelayout of 384x3 since 384 is not an even multiple of 36. Change the test to 360x3 and try again.

The 384x3 pelayout places 12 MPI tasks on each cheyenne node. This spreads evenly across 32 nodes.
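The arithmetic behind that statement, as a small Python sketch. It assumes 36 cores per cheyenne node (the "36" in the quoted comment) and reads "384x3" as 384 MPI tasks with 3 threads each; the function name is made up for illustration.

```python
def node_layout(ntasks, nthreads, cores_per_node=36):
    """Return (tasks_per_node, full_nodes, leftover_tasks) for a pe-layout."""
    # Each MPI task occupies `nthreads` cores, so this many tasks fit per node.
    tasks_per_node = cores_per_node // nthreads
    full_nodes, leftover_tasks = divmod(ntasks, tasks_per_node)
    return tasks_per_node, full_nodes, leftover_tasks

print(node_layout(384, 3))  # (12, 32, 0)
print(node_layout(360, 3))  # (12, 30, 0)
```

Both layouts fill their nodes exactly once threads are counted. 384 is not a multiple of 36, but that only matters for a layout with one thread per task.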

@fischer-ncar
Collaborator

I was able to get this test to pass using the current alpha10a sandbox, with the 384x3 layout. Compared with what you're using, the alpha10a sandbox has updates to cdeps, cmeps, cice6, ctsm, cime, cpl7, and share.

@jedwards4b
Author

I also had no problem running this test with the original pe-layout.
cat TestStatus
PASS SMS_Ln9_Vmct.f09_f09_mg17.1850_CAM60%WCTS_CLM50%SP_CICE5%PRES_DOCN%DOM_MOSART_SGLC_SWAV.cheyenne_intel.cam-reduced_hist3s CREATE_NEWCASE
PASS SMS_Ln9_Vmct.f09_f09_mg17.1850_CAM60%WCTS_CLM50%SP_CICE5%PRES_DOCN%DOM_MOSART_SGLC_SWAV.cheyenne_intel.cam-reduced_hist3s XML
PASS SMS_Ln9_Vmct.f09_f09_mg17.1850_CAM60%WCTS_CLM50%SP_CICE5%PRES_DOCN%DOM_MOSART_SGLC_SWAV.cheyenne_intel.cam-reduced_hist3s SETUP
PASS SMS_Ln9_Vmct.f09_f09_mg17.1850_CAM60%WCTS_CLM50%SP_CICE5%PRES_DOCN%DOM_MOSART_SGLC_SWAV.cheyenne_intel.cam-reduced_hist3s SHAREDLIB_BUILD time=390
PASS SMS_Ln9_Vmct.f09_f09_mg17.1850_CAM60%WCTS_CLM50%SP_CICE5%PRES_DOCN%DOM_MOSART_SGLC_SWAV.cheyenne_intel.cam-reduced_hist3s MODEL_BUILD time=416
PASS SMS_Ln9_Vmct.f09_f09_mg17.1850_CAM60%WCTS_CLM50%SP_CICE5%PRES_DOCN%DOM_MOSART_SGLC_SWAV.cheyenne_intel.cam-reduced_hist3s SUBMIT
PASS SMS_Ln9_Vmct.f09_f09_mg17.1850_CAM60%WCTS_CLM50%SP_CICE5%PRES_DOCN%DOM_MOSART_SGLC_SWAV.cheyenne_intel.cam-reduced_hist3s RUN time=199
PASS SMS_Ln9_Vmct.f09_f09_mg17.1850_CAM60%WCTS_CLM50%SP_CICE5%PRES_DOCN%DOM_MOSART_SGLC_SWAV.cheyenne_intel.cam-reduced_hist3s MEMLEAK insuffiencient data for memleak test
PASS SMS_Ln9_Vmct.f09_f09_mg17.1850_CAM60%WCTS_CLM50%SP_CICE5%PRES_DOCN%DOM_MOSART_SGLC_SWAV.cheyenne_intel.cam-reduced_hist3s SHORT_TERM_ARCHIVER
(cheyenne) cheyenne1: /glade/scratch/jedwards/SMS_Ln9_Vmct.f09_f09_mg17.1850_CAM60%WCTS_CLM50%SP_CICE5%PRES_DOCN%DOM_MOSART_SGLC_SWAV.cheyenne_intel.cam-reduced_hist3s.20221005_174926_kl1guc

@cacraigucar
Collaborator

cacraigucar commented Oct 6, 2022

I checked out a new copy of the branch as it is currently stored (to make sure it wasn't corrupted somehow) and ran create_test on it. I still get the same results, so there must be something different between @jedwards4b's setup and mine.

My code base is at:
/glade/u/home/cacraig/cam_fix_irrep_results

The failed test is at:
/glade/scratch/cacraig/SMS_Ln9_Vmct.f09_f09_mg17.1850_CAM60%WCTS_CLM50%SP_CICE5%PRES_DOCN%DOM_MOSART_SGLC_SWAV.cheyenne_intel.cam-reduced_hist3s.20221006_081400_ywi7ei/run

It is also worth reiterating that I am only changing pio and ccs_config externals. This test worked fine in all previous CAM tags

@jedwards4b
Author

jedwards4b commented Oct 6, 2022

I tried again and passed again. I see this difference in our cases:

185c185
me <   model_version = "cam6_3_077"
---
you >   model_version = "cam6_2_021-1988-g178544a2"

Looking at the git log confirms that you are testing an older version of cam.

@cacraigucar
Collaborator

When I went to SRCROOT, I got the following:

cheyenne3$ git diff cam6_3_078 | less
diff --git a/Externals.cfg b/Externals.cfg
index b29291a7..239f2835 100644
--- a/Externals.cfg
+++ b/Externals.cfg
@@ -1,5 +1,5 @@
[ccs_config]
-tag = ccs_config_cesm0.0.28
+tag = ccs_config_cesm0.0.45
protocol = git
repo_url = https://github.com/ESMCI/ccs_config_cesm
local_path = ccs_config
@@ -57,7 +57,7 @@ local_path = libraries/mct
required = True

[parallelio]
-tag = pio2_5_7
+tag = pio2_5_9
protocol = git
repo_url = https://github.com/NCAR/ParallelIO
local_path = libraries/parallelio

Also, manage_externals/checkout_externals --status indicates that it was all clean.
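When comparing two sandboxes like this, it can help to list the external tags directly instead of reading a diff. A minimal Python sketch follows, assuming Externals.cfg is plain INI as the diff above suggests. The snippet contains only the lines visible in that diff (with the new tags), and `external_tags` is a hypothetical helper, not part of manage_externals.

```python
import configparser

# Only the two sections visible in the diff above, with the updated tags.
EXTERNALS_SNIPPET = """\
[ccs_config]
tag = ccs_config_cesm0.0.45
protocol = git
repo_url = https://github.com/ESMCI/ccs_config_cesm
local_path = ccs_config

[parallelio]
tag = pio2_5_9
protocol = git
repo_url = https://github.com/NCAR/ParallelIO
local_path = libraries/parallelio
"""

def external_tags(cfg_text):
    """Map each external's section name to its 'tag' entry, if it has one."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return {
        name: parser[name]["tag"]
        for name in parser.sections()
        if parser.has_option(name, "tag")
    }

print(external_tags(EXTERNALS_SNIPPET))
# {'ccs_config': 'ccs_config_cesm0.0.45', 'parallelio': 'pio2_5_9'}
```

Running the same helper on the Externals.cfg of each sandbox and comparing the dictionaries shows at a glance which externals differ.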

Which git log is the one saying I'm using an older version of cam? (i.e. what directory were you in when you executed the command)?

@jedwards4b
Author

Top level - cam itself.
Looking closer at this I am using an older tag than you are, mine is in hash 0764b57

you are in hash
178544a

which is 6 commits ahead.

@cacraigucar
Collaborator

cacraigucar commented Oct 7, 2022

The answer changes that we are seeing with these changes are due to ESMF. Here is Mariana's explanation of what is causing the differences from a separate email exchange:

Hi Bob and Cheryl,

I've looked at the diffs closely with Jim and the problem boils down to roundoff level changes in ESMF first order conservative mapping that change the computed land fraction and ocean fraction used when merging to the atm. 

As I mentioned above, what the land, data ocean, and cice (in prescribed mode) are doing is taking an input mask file
   mesh_mask = /glade/p/cesmdata/cseg/inputdata/share/meshes/gx1v7_151008_ESMFmesh.nc
and mapping the mask on this file to the atm grid using first order conservative mapping:
   mesh_atm = /glade/p/cesmdata/cseg/inputdata/share/meshes/ne30np4_ESMFmesh_cdf5_c20211018.nc
Note that the atm, lnd, ice and ocn are all on the mesh_atm grid.
If there are any differences in this mapping then the model will no longer be bfb.

If you look at the file:
/glade/scratch/jedwards/SMS_D_Ln3_P360.ne30_ne30_mg17.FCnudged.cheyenne_intel.cam-outfrq3s.C.20221005_125112_6h9p4v/cpl.diffs - you can see the following differences in the land fraction sent to the atm

atmExp_Sl_lfrac   (atmExp_nx,atmExp_ny,time)  t_index =      1     1
        107    48602  (   838,     1,     1) (     1,     1,     1) (  5097,     1,     1) (  5097,     1,     1)
               48602   1.000000000000000E+00   0.000000000000000E+00 3.1E-14  9.934056592391882E-03 2.0E-16  9.934056592391882E-03
               48602   1.000000000000000E+00   0.000000000000000E+00          9.934056592361351E-03          9.934056592361351E-03
               48602  (   838,     1,     1) (     1,     1,     1)
          avg abs field values:    2.921258617016634E-01    rms diff: 1.6E-16   avg rel diff(npos):  2.0E-16
                                   2.921258617016634E-01                        avg decimal digits(ndif): 14.5 worst: 11.5
 RMS atmExp_Sl_lfrac                  1.6118E-16            NORMALIZED  5.5175E-16

You can see that these are simply roundoff-level changes, but they will have a strong impact on the solution.

Frankly, I'm not concerned about this. Any order of operation change in ESMF regridding or mesh storage would result in this type of change. Bob can confirm my assumption.
Does this make sense to everyone?

Mariana

Based on this information @adamrher, @cacraigucar and Robert Oehmke have all signed off on the differences
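To put a number on "roundoff level", the sketch below (Python, standard library only) takes the two land-fraction values printed for the worst point in the cpl.diffs excerpt and computes how many decimal digits they share. It also shows a generic case of summation order changing the last bits of a result, which is the kind of effect the email attributes to an order-of-operations change. The generic case is only an illustration and is not taken from ESMF.

```python
import math

# The two values printed for the worst point of atmExp_Sl_lfrac above.
a = 9.934056592391882e-03
b = 9.934056592361351e-03

abs_diff = abs(a - b)
rel_diff = abs_diff / max(abs(a), abs(b))
digits = -math.log10(rel_diff)  # decimal digits on which the values agree

print(f"abs diff = {abs_diff:.1e}")  # abs diff = 3.1e-14
print(f"rel diff = {rel_diff:.1e}")  # rel diff = 3.1e-12
print(f"digits   = {digits:.1f}")    # digits   = 11.5

# Floating-point addition is not associative, so regrouping a sum can
# change the result in the last bits even with identical inputs.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False
```

The 11.5 digits computed here match the "worst: 11.5" figure in the excerpt.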

@cacraigucar cacraigucar assigned cacraigucar and unassigned gold2718 Oct 7, 2022
Repository owner moved this from To Do to Done in CAM Development Oct 14, 2022