
cam6_4_050: clubb_intr GPUization #1175

Open: wants to merge 15 commits into base cam_development
Conversation

huebleruwm

This only modifies clubb_intr.F90 and doesn't require a new version of clubb. The purpose of this is the addition of acc directives, added in order to offload computations to GPUs. Besides the directives, this mainly consists of replacing vector notation with explicit loops, combining loops with the same bounds where possible, and moving non-GPUized function calls outside of the GPU section. I also added some new notation for the number of vertical levels (nzm_clubb and nzt_clubb) that improves readability and will make it easier to merge in with future versions of clubb. I also included some timing statements, similar to the ones added in the Earthworks ew-develop branch, with which this version of clubb_intr is also compatible.
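As a rough illustration of the loop rewrite described above (the array names here are placeholders, not the actual clubb_intr.F90 code), vector-notation assignments with identical bounds can be fused into a single explicitly indexed loop and annotated for offload:

```fortran
! Before: whole-array (vector) notation, one statement per array
!   rtm(1:ncol,1:nzt_clubb)  = rtm_in(1:ncol,1:nzt_clubb)
!   thlm(1:ncol,1:nzt_clubb) = thlm_in(1:ncol,1:nzt_clubb)

! After: one fused, explicitly indexed loop that OpenACC can offload
!$acc parallel loop collapse(2) default(present)
do k = 1, nzt_clubb
   do i = 1, ncol
      rtm(i,k)  = rtm_in(i,k)
      thlm(i,k) = thlm_in(i,k)
   end do
end do
```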

This is BFB on CPUs (tested with Intel), runs with Intel plus debugging flags, and passes the ECT test when comparing CPU results to GPU results (using cam7). There are some options that I didn't GPUize or test (do_clubb_mf, do_rainturb, do_cldcool, clubb_do_icesuper, single_column), so I left the code for them untouched and added some checks to stop the run if they're set when the code is compiled with OpenACC.

If something does end up being wrong with these changes, then this version, an earlier commit that contains only a new OpenACC data statement and some timer additions, would be nice to get in at least.

@Katetc Katetc self-requested a review October 21, 2024 21:43
@cacraigucar cacraigucar requested a review from nusbaume October 22, 2024 15:10
Collaborator

@nusbaume nusbaume left a comment

Looks good to me! I had some questions and change requests but none of them are required, and of course if you have any concerns with any of my requests then just let me know. Thanks!

! to zero.
fcor(:) = 0._r8
! Set the ztodt timestep in pbuf for SILHS
ztodtptr(:) = 1.0_r8*hdtime
Collaborator
Maybe this is a question for @Katetc, but why multiply by one instead of just using hdtime directly?

Author

@huebleruwm huebleruwm Oct 28, 2024

I have no idea why this is, but I was wondering the same thing. The comment implies it's only for SILHS though, which has a dubious future in cam, so I left it untouched.

Collaborator

I very vaguely remember some kind of PGI compiler issue with pointers and some PBuf variables like 10 years ago... and this was some kind of tacky patch. I'm sure you could remove this multiplier without an issue now.

Comment on lines +2863 to +2864
!$acc state1, state1%q, state1%u, state1%v, state1%t, state1%pmid, state1%s, state1%pint, &
!$acc state1%zm, state1%zi, state1%pdeldry, state1%pdel, state1%omega, state1%phis, &
Collaborator

@nusbaume nusbaume Oct 24, 2024

Just curious as to why you have to bring in state1 in addition to the relevant state1 variables (e.g. state1%q)?

Author

@huebleruwm huebleruwm Oct 28, 2024

Honestly, I don't know anymore either. I remember at one point having trouble with this and finding that you did have to allocate the derived type itself on the GPU (create/copyin), so I just kept doing it that way, but I discovered recently that it's completely unnecessary. However, I'm not sure whether the lack of such a need is something NVIDIA has implemented, whether it's part of the API standard, or whether the trouble I ran into long ago was just a bug. Common advice from the internet is to copy in the user-defined type itself, but that might be for types that contain scalars or fixed-size arrays, since those values are contained in the memory storing the type variable.

So long story short - just in case we need it.
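For reference, a minimal sketch of the "copy the derived type in itself" pattern being discussed (member list abbreviated; whether the parent copyin is actually required may depend on the compiler):

```fortran
! Sketch: allocate the derived-type variable on the device first, then
! copy in the member arrays; copying in the members attaches their
! device pointers to the device copy of state1. The parent copyin may
! be redundant with recent compilers, but it is harmless.
!$acc enter data copyin( state1 )
!$acc enter data copyin( state1%q, state1%u, state1%v, state1%t )

! ... GPU kernels using state1%q etc. ...

!$acc exit data delete( state1%q, state1%u, state1%v, state1%t )
!$acc exit data delete( state1 )
```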

… in clubb_ini_cam. Reusing inv_exner_clubb(:,pver) to set inv_exner_clubb_surf. Splitting up array initialization loop to slightly improve CPU performance.
@huebleruwm
Author
> Looks good to me! I had some questions and change requests but none of them are required, and of course if you have any concerns with any of my requests then just let me know. Thanks!

Good finds, I took you up on every suggestion.

@huebleruwm
Author

huebleruwm commented Oct 28, 2024

I ran a PFS test to check how the performance changed, and initially found that clubb_tend_cam was ~8% slower with these changes (up to cfd2824), which is unacceptable. I guessed that the biggest performance killer was a massive loop I made to replace a large amount of vector notation, where the code just zeros out a bunch of arrays. That seems to have been the culprit: I split the loop up here, around line 2991, and the result was slightly faster than the original code.
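A sketch of the kind of split described (array names are placeholders): a single loop that zeros many arrays per iteration streams poorly through the CPU cache, while one tight loop per array vectorizes cleanly:

```fortran
! Slower on CPU: one big loop touching many arrays per iteration
do k = 1, nzm_clubb
   do i = 1, ncol
      wp2_init(i,k) = 0._r8
      wp3_init(i,k) = 0._r8
      ! ... many more arrays ...
   end do
end do

! Faster on CPU: one tight, easily vectorized loop per array
do k = 1, nzm_clubb
   do i = 1, ncol
      wp2_init(i,k) = 0._r8
   end do
end do
do k = 1, nzm_clubb
   do i = 1, ncol
      wp3_init(i,k) = 0._r8
   end do
end do
```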

Here's the timing output comparison now. I ran these a couple times and got roughly the same results.

From the cam_dev head I started with (cam6_4_038)

Region                                         PETs   PEs    Count    Mean (s)    Min (s)     Min PET Max (s)     Max PET
macrop_tend                                    512    512    MULTIPLE 135.3464    124.0819    366     146.0524    42     
  clubb_tend_cam                               512    512    MULTIPLE 132.0139    121.0235    366     142.8016    166    
    clubb_tend_cam_i_loop                      512    512    MULTIPLE 111.1615    102.4363    366     123.0577    166    

From the head of this branch (75f51d1)

Region                                         PETs   PEs    Count    Mean (s)    Min (s)     Min PET Max (s)     Max PET
macrop_tend                                    512    512    MULTIPLE 132.7466    123.5249    366     144.4218    149    
  clubb_tend_cam                               512    512    MULTIPLE 129.4563    120.4303    366     140.8711    149    
    clubb_tend_cam:ACCR                        512    512    MULTIPLE 107.8117    100.6424    366     117.3765    149    
      clubb_tend_cam:advance_clubb_core_api    512    512    MULTIPLE 91.2902     84.0197     366     100.7022    149    
      clubb_tend_cam:flip-index                512    512    MULTIPLE 11.2684     9.9628      220     13.7559     455    
   clubb_tend_cam:NAR                          512    512    MULTIPLE 21.4333     19.0054     401     24.0401     76     
                      qneg3                    512    512    MULTIPLE 0.5783      0.5268      474     0.6311      66     
   clubb_tend_cam:acc_copyin                   512    512    MULTIPLE 0.0059      0.0045      358     0.0081      260    
   clubb_tend_cam:acc_copyout                  512    512    MULTIPLE 0.0038      0.0025      314     0.0061      149    

So it seems these changes made clubb_tend_cam faster by ~2%.

I ran all the same tests mentioned above to confirm these new changes.

Collaborator

@Katetc Katetc left a comment

This is a really nice cleanup and the GPU code is not too obtrusive. I give it five stars!

My comments are mostly questions so I can understand what is going on as best as possible.

I am only about 96% confident that the vertical dimension changes are correct through the whole file. I think the regression tests will be necessary to make sure we have that right in every case.

src/physics/cam/clubb_intr.F90
if ( single_column .and. .not. scm_cambfb_mode ) then
call endrun(subr//': (single_column && !scm_cambfb_mode)=.true. not available when compiling with OpenACC')
end if
#endif
Collaborator
I'm still learning about GPUs, but I'm curious why we have to have an endrun here instead of just running non-GPU code on the CPUs?

Author

@huebleruwm huebleruwm Nov 11, 2024
That was my first instinct too. It would take some acc update host and acc update device directives to move the data we need back and forth inside the if-statements, and there would be no issue with doing so. The reason I didn't do that is that I was testing the GPU results against a CPU ensemble baseline that I had run with those options turned off. To confirm those changes I would've had to run another baseline with those options turned on, which would've taken at least another day, and I was trying to get this in quickly. Also, I asked Vince about the options, and he indicated they aren't run often or were more so test options that didn't turn out well.

So I thought it was safest to leave the code in those sections untouched and just disable the options for GPU runs. I did do a little test to confirm the CPU code ran to completion with them enabled, though, except clubb_do_adv, which didn't run to begin with due to some error in cnst_add called from clubb_register_cam.
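For context, the update-directive alternative being described would look roughly like this (a sketch with placeholder array names and a hypothetical routine name, not the actual code):

```fortran
! Sync device data back to the host, run the untouched CPU-only option,
! then push the results back to the device.
!$acc update host( rcm, cloud_frac )
if (do_rainturb) then
   call rainturb_cpu_routine( rcm, cloud_frac )   ! hypothetical name
end if
!$acc update device( rcm, cloud_frac )
```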

if (macmic_it==1) vpwp_clubb_gw_mc(:ncol,:) = 0._r8
if (macmic_it==1) thlp2_clubb_gw_mc(:ncol,:) = 0._r8
if (macmic_it==1) wpthlp_clubb_gw_mc(:ncol,:) = 0._r8
call t_stopf('clubb_tend_cam:flip-index')

do t=1,nadv ! do needed number of "sub" timesteps for each CAM step
Collaborator
This loop - Its kind of hard to tell but I think it's a larger loop for the cld_mac_mic run. Shouldn't this have an !acc tag?

Author

This is the overall sub-stepping loop, so it has to run sequentially. It also contains the call to advance_clubb_core and some other smaller routines, so it wouldn't be possible to offload it to the GPU.

@peverwhee peverwhee changed the title clubb_intr GPUization cam6_4_048: clubb_intr GPUization Nov 11, 2024
@peverwhee peverwhee changed the title cam6_4_048: clubb_intr GPUization cam6_4_049: clubb_intr GPUization Nov 12, 2024
@peverwhee peverwhee changed the title cam6_4_049: clubb_intr GPUization cam6_4_050: clubb_intr GPUization Nov 13, 2024
Collaborator

@cacraigucar cacraigucar left a comment

Since I was curious, I took a cursory look at this PR. I did find one minor item which I believe would be good to add (since it took me a while to find the answer). This is not intended to be a full review.

Comment on lines +63 to +64
nzm_clubb, &
nzt_clubb
Collaborator
Could a comment be added to describe what each of these new variables is? (I saw they were documented in the code, but it is fairly common to document them where they are declared as well.)

@nusbaume
Collaborator

Hi @huebleruwm, apologies for the delay (we had some CESM-wide PRs that had to come in first), but this PR is finally close to the top of our queue and thus we are about to run our full regression testing suite. Are there any remaining modifications you wanted to make before we test and merge, or is this PR ready to go on your end?
