Accelerating MultiLayerQG on GPUs #373

Open · wants to merge 13 commits into main

Conversation

mpudig (Collaborator) commented Oct 28, 2024

This pull request addresses accelerating the PV-streamfunction inversion in MultiLayerQG for an arbitrary number of layers on a GPU, as discussed here.

I've used KernelAbstractions to replace what used to be a loop over all (x, y) grid points; a double loop over the number of layers remains inside the kernel. Based on some simple tests (seen here), these changes substantially accelerate the code for certain set-ups with more than two layers.
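For context, the launch pattern looks roughly like the following sketch; the backend handling and the grid-field names are illustrative rather than the exact code in the PR, and ψh, qh, params, and grid come from the problem setup.

using CUDA, KernelAbstractions

# Sketch: build the kernel for the backend with an (8, 8) workgroup, launch it
# over the 2D wavenumber grid, then wait for it to finish before continuing.
backend = CUDABackend()
kernel! = streamfunctionfrompv_kernel!(backend, (8, 8))
kernel!(ψh, qh, params.S⁻¹, nlayers; ndrange = (grid.nkr, grid.nl))
KernelAbstractions.synchronize(backend)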

mpudig requested review from navidcy and glwagner, October 28, 2024 20:10
navidcy (Member) commented Oct 28, 2024

We should bump a patch release.

Comment on lines 116 to 118
# if dev == GPU() && nlayers > 2
# @warn """MultiLayerQG module is not optimized on the GPU yet for configurations with
# 3 fluid layers or more!
Member:

Why comment these out? Delete?

S, nlayers = params.S, params.nlayers
kernel!(qh, ψh, S, nlayers)

# This will ensure that no other operations occur until the kernel has finished
navidcy (Member) commented Oct 28, 2024:

Suggested change
# This will ensure that no other operations occur until the kernel has finished
# Ensure that no other operations occur until the kernel has finished

Comment on lines 626 to 645
"""
@kernel function streamfunctionfrompv_kernel!(ψh, qh, S⁻¹, nlayers)

Kernel for GPU acceleration of streamfunction from PV calculation, i.e., invert the PV to obtain
the Fourier transform of the streamfunction `ψh` in each layer from `qh` using `ψh = params.S⁻¹ qh`.
"""
@kernel function streamfunctionfrompv_kernel!(ψh, qh, S⁻¹, nlayers)
    i, j = @index(Global, NTuple)

    @unroll for k = 1:nlayers

        @inbounds ψh[i, j, k] = 0

        @unroll for m = 1:nlayers
            @inbounds ψh[i, j, k] += S⁻¹[i, j][k, m] * qh[i, j, m]
        end

    end
end

Member:

Are the two kernel functions identical except for the order of their arguments?

Collaborator Author:

Looks like it, yes. It'd probably make sense to write one kernel in a more general form.

Member:

I think so! Would make the code more robust.

Comment on lines 547 to 551
@unroll for k = 1:nlayers

    @inbounds qh[i, j, k] = 0

    @unroll for m = 1:nlayers
Member:

The @unrolls don't do anything unless nlayers is known at compile time (this requires using Val, but I don't know whether it will speed anything up... it might).

Kernel for GPU acceleration of PV from streamfunction calculation, i.e., obtaining the Fourier
transform of the PV from the streamfunction `ψh` in each layer using `qh = params.S * ψh`.
"""
@kernel function pvfromstreamfunction_kernel!(qh, ψh, S, nlayers)
Member:

Suggested change
@kernel function pvfromstreamfunction_kernel!(qh, ψh, S, nlayers)
@kernel function pvfromstreamfunction_kernel!(qh, ψh, S, ::Val{nlayers}) where nlayers

For @unroll you have to do this, and also pass Val(nlayers) rather than nlayers into the kernel when launching it. I don't know whether it will speed things up, though. It might.
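For reference, the call site would then look something like this sketch (the workgroup size and grid-field names are illustrative):

S, nlayers = params.S, params.nlayers
kernel! = pvfromstreamfunction_kernel!(backend, (8, 8))

# Pass Val(nlayers) so the layer count is a compile-time constant inside the kernel
kernel!(qh, ψh, S, Val(nlayers); ndrange = (grid.nkr, grid.nl))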

Comment on lines +533 to +545
@kernel function PVinversion_kernel!(a, b, M, ::Val{nlayers}) where nlayers
    i, j = @index(Global, NTuple)

    @unroll for k = 1:nlayers

        @inbounds a[i, j, k] = 0

        @unroll for m = 1:nlayers
            @inbounds a[i, j, k] += M[i, j][k, m] * b[i, j, m]
        end

    end
end
Collaborator Author:

I rewrote the kernel in a more general form and added Val. The code has sped up slightly, but the 16-thread CPU still outperforms the GPU. Compare these benchmarks to what I showed here:

  • GPU
    nlayers = 12; nx = 512; prob = MultiLayerQG.Problem(nlayers, GPU(); nx); @btime stepforward!(prob)
    668.165 ms (2533 allocations: 191.19 KiB)

  • CPU with 16 threads
    nlayers = 12; nx = 512; prob = MultiLayerQG.Problem(nlayers, CPU(); nx); @btime stepforward!(prob)
    444.419 ms (113 allocations: 5.61 KiB)

Member:

Are you sure you are timing the GPU properly?

Collaborator Author:

julia> nlayers = 12; nx = 512; prob = MultiLayerQG.Problem(nlayers, GPU(); nx); @benchmark CUDA.@sync CUDA.@time stepforward!(prob)

0.681338 seconds (57.95 k CPU allocations: 2.419 MiB) (18 GPU allocations: 345.250 MiB, 0.04% memmgmt time)
0.678481 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.676472 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.694825 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.678072 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.677693 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.678237 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.04% memmgmt time)
0.677198 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.676980 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.676189 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.678326 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.677010 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.677142 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.04% memmgmt time)
0.676321 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.04% memmgmt time)
0.678115 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.677461 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.678255 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.677168 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
0.677529 seconds (2.58 k CPU allocations: 194.141 KiB) (16 GPU allocations: 297.156 MiB, 0.03% memmgmt time)
BenchmarkTools.Trial: 8 samples with 1 evaluation.
Range (min … max):  676.718 ms … 678.645 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     677.681 ms               ┊ GC (median):    0.00%
Time  (mean ± σ):   677.765 ms ± 618.941 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

█                     █  ██       █   █                 █   █  
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁██▁▁▁▁▁▁▁█▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█ ▁
677 ms           Histogram: frequency by time          679 ms <

Memory estimate: 199.41 KiB, allocs estimate: 2660.

Seems to be roughly the same as above? Unless I'm misunderstanding what this benchmark is doing...

Member:

That's disappointing...

The first thing I would try to figure out is whether this function is indeed the bottleneck. It might be better, in fact, to simply benchmark this function in isolation.

I don't know if it matters, but I saw the workgroup is 8, 8. Usually we use 16, 16.

I would also check 3 layers first perhaps. The double inner loop gets slower with more layers, and perhaps the computational costs scale differently on CPU vs GPU. That might give a clue.

The loop is over k, m, which are the slowest (last) indices in a and b. That could be an issue. If you can benchmark this operation in isolation, then you can experiment with new arrays where k, m are the fastest (first) indices in a and b. This experiment would tell you what kind of slowdown that's incurring.

Maybe the @unroll is not working for some reason. When I've seen stuff like this before, people have used matrix / linear algebra via StaticArrays, rather than explicit loops as used here. If you are just testing the kernel in isolation, transforming a, b to arrays of StaticVector could also be something to experiment with.

In general with performance engineering, one has to really be persistent and creative and test test test. To make this easier you want to extract out the function you're trying to optimize and work with a very idealized test case that also allows you to change the data structure (rather than working with a FourierFlows script). Think of this as research. If you find that you need to rearrange memory differently, then we can come back to GeophysicalFlows and see whether that is feasible or not.
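For example, a stand-alone benchmark of just the kernel (using the PVinversion_kernel! quoted above) might look something like the following; the sizes, element types, and the 16 × 16 workgroup are only illustrative:

using CUDA, KernelAbstractions, StaticArrays, BenchmarkTools

nlayers, nkr, nl = 3, 257, 512   # illustrative sizes; nkr = number of real-FFT x wavenumbers for nx = 512

# Stand-alone inputs with the same shapes and element types as in MultiLayerQG
qh  = CuArray(rand(ComplexF64, nkr, nl, nlayers))
ψh  = CuArray(zeros(ComplexF64, nkr, nl, nlayers))
S⁻¹ = CuArray([rand(SMatrix{nlayers, nlayers, Float64}) for i in 1:nkr, j in 1:nl])

backend = CUDABackend()
kernel! = PVinversion_kernel!(backend, (16, 16))   # try a 16 × 16 workgroup
val_n   = Val(nlayers)

# Time just the kernel, synchronizing so the GPU work is actually measured
@btime CUDA.@sync $kernel!($ψh, $qh, $S⁻¹, $val_n; ndrange = ($nkr, $nl))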

mpudig (Collaborator Author) commented Dec 19, 2024:

@glwagner @navidcy Apologies for the slow uptake!

As you suggested, I tested the streamfunctionfrompv function in isolation, and it was indeed the bottleneck. Of the suggestions you proposed, writing a and b in the kernel as 2D arrays of StaticVector led, by far, to the best performance improvement.

When a and b are defined this way, and M (which will be S or S⁻¹ depending on the inversion direction) is a 2D array of StaticArray, the kernel is very elegant:

@kernel function PVinversion_kernel!(a, b, M, ::Val{nlayers}) where nlayers
    i, j = @index(Global, NTuple)

    @inbounds a[i, j] = M[i, j] * b[i, j]

end
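(For concreteness, the arrays in this version are laid out roughly as follows; the sizes and element types below are only illustrative.)

using CUDA, StaticArrays

nlayers, nkr, nl = 3, 257, 512

# One static vector of per-layer values (a, b) or one static matrix (M) per horizontal wavenumber
b = CuArray([rand(SVector{nlayers, ComplexF64}) for i in 1:nkr, j in 1:nl])
a = CuArray([zero(SVector{nlayers, ComplexF64}) for i in 1:nkr, j in 1:nl])
M = CuArray([rand(SMatrix{nlayers, nlayers, Float64}) for i in 1:nkr, j in 1:nl])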

I benchmarked this new kernel and structure of a, b and compared it to what this PR previously proposed (what I call "current" in the attached figure). The results are pretty impressive – 3 orders of magnitude speed-up in some cases! (These tests were done on a V100 GPU.) With the previous code the complexity scaled as nlayers^4 (matrix-vector multiply + the double loop), whereas without the double loop it scales roughly as nlayers^2. The proposed change is also faster for the nlayers = 2 case, which currently hardcodes the inversion step.

[Figure: streamfunctionfrompv GPU benchmarking results]

I'm not an expert on writing code for GPUs – how feasible do you think it would be to write all the variables required to step the model forward (qh, ψh, q, ψ, etc.) as arrays of StaticVector? It seems like it would require changing the FFTW structure.

Let me know what you think about this. I'd be happy to try reorganising the code and implementing it if you think it's a good idea. This sort of acceleration would be a huge boon for my research – and it would obviously benefit others who want to run these sorts of layered QG simulations as well.

Member:

> I'm not an expert on writing code for GPUs – how feasible do you think it would be to write all the variables required to step the model forward

Not true. You are an expert. That is why this PR was able to succeed.

Member:

exactly!!

Member:

> how feasible do you think it would be to write all the variables required to step the model forward (qh, ψh, q, ψ, etc.) as arrays of StaticVector?

I don't exactly understand what the intent is; can you spell out your design a little more clearly?

I believe that FFTs act on CuArrays of floats. If I understand you correctly, you would like to use StaticVector for the vertical direction, is that right? However, this would mean that we do not have a CuArray of floats; we have a CuArray of StaticVector. So I don't think this is compatible with FFTs.

I think the simplest solution is to keep the CuArray representation of all the arrays, but either 1) allocate new CuArrays of StaticVector as temporary variables for performing the inversion, or 2) build the StaticVector inside the inversion kernel.

Number 2 actually seems easier to try right now. It would be something like:

@kernel function PVinversion_kernel!(a, b, M, ::Val{N}) where N
    i, j = @index(Global, NTuple)

    # Gather the N layer values at this grid point into a static vector
    b_tuple = ntuple(Val(N)) do n
        @inbounds b[i, j, n]
    end

    T = eltype(a)
    b_sv = SVector{N, T}(b_tuple)

    # Static matrix-vector multiply performs the inversion at this grid point
    a_sv = @inbounds M[i, j] * b_sv

    # Scatter the result back into the output array
    ntuple(Val(N)) do n
        @inbounds a[i, j, n] = a_sv[n]
    end
end

There may be many creative ways to do this, though. It may be possible to reinterpret the memory occupied by a CuArray of SVector as a simple CuArray with a different shape; there is a function reinterpret for this. Then perhaps you could perform the FFTs on that.
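As a rough sketch of that reinterpret idea (shown on the CPU for clarity; the element types and sizes are illustrative, and whether reinterpret(reshape, ...) works directly on a CuArray would need checking):

using StaticArrays

nlayers, nkr, nl = 3, 257, 512
ψh_sv = [rand(SVector{nlayers, ComplexF64}) for i in 1:nkr, j in 1:nl]

# View the same memory as a plain complex array of size (nlayers, nkr, nl).
# The layer index becomes the first (fastest) dimension, so the horizontal FFTs
# would have to act along dims 2:3 (or the data would need to be permuted).
ψh_flat = reinterpret(reshape, ComplexF64, ψh_sv)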

navidcy (Member) commented Nov 17, 2024

@mpudig any idea why CI breaks?

mpudig (Collaborator Author) commented Dec 19, 2024

> @mpudig any idea why CI breaks?

Hm, it's the two 3-layer tests that fail, which isn't ideal... They also fail only on Julia 1.11 – maybe a package inconsistency? I can't work out where in the test the failure occurs.
