
Accelerating MultiLayerQG on GPUs #373

Open · wants to merge 13 commits into base: main
2 changes: 2 additions & 0 deletions Project.toml
@@ -16,6 +16,7 @@ Reexport = "189a3867-3050-52da-a836-e630ba90ab69"
SpecialFunctions = "276daf66-3868-5448-9aa4-cd146d93841b"
StaticArrays = "90137ffa-7385-5640-81b9-e52037218182"
Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
KernelAbstractions = "63c18a36-062a-441e-b654-da1e3ab1ce7c"

[compat]
CUDA = "1, 2.4.2, 3.0.0 - 3.6.4, 3.7.1, 4, 5"
@@ -30,6 +31,7 @@ SpecialFunctions = "0.10, 1, 2"
StaticArrays = "0.12, 1"
Statistics = "1.6"
julia = "1.6"
KernelAbstractions = "0.9"

[extras]
BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
106 changes: 88 additions & 18 deletions src/multilayerqg.jl
@@ -18,11 +18,13 @@ using
LinearAlgebra,
StaticArrays,
Reexport,
DocStringExtensions
DocStringExtensions,
KernelAbstractions

@reexport using FourierFlows

using FourierFlows: parsevalsum, parsevalsum2, superzeros, plan_flows_rfft
using FourierFlows: parsevalsum, parsevalsum2, superzeros, plan_flows_rfft, CPU, GPU
using KernelAbstractions.Extras.LoopInfo: @unroll

nothingfunction(args...) = nothing

@@ -111,16 +113,16 @@ function Problem(nlayers::Int,  # number of fluid layers
aliased_fraction = 1/3,
T = Float64)

  if dev == GPU() && nlayers > 2
    @warn """MultiLayerQG module is not optimized on the GPU yet for configurations with
    3 fluid layers or more!
  # if dev == GPU() && nlayers > 2
  #   @warn """MultiLayerQG module is not optimized on the GPU yet for configurations with
  #   3 fluid layers or more!

Member comment: Why comment these out? Delete?

    See issues on Github at https://github.com/FourierFlows/GeophysicalFlows.jl/issues/112
    and https://github.com/FourierFlows/GeophysicalFlows.jl/issues/267.
  # See issues on Github at https://github.com/FourierFlows/GeophysicalFlows.jl/issues/112
  # and https://github.com/FourierFlows/GeophysicalFlows.jl/issues/267.

    To use MultiLayerQG with 3 fluid layers or more we suggest, for now, to restrict running
    on CPU."""
  end
  # To use MultiLayerQG with 3 fluid layers or more we suggest, for now, to restrict running
  # on CPU."""
  # end

if nlayers == 1
@warn """MultiLayerQG module does work for single-layer configuration but may not be as
@@ -533,16 +535,50 @@ Compute the inverse Fourier transform of `varh` and store it in `var`.
"""
invtransform!(var, varh, params::AbstractParams) = ldiv!(var, params.rfftplan, varh)

"""
@kernel function pvfromstreamfunction_kernel!(qh, ψh, S, nlayers)

Kernel for GPU acceleration of PV from streamfunction calculation, i.e., obtaining the Fourier
transform of the PV from the streamfunction `ψh` in each layer using `qh = params.S * ψh`.
"""
@kernel function pvfromstreamfunction_kernel!(qh, ψh, S, nlayers)
Member comment, suggested change:

    @kernel function pvfromstreamfunction_kernel!(qh, ψh, S, nlayers)
    @kernel function pvfromstreamfunction_kernel!(qh, ψh, S, ::Val{nlayers}) where nlayers

For @unroll you have to do this, and also pass Val(nlayers) rather than nlayers into the kernel when launching it. I don't know if it will speed things up though. It might.
i, j = @index(Global, NTuple)

@unroll for k = 1:nlayers

@inbounds qh[i, j, k] = 0

@unroll for m = 1:nlayers
Member comment: The @unroll doesn't do anything unless nlayers is known at compile time (this requires using Val, but I don't know if it will speed anything up... it might).
@inbounds qh[i, j, k] += S[i, j][k, m] * ψh[i, j, m]
end

end
end
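Following the reviewer's suggestion above, a Val-specialized variant would make `nlayers` a compile-time constant so that `@unroll` can actually unroll the loops. A minimal sketch (not part of this diff; whether it speeds things up is untested, per the review discussion):

```julia
# Sketch, assuming the same qh, ψh, S layouts as in the PR.
# ::Val{nlayers} bakes the layer count into the compiled kernel,
# which is what @unroll needs to unroll these loops.
@kernel function pvfromstreamfunction_kernel!(qh, ψh, S, ::Val{nlayers}) where nlayers
  i, j = @index(Global, NTuple)

  @unroll for k = 1:nlayers
    @inbounds qh[i, j, k] = 0
    @unroll for m = 1:nlayers
      @inbounds qh[i, j, k] += S[i, j][k, m] * ψh[i, j, m]
    end
  end
end

# At the launch site, pass Val(nlayers) instead of the plain integer:
#   kernel!(qh, ψh, S, Val(nlayers))
```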

"""
pvfromstreamfunction!(qh, ψh, params, grid)

Obtain the Fourier transform of the PV from the streamfunction `ψh` in each layer using
`qh = params.S * ψh`.
`qh = params.S * ψh`. The kernel defined above is launched over a two-dimensional work layout.
"""
function pvfromstreamfunction!(qh, ψh, params, grid)
for j=1:grid.nl, i=1:grid.nkr
CUDA.@allowscalar @views qh[i, j, :] .= params.S[i, j] * ψh[i, j, :]
end
  # Larger workgroups are generally more efficient. For more generality, we could add an
  # if statement that falls back to smaller workgroups when either nkr or nl is less than 8
workgroup = 8, 8

  # The worksize determines how many work items the kernel body runs over
worksize = grid.nkr, grid.nl

  # Instantiate the kernel for the relevant backend device
backend = KernelAbstractions.get_backend(qh)
kernel! = pvfromstreamfunction_kernel!(backend, workgroup, worksize)

# Launch the kernel
S, nlayers = params.S, params.nlayers
kernel!(qh, ψh, S, nlayers)

# This will ensure that no other operations occur until the kernel has finished
Member comment (@navidcy, Oct 28, 2024), suggested change:

    # This will ensure that no other operations occur until the kernel has finished
    # Ensure that no other operations occur until the kernel has finished

KernelAbstractions.synchronize(backend)

return nothing
end
@@ -587,16 +623,50 @@ function pvfromstreamfunction!(qh, ψh, params::TwoLayerParams, grid)
return nothing
end

"""
@kernel function streamfunctionfrompv_kernel!(ψh, qh, S⁻¹, nlayers)

Kernel for GPU acceleration of streamfunction from PV calculation, i.e., invert the PV to obtain
the Fourier transform of the streamfunction `ψh` in each layer from `qh` using `ψh = params.S⁻¹ qh`.
"""
@kernel function streamfunctionfrompv_kernel!(ψh, qh, S⁻¹, nlayers)
i, j = @index(Global, NTuple)

@unroll for k = 1:nlayers

@inbounds ψh[i, j, k] = 0

@unroll for m = 1:nlayers
@inbounds ψh[i, j, k] += S⁻¹[i, j][k, m] * qh[i, j, m]
end

end
end

Member comment: are the two kernel functions identical except the order they want their arguments?

Collaborator (author) comment: Looks like it, yes. It'd probably make sense to write one kernel in the more general form.

Member comment: I think so! Would make the code more robust.
"""
streamfunctionfrompv!(ψh, qh, params, grid)

Invert the PV to obtain the Fourier transform of the streamfunction `ψh` in each layer from
`qh` using `ψh = params.S⁻¹ qh`.
`qh` using `ψh = params.S⁻¹ qh`. The kernel defined above is launched over a two-dimensional work layout.
"""
function streamfunctionfrompv!(ψh, qh, params, grid)
for j=1:grid.nl, i=1:grid.nkr
CUDA.@allowscalar @views ψh[i, j, :] .= params.S⁻¹[i, j] * qh[i, j, :]
end
  # Larger workgroups are generally more efficient. For more generality, we could add an
  # if statement that falls back to smaller workgroups when either nkr or nl is less than 8
workgroup = 8, 8

  # The worksize determines how many work items the kernel body runs over
worksize = grid.nkr, grid.nl

  # Instantiate the kernel for the relevant backend device
backend = KernelAbstractions.get_backend(ψh)
kernel! = streamfunctionfrompv_kernel!(backend, workgroup, worksize)

# Launch the kernel
S⁻¹, nlayers = params.S⁻¹, params.nlayers
kernel!(ψh, qh, S⁻¹, nlayers)

# This will ensure that no other operations occur until the kernel has finished
KernelAbstractions.synchronize(backend)

return nothing
end
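As the review discussion above observes, the two kernels are identical up to which fields and which matrix (`S` or `S⁻¹`) they use, so a single generic kernel could serve both call sites. A sketch of that refactor (the name `fieldwise_matmul_kernel!` is illustrative, not part of the PR):

```julia
# Sketch: one kernel for both directions of the transform.
# out and in are (nkr, nl, nlayers) arrays; M is an (nkr, nl) array of
# small nlayers×nlayers matrices (params.S or params.S⁻¹ in the PR).
@kernel function fieldwise_matmul_kernel!(out, in, M, nlayers)
  i, j = @index(Global, NTuple)

  @unroll for k = 1:nlayers
    @inbounds out[i, j, k] = 0
    @unroll for m = 1:nlayers
      @inbounds out[i, j, k] += M[i, j][k, m] * in[i, j, m]
    end
  end
end

# pvfromstreamfunction!  would launch it as kernel!(qh, ψh, params.S,   nlayers)
# streamfunctionfrompv!  would launch it as kernel!(ψh, qh, params.S⁻¹, nlayers)
```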