
Cuda heat example w quaditer #913

Draft
wants to merge 139 commits into master

Conversation

Abdelrahman912
Contributor

Heat Example Prototype using CUDA.jl and StaticCellValues

end

Kgpu = CUDA.zeros(dh.ndofs.x,dh.ndofs.x)
gpu_dh = GPUDofHandler(dh)
Member

Maybe that's the plan anyway, but wouldn't it be nicer to write Adapt rules for DofHandler and Grid which return the corresponding GPU structs?

function Adapt.adapt_structure(to, dh::DofHandler)
    return Adapt.adapt_structure(to, GPUDofHandler(cu(Int32.(dh.cell_dofs)), GPUGrid(dh.grid)))
end

and then, from a user perspective, just use the "normal" structs, which are automatically converted when needed
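
A minimal sketch of that user-facing flow, assuming the rule above is in place and that the adapted struct exposes a cell_dofs vector (the kernel, the field access, and the launch configuration here are illustrative, not code from this PR). CUDA.jl passes every kernel argument through cudaconvert, which goes through Adapt, so the plain DofHandler can be handed to the kernel directly:

using CUDA, Adapt

# Hypothetical kernel that only reads through whatever struct the Adapt rule produced.
function read_dofs_kernel!(out, dh)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(out)
        out[i] = dh.cell_dofs[i]  # assumes the adapted struct has a `cell_dofs` field
    end
    return nothing
end

out = CUDA.zeros(Int32, 1024)
# `dh` is the ordinary DofHandler from the heat example; `@cuda` runs `cudaconvert`
# (and therefore the Adapt rule above) on each argument before the kernel sees it.
@cuda threads=256 blocks=4 read_dofs_kernel!(out, dh)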

Member

I am still a bit split about this, because it contains a performance pitfall: with repeated assembly, the dof handler would be converted and copied to the GPU for every assembly instead of only once.
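
To illustrate the pitfall with the hypothetical names from the sketch above (nsteps is just a stand-in for repeated assembly): if the host-to-device copy lives inside the Adapt rule, every launch pays for it, whereas converting once up front keeps the repeated launches cheap.

# Re-uploads the dof handler on every launch (the Adapt rule allocates and copies each time):
for step in 1:nsteps
    @cuda threads=256 blocks=4 read_dofs_kernel!(out, dh)
end

# Convert once and reuse; `gpu_dh` already lives on the GPU, so each launch only does
# the cheap CuArray -> CuDeviceArray rewrapping:
gpu_dh = GPUDofHandler(dh)
for step in 1:nsteps
    @cuda threads=256 blocks=4 read_dofs_kernel!(out, gpu_dh)
end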

Member

Am I missing something, or wouldn't the adapt call also happen for the GPUDofHandler on each assembly kernel launch in that case?

Member

This should not happen, because we do not need to adapt the GPUDofHandler (it is already a GPU data structure).
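
A sketch of how the GPUDofHandler in this PR might be laid out for that to hold (the field names are assumptions): it stores device arrays directly, so the only launch-time work is Adapt's field-wise rewrapping of CuArray into CuDeviceArray, which is allocation- and copy-free.

using CUDA, Adapt

struct GPUDofHandler{V <: AbstractVector{Int32}, G}
    cell_dofs::V   # already a CuVector after the one-time upload
    grid::G
end

# Field-wise adapt rule: at kernel launch this only swaps the array wrappers,
# it does not touch device memory.
Adapt.@adapt_structure GPUDofHandler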

end


gm = static_cellvalues.gm
Collaborator

Maybe this is still work in progress, but the mix of functions and global variables makes the code kind of confusing to read.

@Abdelrahman912
Contributor Author

What I have done for now, and it is still work in progress:

  1. I added some higher-level abstractions and some restructuring to match the original example (still need some refactoring).
  2. I used the QuadratureValuesIterator and edited the StaticCellValue object to be compatible with the GPU.

This is still work in progress; following my discussion with @termi-official last week, I still need to work on the assembler and the coloring algorithm.

Some problems I have encountered that might not be so straightforward to tackle:

  1. The Grid object contains Dict fields, which are not GPU compatible.

@termi-official
Member

Great to see some quick progress here!

Some problems I have encountered that might not be so straightforward to tackle:

1. The `Grid` object contains `Dict` fields, which are not GPU compatible.

I think that is straightforward to solve. We never really need the Dicts directly during assembly. We should be able to get away with just converting the Vectors (once) to GPU vectors and running the assembly with those. This might require two structs: one holding the full information (e.g. GPUGrid) and one which we use in the kernels (e.g. GPUGridView). Maybe the latter could be something like

struct GPUGridView{TEA, TNA, TSA <: Union{Nothing, <:AbstractVector{Int}, <:AbstractVector{FaceIndex}}, TCA} <: AbstractGrid # (?), the Union could allow further index types
    cells::TEA
    nodes::TNA
    subdomain::TSA
    color::TCA
end

where subdomain just holds the data which we want to iterate over (or nothing for all cells) and color is a vector of the elements with one color within the current subdomain.
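
A hedged sketch of how such a view might be built once on the host, assuming the struct above and a Grid with cells, nodes and a Dict-backed cellsets field (the helper name and the Int32 narrowing are illustrative, not part of the PR):

using CUDA

function gpu_grid_view(grid, cellset_name, color)
    cells  = cu(grid.cells)                                    # element connectivity
    nodes  = cu(grid.nodes)                                     # node coordinates
    # Flatten the Dict-backed set once; `nothing` means "iterate over all cells".
    subdom = cellset_name === nothing ? nothing :
             cu(collect(Int32, grid.cellsets[cellset_name]))
    return GPUGridView(cells, nodes, subdom, cu(Int32.(color)))
end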

@KnutAM
Member

KnutAM commented May 23, 2024

A longer-term thing, just to throw out the idea, but perhaps a slimmer Grid could be nice?

struct Grid{dim, C, T, CV, NV, S}
    cells::CV
    nodes::NV
    gridsets::S
    function Grid(cells::AbstractVector{C}, nodes::AbstractVector{Node{dim, T}}, gridsets) where {C, dim, T}
        return new{dim, C, T, typeof(cells), typeof(nodes), typeof(gridsets)}(cells, nodes, gridsets)
    end
end
struct GridSets
    facetsets::Dict{String, OrderedSet{FacetIndex}}
    cellsets::Dict{String, OrderedSet{Int}}
    # ...
end

also allowing gridsets = nothing

@termi-official
Member

termi-official commented Jun 4, 2024

A longer-term thing, just to throw out the idea, but perhaps a slimmer Grid could be nice?

struct Grid{dim, C, T, CV, NV, S}
    cells::CV
    nodes::NV
    gridsets::S
    function Grid(cells::AbstractVector{C}, nodes::AbstractVector{Node{dim, T}}, gridsets) where {C, dim, T}
        return new{dim, C, T, typeof(cells), typeof(nodes), typeof(gridsets)}(cells, nodes, gridsets)
    end
end
struct GridSets
    facetsets::Dict{String, OrderedSet{FacetIndex}}
    cellsets::Dict{String, OrderedSet{Int}}
    # ...
end

also allowing gridsets = nothing

I have thought about this quite a bit already, i.e. whether we should have our grid in the form

struct Grid{dim, C, T, CV, NV, S}
    cells::CV
    nodes::NV
    subdomain_info::S
    function Grid(cells::AbstractVector{C}, nodes::AbstractVector{Node{dim, T}}, subdomain_info) where {C, dim, T}
        return new{dim, C, T, typeof(cells), typeof(nodes), typeof(subdomain_info)}(cells, nodes, subdomain_info)
    end
end

where subdomain_info contains any kind of subdomain information. This could potentially also include some optional topology information which we need for some problems. In the simplest case it would just be the facetsets and cellsets.

However, we should do this in a separate PR. What do you think @fredrikekre ?

@Abdelrahman912
Contributor Author

So far:

  1. Example implementation using the coloring algorithm; I did my best to follow the same abstractions as in the CPU case (one could also circumvent this by introducing metaprogramming to set up the kernel before launching, but that might be relevant for a later discussion).
  2. I had to implement a custom assembler (naive implementation), gpu_assembler, because the existing one cannot be used: its permutation and sorteddofs attributes are mutable (both in their elements and their size), so it is only valid for sequential code, not to mention the resize!.
  3. Also, setting an index for CuSparseMatrixCSC is not allowed inside a kernel (ref: https://discourse.julialang.org/t/cuda-jl-atomic-addition-error-to-a-sparse-array-inside-cuda-kernel/78789/2), so I wrote a very naive GPUSparseMatrixCSC (a rough sketch is given below).
  4. Final observation: create_sparsity_pattern only creates a sparse matrix with nzval of type Float64, as follows:
    - K = spzeros!!(Float64, I, J, ndofs(dh), ndofs(dh)) #old code
    + K = spzeros!!(T, I, J, ndofs(dh), ndofs(dh)) # my proposal

I don't know whether this was intended, but I found it worth mentioning.
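
For reference, a hedged guess at what such a naive GPU CSC container could look like; the field names mirror SparseMatrixCSC and are assumptions, not the actual code from this PR:

using CUDA, SparseArrays

struct GPUSparseMatrixCSC{CV <: AbstractVector, IV <: AbstractVector{<:Integer}}
    m::Int
    n::Int
    colptr::IV   # column j occupies nzval[colptr[j]:colptr[j+1]-1]
    rowval::IV   # row index of each stored entry
    nzval::CV    # stored values, written to from the assembly kernel
end

# One-time upload of a host sparsity pattern (the values are overwritten during assembly).
GPUSparseMatrixCSC(K::SparseMatrixCSC) =
    GPUSparseMatrixCSC(size(K, 1), size(K, 2), CuArray(K.colptr), CuArray(K.rowval), CuArray(K.nzval))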

@termi-official
Member

Thanks for putting this together so far! Some quick comments for you before the next meeting.

2. I had to implement a custom assembler (naive implementation), `gpu_assembler`, because the existing one cannot be used: its `permutation` and `sorteddofs` attributes are mutable (both in their elements and their size), so it is only valid for sequential code, not to mention the `resize!`.

Indeed, and I have started refactoring some of the assembly code in #916. I also think that we cannot get away with reusing the existing assembler and that we need a custom one.

3. Also, setting an index for `CuSparseMatrixCSC` is not allowed inside a kernel (ref: https://discourse.julialang.org/t/cuda-jl-atomic-addition-error-to-a-sparse-array-inside-cuda-kernel/78789/2), so I wrote a very naive `GPUSparseMatrixCSC`.

Indeed, but you should be able to write into nzval directly. Your GPUSparseMatrixCSC struct already has a structure very similar to the CUSPARSE one, so switching should be straightforward.
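
A hedged sketch of what writing into nzval from inside a kernel could look like, assuming CUDA.jl's CuSparseDeviceMatrixCSC field names (colPtr, rowVal, nzVal) and that the sparsity pattern already contains the entry (i, j); the helper name is made up:

# Scatter-add one local contribution K[i, j] += v from inside a kernel.
@inline function add_to_csc!(K, i::Int32, j::Int32, v)
    nzval  = K.nzVal
    rowval = K.rowVal
    # Entries of column j live in positions colPtr[j] : colPtr[j+1]-1.
    for p in K.colPtr[j]:(K.colPtr[j + 1] - Int32(1))
        if rowval[p] == i
            CUDA.@atomic nzval[p] += v   # atomic in case two cells touch the same dof pair
            return nothing
        end
    end
    return nothing
end

With the coloring approach, cells of one color do not share dofs, so the atomic could in principle be dropped there.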

4. Final observation: `create_sparsity_pattern` only creates a sparse matrix with `nzval` of type `Float64`, as follows:
    - K = spzeros!!(Float64, I, J, ndofs(dh), ndofs(dh)) #old code
    + K = spzeros!!(T, I, J, ndofs(dh), ndofs(dh)) # my proposal

I don't know whether this was intended, but I found it worth mentioning.

Indeed. Fredrik has already put something great together to fix this in #888, and I hope we can merge it in the not too distant future to have more direct support for different formats.

@Abdelrahman912
Contributor Author

A GPU benchmark with a 1000 × 1000 grid, biquadratic Lagrange approximation functions, and a 3 × 3 quadrature rule for numerical integration.

Profiler ran for 477.09 ms, capturing 2844 events.
Host-side activity: calling CUDA APIs took 303.82 ms (63.68% of the trace)
┌──────┬───────────┬───────────┬────────┬─────────────────────────┬────────────────────────────┐
│   ID │     Start │      Time │ Thread │ Name                    │                    Details │
├──────┼───────────┼───────────┼────────┼─────────────────────────┼────────────────────────────┤
│    5 │  19.55 µs │   8.82 µs │      1 │ cuMemAllocFromPoolAsync │  15.274 MiB, device memory │
│   19 │  38.39 µs │ 953.67 ns │      1 │ cuStreamSynchronize     │                          - │
│   28 │  44.35 µs │   2.66 ms │      1 │ cuMemcpyHtoDAsync       │                          - │
│   33 │   2.71 ms │   7.15 µs │      1 │ cuMemAllocFromPoolAsync │ 244.202 MiB, device memory │
│  170 │   2.84 ms │   1.43 µs │      1 │ cuStreamSynchronize     │                          - │
│  179 │   2.84 ms │  44.51 ms │      1 │ cuMemcpyHtoDAsync       │                          - │
│  184 │  47.38 ms │  27.18 µs │      1 │ cuMemAllocFromPoolAsync │ 244.202 MiB, device memory │
│  309 │  47.47 ms │   3.58 µs │      1 │ cuStreamSynchronize     │                          - │
│  318 │  47.49 ms │  41.89 ms │      1 │ cuMemcpyHtoDAsync       │                          - │
│  323 │  89.42 ms │  38.86 µs │      1 │ cuMemAllocFromPoolAsync │  15.274 MiB, device memory │
│  335 │  89.48 ms │  64.61 µs │      1 │ cuMemsetD32Async        │                          - │
│  356 │  98.81 ms │  30.76 µs │      1 │ cuMemAllocFromPoolAsync │  34.332 MiB, device memory │
│  370 │  98.86 ms │   4.29 µs │      1 │ cuStreamSynchronize     │                          - │
│  379 │  98.88 ms │    5.4 ms │      1 │ cuMemcpyHtoDAsync       │                          - │
│  388 │ 104.32 ms │  21.46 µs │      1 │ cuMemAllocFromPoolAsync │  30.518 MiB, device memory │
│  516 │ 104.41 ms │   2.38 µs │      1 │ cuStreamSynchronize     │                          - │
│  525 │ 104.42 ms │   4.74 ms │      1 │ cuMemcpyHtoDAsync       │                          - │
│  534 │ 110.25 ms │  24.56 µs │      1 │ cuMemAllocFromPoolAsync │   3.815 MiB, device memory │
│  548 │ 110.29 ms │   4.53 µs │      1 │ cuStreamSynchronize     │                          - │
│  557 │ 110.31 ms │ 797.99 µs │      1 │ cuMemcpyHtoDAsync       │                          - │
│  566 │ 111.12 ms │  13.35 µs │      1 │ cuMemAllocFromPoolAsync │   7.645 MiB, device memory │
│ 1054 │  111.3 ms │   1.19 µs │      1 │ cuStreamSynchronize     │                          - │
│ 1063 │  111.3 ms │   1.39 ms │      1 │ cuMemcpyHtoDAsync       │                          - │
│ 1072 │ 114.73 ms │  13.59 µs │      1 │ cuMemAllocFromPoolAsync │   3.815 MiB, device memory │
│ 1086 │ 114.75 ms │   1.91 µs │      1 │ cuStreamSynchronize     │                          - │
│ 1095 │ 114.76 ms │ 598.67 µs │      1 │ cuMemcpyHtoDAsync       │                          - │
│ 1112 │ 115.43 ms │   8.11 µs │      1 │ cuMemAllocFromPoolAsync │   3.164 MiB, device memory │
│ 1124 │ 115.44 ms │  50.54 µs │      1 │ cuMemsetD32Async        │                          - │
│ 1129 │ 115.49 ms │   6.91 µs │      1 │ cuMemAllocFromPoolAsync │ 360.000 KiB, device memory │
│ 1141 │  115.5 ms │   6.44 µs │      1 │ cuMemsetD32Async        │                          - │
│ 1162 │ 149.56 ms │   1.22 ms │      1 │ cuMemAllocFromPoolAsync │  34.332 MiB, device memory │
│ 1176 │  150.8 ms │   4.29 µs │      1 │ cuStreamSynchronize     │                          - │
│ 1185 │ 150.82 ms │   7.98 ms │      1 │ cuMemcpyHtoDAsync       │                          - │
│ 1194 │ 158.81 ms │ 722.41 µs │      1 │ cuMemAllocFromPoolAsync │  30.518 MiB, device memory │
│ 1208 │ 159.55 ms │   3.58 µs │      1 │ cuStreamSynchronize     │                          - │
│ 1217 │ 159.56 ms │   6.86 ms │      1 │ cuMemcpyHtoDAsync       │                          - │
│ 1226 │ 167.29 ms │  15.02 µs │      1 │ cuMemAllocFromPoolAsync │   3.815 MiB, device memory │
│ 1240 │ 167.34 ms │   3.58 µs │      1 │ cuStreamSynchronize     │                          - │
│ 1249 │ 167.35 ms │ 777.24 µs │      1 │ cuMemcpyHtoDAsync       │                          - │
│ 1258 │ 168.14 ms │   6.68 µs │      1 │ cuMemAllocFromPoolAsync │   7.645 MiB, device memory │
│ 1965 │ 168.38 ms │   1.19 µs │      1 │ cuStreamSynchronize     │                          - │
│ 1974 │ 168.39 ms │   1.41 ms │      1 │ cuMemcpyHtoDAsync       │                          - │
│ 1983 │ 171.42 ms │  12.87 µs │      1 │ cuMemAllocFromPoolAsync │   3.815 MiB, device memory │
│ 1997 │ 171.46 ms │   2.15 µs │      1 │ cuStreamSynchronize     │                          - │
│ 2006 │ 171.47 ms │ 816.58 µs │      1 │ cuMemcpyHtoDAsync       │                          - │
│ 2023 │ 172.42 ms │   84.4 µs │      1 │ cuLaunchKernel          │                          - │
│ 2801 │ 172.93 ms │  13.11 µs │      2 │ cuMemFreeAsync          │   3.815 MiB, device memory │
│ 2806 │ 172.95 ms │   2.38 µs │      2 │ cuMemFreeAsync          │   7.645 MiB, device memory │
│ 2811 │ 172.96 ms │   2.38 µs │      2 │ cuMemFreeAsync          │   3.815 MiB, device memory │
│ 2816 │ 172.96 ms │   2.62 µs │      2 │ cuMemFreeAsync          │  30.518 MiB, device memory │
│ 2821 │ 172.97 ms │   2.86 µs │      2 │ cuMemFreeAsync          │  34.332 MiB, device memory │
│ 2824 │ 172.97 ms │  303.8 ms │      2 │ cuStreamSynchronize     │                          - │
└──────┴───────────┴───────────┴────────┴─────────────────────────┴────────────────────────────┘

Device-side activity: GPU was busy for 418.87 ms (87.80% of the trace)
┌──────┬───────────┬───────────┬─────────┬────────┬──────┬─────────────┬──────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────
│   ID │     Start │      Time │ Threads │ Blocks │ Regs │        Size │   Throughput │ Name                                                                                            
├──────┼───────────┼───────────┼─────────┼────────┼──────┼─────────────┼──────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────
│   28 │ 247.24 µs │    2.6 ms │       - │      - │    - │  15.274 MiB │  5.741 GiB/s │ [copy pageable to device memory]
│  179 │   3.02 ms │  44.45 ms │       - │      - │    - │ 244.202 MiB │  5.365 GiB/s │ [copy pageable to device memory]
│  318 │  47.62 ms │  41.88 ms │       - │      - │    - │ 244.202 MiB │  5.694 GiB/s │ [copy pageable to device memory]
│  335 │  89.97 ms │  250.1 µs │       - │      - │    - │  15.274 MiB │ 59.640 GiB/s │ [set device memory]
│  379 │  99.14 ms │   5.27 ms │       - │      - │    - │  34.332 MiB │  6.362 GiB/s │ [copy pageable to device memory]
│  525 │ 104.54 ms │   4.76 ms │       - │      - │    - │  30.518 MiB │  6.262 GiB/s │ [copy pageable to device memory]
│  557 │ 110.52 ms │ 791.31 µs │       - │      - │    - │   3.815 MiB │  4.708 GiB/s │ [copy pageable to device memory]
│ 1063 │  111.5 ms │   1.35 ms │       - │      - │    - │   7.645 MiB │  5.519 GiB/s │ [copy pageable to device memory]
│ 1095 │ 114.97 ms │ 537.16 µs │       - │      - │    - │   3.815 MiB │  6.935 GiB/s │ [copy pageable to device memory]
│ 1124 │ 115.95 ms │  58.41 µs │       - │      - │    - │   3.164 MiB │ 52.898 GiB/s │ [set device memory]
│ 1141 │ 116.02 ms │  13.11 µs │       - │      - │    - │ 360.000 KiB │ 26.182 GiB/s │ [set device memory]
│ 1185 │ 153.41 ms │   5.53 ms │       - │      - │    - │  34.332 MiB │  6.064 GiB/s │ [copy pageable to device memory]
│ 1217 │ 161.78 ms │   4.81 ms │       - │      - │    - │  30.518 MiB │  6.200 GiB/s │ [copy pageable to device memory]
│ 1249 │ 167.66 ms │ 730.51 µs │       - │      - │    - │   3.815 MiB │  5.100 GiB/s │ [copy pageable to device memory]
│ 1974 │ 168.68 ms │   1.34 ms │       - │      - │    - │   7.645 MiB │  5.582 GiB/s │ [copy pageable to device memory]
│ 2006 │ 171.79 ms │ 721.93 µs │       - │      - │    - │   3.815 MiB │  5.160 GiB/s │ [copy pageable to device memory]
│ 2023 │ 172.92 ms │ 303.78 ms │     256 │     40 │   95 │           - │            - │ assemble_gpu_(CuSparseDeviceMatrixCSC<Float32, Int32, 1l>, CuDeviceArray<Float32, 1l, 1l>, Stat
└──────┴───────────┴───────────┴─────────┴────────┴──────┴─────────────┴──────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────
