From 2586a0a06a261835e6d67093f4393bf988395dd0 Mon Sep 17 00:00:00 2001
From: Fredrik Ekre
Date: Fri, 20 Sep 2024 15:26:52 +0200
Subject: [PATCH] Add a doc section comparing different assembly strategies

---
 docs/src/topics/assembly.md | 307 ++++++++++++++++++++++++++++++++++++
 1 file changed, 307 insertions(+)

diff --git a/docs/src/topics/assembly.md b/docs/src/topics/assembly.md
index 868761af8e..11414ea77a 100644
--- a/docs/src/topics/assembly.md
+++ b/docs/src/topics/assembly.md
@@ -83,3 +83,310 @@ for t in 1:timesteps
    # ...
end
```

## Comparison of assembly strategies

As discussed above there are various ways to assemble the local matrix into the global one.
In particular, it was mentioned that naive indexing is very inefficient and that using an
assembler is faster. To put some concrete numbers to these statements we compare a few
strategies in this section. First we compare just a single assembly operation (i.e.
inserting one already computed local matrix into the global matrix) and then, to relate this
to a more realistic scenario, we compare the full matrix assembly including the integration
of all the elements.

!!! note "Pre-allocated global matrix"
    All strategies that we compare below use a pre-allocated global matrix `K` with the
    correct sparsity pattern. Starting with something like
    `K = spzeros(ndofs(dh), ndofs(dh))` and then inserting entries is excruciatingly slow
    due to the sparse data structure, so this method is not even considered.

For the comparison we need a representative global matrix to assemble into. In the following
setup code we create a grid with triangles and a DofHandler with two fields (a quadratic
vector field and a linear scalar field). From this we instantiate the global matrix.

```@example assembly-perf
using Ferrite

# Interpolations
ip_u = Lagrange{RefTriangle, 2}()^2 # Quadratic vector field
ip_p = Lagrange{RefTriangle, 1}()   # Linear scalar field

# DofHandler with two fields
const N = 100
grid = generate_grid(Triangle, (N, N))
const dh = DofHandler(grid)
add!(dh, :u, ip_u)
add!(dh, :p, ip_p)
close!(dh)

# Global matrix with the correct sparsity pattern
const K = allocate_matrix(dh)
nothing # hide
```

#### Strategy 1: matrix indexing

The first strategy is to index directly into the global matrix, using the vector of global
dofs:

```@example assembly-perf
function assemble_v1(_, K, dofs, Ke)
    K[dofs, dofs] += Ke
    return
end
nothing # hide
```

This looks very simple, but it is very inefficient (as the numbers will show later). To
understand why the operation `K[dofs, dofs] += Ke` (with `K` being a sparse matrix) is so
slow we can dig into the details.

In Julia there is no "`+=`"-operation: `x += y` is identical to `x = x + y`. Translating
this to our example we have

```julia
K[dofs, dofs] = K[dofs, dofs] + Ke
```

We can break this down a bit further by introducing some temporary variables:

```julia
tmp1 = K[dofs, dofs] # 1
tmp2 = tmp1 + Ke     # 2
K[dofs, dofs] = tmp2 # 3
```

Now the problem with this strategy becomes a bit more obvious:
 - In line 1 there is first an allocation of a new matrix (`tmp1`) followed by indexing into
   `K` to copy elements from `K` to `tmp1`. Both of these operations are rather costly:
   allocations should always be minimized in tight loops, and indexing into a sparse matrix
   is non-trivial due to the data structure. In addition, since the `dofs` vector contains
   the global indices (which are neither sorted nor consecutive) we end up with a random
   access pattern.
 - In line 2 there is another allocation of a matrix (`tmp2`) for the result of the addition
   of `tmp1` and `Ke`.
 - In line 3 we again need to index into the sparse matrix to copy over the elements from
   `tmp2` to `K`. This essentially duplicates the indexing effort from line 1 since we look
   up the same locations in `K` again.
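
One way to see the cost of these hidden temporaries in isolation is to benchmark the two
indexing expressions by themselves. The snippet below is only an illustrative sketch (the
variables `dofs_ex` and `Ke_ex` are made up for this purpose; the `dofs` and `Ke` used in
the actual comparison are defined further down):

```julia
using BenchmarkTools

dofs_ex = celldofs(dh, 1)                      # global dofs of an arbitrary cell
Ke_ex = rand(length(dofs_ex), length(dofs_ex)) # dummy local matrix

@btime $K[$dofs_ex, $dofs_ex]            # line 1: allocates tmp1 and copies entries from K
@btime $K[$dofs_ex, $dofs_ex] + $Ke_ex   # lines 1 and 2: allocates both tmp1 and tmp2
```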
#### Strategy 2: scalar indexing

A variant of the first strategy is to explicitly loop over the indices and add the matrix
entries individually as scalars:

```@example assembly-perf
function assemble_v2(_, K, dofs, Ke)
    for (i, I) in pairs(dofs)
        for (j, J) in pairs(dofs)
            K[I, J] += Ke[i, j]
        end
    end
    return
end
nothing # hide
```

The core operation, `K[I, J] += Ke[i, j]`, can still be broken down into

```julia
tmp1 = K[I, J]
tmp2 = tmp1 + Ke[i, j]
K[I, J] = tmp2
```

but the key difference here is that we index using integers (`I`, `J`, `i`, and `j`), which
means that `tmp1` and `tmp2` are scalars that don't need to be allocated on the heap. This
strategy thus eliminates the allocations that were present in the first strategy. However,
we still look up the same location in `K` twice, and we still have a random access pattern.

#### Strategy 3: scalar indexing with single lookup

To improve on the second strategy we will get rid of the double lookup into the sparse
matrix `K`. While Julia doesn't have a "`+=`"-operation, there is an `addindex!` function in
Ferrite which does exactly what we want: it adds a value to a specific location in a sparse
matrix with a single lookup.

```@example assembly-perf
function assemble_v3(_, K, dofs, Ke)
    for (i, I) in pairs(dofs)
        for (j, J) in pairs(dofs)
            Ferrite.addindex!(K, Ke[i, j], I, J)
        end
    end
    return
end
nothing # hide
```

With this method we remove the double lookup, but the issue of the random access pattern
still remains.

#### Strategy 4: using an assembler

Finally, the last strategy we consider uses an assembler. The assembler is a specific
data structure that pre-allocates some workspace to make the assembly more efficient:

```@example assembly-perf
function assemble_v4(assembler, _, dofs, Ke)
    assemble!(assembler, dofs, Ke)
    return
end
nothing # hide
```

The extra workspace inside the assembler is used to sort the dofs when `assemble!` is
called. After sorting it is possible to loop over the sparse matrix data structure in one go
instead of having to look up locations randomly.
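
Since all four variants are supposed to insert the same local matrix into `K`, a quick
sanity check can be reassuring before benchmarking. The following is only an illustrative
sketch (not part of the comparison itself): it inserts a random local matrix for an
arbitrary cell with each variant and checks that the resulting global matrices agree.

```julia
let dofs = celldofs(dh, 1), Ke = rand(ndofs_per_cell(dh), ndofs_per_cell(dh))
    results = map((assemble_v1, assemble_v2, assemble_v3, assemble_v4)) do f
        a = start_assemble(K) # creating the assembler also zeroes out K
        f(a, K, dofs, Ke)     # insert Ke with the given strategy
        copy(K)
    end
    @assert all(Kf -> Kf ≈ first(results), results)
end
```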
### Single element assembly

First we will compare the four functions above for a single assembly operation, i.e.
inserting one local matrix into the global one. For this we simply create a random local
matrix, since we are not concerned with the actual values. We also pick the "middle" element
and extract the dofs for that element. Finally, an assembler is created with
`start_assemble` to use with the fourth strategy.

```@example assembly-perf
dofs_per_cell = ndofs_per_cell(dh)
const Ke = rand(dofs_per_cell, dofs_per_cell)
const dofs = celldofs(dh, N * N ÷ 2)

const assembler = start_assemble(K)
```

We use BenchmarkTools to measure the performance:

```@example assembly-perf
assemble_v1(assembler, K, dofs, Ke) # hide
assemble_v2(assembler, K, dofs, Ke) # hide
assemble_v3(assembler, K, dofs, Ke) # hide
assemble_v4(assembler, K, dofs, Ke) # hide
nothing # hide
```

```julia
using BenchmarkTools

@btime assemble_v1(assembler, K, dofs, Ke) evals = 10 setup = Ferrite.fillzero!(K)
@btime assemble_v2(assembler, K, dofs, Ke) evals = 10 setup = Ferrite.fillzero!(K)
@btime assemble_v3(assembler, K, dofs, Ke) evals = 10 setup = Ferrite.fillzero!(K)
@btime assemble_v4(assembler, K, dofs, Ke) evals = 10 setup = Ferrite.fillzero!(K)
```

The results below are obtained on a MacBook Pro with an Apple M3 CPU.

```
3.627 ms (40 allocations: 42.33 MiB)
2.362 μs (0 allocations: 0 bytes)
1.283 μs (0 allocations: 0 bytes)
404.100 ns (0 allocations: 0 bytes)
```

The results match what we expect based on the explanations above:
 - Between strategy 1 and 2 we got rid of the allocations completely and decreased the time
   by a factor of 1500(!).
 - Between strategy 2 and 3 we got rid of the double lookup and decreased the time by a
   factor of 2.
 - Between strategy 3 and 4 we got rid of the random lookup order and decreased the time by
   a factor of 3.

The most important thing for this benchmark is to get rid of the allocations. By using an
assembler instead of doing the naive thing we reduce the runtime by a factor of almost
9000(!!).
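
For reference, the factors quoted above follow directly from the measured timings (the exact
numbers will of course vary between machines):

```julia
t1, t2, t3, t4 = 3.627e-3, 2.362e-6, 1.283e-6, 404.100e-9 # timings above, in seconds

t1 / t2 # ≈ 1536: eliminating the allocations (strategy 1 -> 2)
t2 / t3 # ≈ 1.8:  eliminating the double lookup (strategy 2 -> 3)
t3 / t4 # ≈ 3.2:  eliminating the random access pattern (strategy 3 -> 4)
t1 / t4 # ≈ 8976: naive matrix indexing vs. using an assembler
```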
### Full system assembly

We will now compare the four strategies in a more realistic scenario where we assemble all
elements. This is to put the assembly performance in relation to other operations in the
finite element program. Assembly performance might not matter in the end if other things
dominate the runtime.

For this comparison we use the global assembly routine from
[Tutorial 8: Stokes flow](stokes-flow.md). The routine is modified to i) only assemble the
global matrix and ii) take an assembly function as the first argument so that we can pass
the four variants from above.

```@example assembly-perf
function assemble_system!(assembler_function::F, K, dh, cvu, cvp) where {F}
    assembler = start_assemble(K)
    ke = zeros(ndofs_per_cell(dh), ndofs_per_cell(dh))
    range_u = dof_range(dh, :u)
    ndofs_u = length(range_u)
    range_p = dof_range(dh, :p)
    ndofs_p = length(range_p)
    ϕᵤ = Vector{Vec{2,Float64}}(undef, ndofs_u)
    ∇ϕᵤ = Vector{Tensor{2,2,Float64,4}}(undef, ndofs_u)
    divϕᵤ = Vector{Float64}(undef, ndofs_u)
    ϕₚ = Vector{Float64}(undef, ndofs_p)
    for cell in CellIterator(dh)
        reinit!(cvu, cell)
        reinit!(cvp, cell)
        ke .= 0
        for qp in 1:getnquadpoints(cvu)
            dΩ = getdetJdV(cvu, qp)
            for i in 1:ndofs_u
                ϕᵤ[i] = shape_value(cvu, qp, i)
                ∇ϕᵤ[i] = shape_gradient(cvu, qp, i)
                divϕᵤ[i] = shape_divergence(cvu, qp, i)
            end
            for i in 1:ndofs_p
                ϕₚ[i] = shape_value(cvp, qp, i)
            end
            for (i, I) in pairs(range_u), (j, J) in pairs(range_u)
                ke[I, J] += ( ∇ϕᵤ[i] ⊡ ∇ϕᵤ[j] ) * dΩ
            end
            for (i, I) in pairs(range_u), (j, J) in pairs(range_p)
                ke[I, J] += ( -divϕᵤ[i] * ϕₚ[j] ) * dΩ
            end
            for (i, I) in pairs(range_p), (j, J) in pairs(range_u)
                ke[I, J] += ( -divϕᵤ[j] * ϕₚ[i] ) * dΩ
            end
        end
        assembler_function(assembler, K, celldofs(cell), ke)
    end
    return
end
nothing # hide
```

Finally, we need cellvalues for the two fields in order to perform the integration:

```@example assembly-perf
qr = QuadratureRule{RefTriangle}(2)
const cellvalues_u = CellValues(qr, ip_u)
const cellvalues_p = CellValues(qr, ip_p)
nothing # hide
```

```@example assembly-perf
res = 1610.6222565099367 # hide
# assemble_system!(assemble_v1, K, dh, cellvalues_u, cellvalues_p) # hide
# @assert norm(K.nzval) ≈ res # hide
assemble_system!(assemble_v2, K, dh, cellvalues_u, cellvalues_p) # hide
@assert norm(K.nzval) ≈ res # hide
assemble_system!(assemble_v3, K, dh, cellvalues_u, cellvalues_p) # hide
@assert norm(K.nzval) ≈ res # hide
assemble_system!(assemble_v4, K, dh, cellvalues_u, cellvalues_p) # hide
@assert norm(K.nzval) ≈ res # hide
nothing # hide
```

```julia
@time assemble_system!(assemble_v1, K, dh, cellvalues_u, cellvalues_p)
@time assemble_system!(assemble_v2, K, dh, cellvalues_u, cellvalues_p)
@time assemble_system!(assemble_v3, K, dh, cellvalues_u, cellvalues_p)
@time assemble_system!(assemble_v4, K, dh, cellvalues_u, cellvalues_p)
```

Results (running on the same machine as above):

```
75.229110 seconds (799.99 k allocations: 826.668 GiB, 8.20% gc time)
 0.075369 seconds (12 allocations: 3.516 KiB)
 0.037568 seconds (12 allocations: 3.516 KiB)
 0.027457 seconds (14 allocations: 3.766 KiB)
```

This follows the same trend as the benchmarks for individual cell assembly and shows that
the efficiency of the assembly strategy is crucial for the overall performance of the
program. In particular, this benchmark shows that the allocations in such a tight loop, as
in the first strategy, are very costly and put a strain on the garbage collector: 8% of the
time is spent in GC instead of crunching numbers.
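
If you want to extract the GC overhead programmatically, instead of reading it off the
`@time` output, the `@timed` macro from `Base` returns it as part of its result. A small
sketch of what that could look like for the first strategy:

```julia
stats = @timed assemble_system!(assemble_v1, K, dh, cellvalues_u, cellvalues_p)
gc_fraction = stats.gctime / stats.time # fraction of the total time spent in GC
println("GC time: ", round(100 * gc_fraction; digits = 1), " %")
```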