diff --git a/docs/src/topics/assembly.md b/docs/src/topics/assembly.md
index 868761af8e..11414ea77a 100644
--- a/docs/src/topics/assembly.md
+++ b/docs/src/topics/assembly.md
@@ -83,3 +83,310 @@ for t in 1:timesteps
     # ...
 end
 ```
+
+## Comparison of assembly strategies
+
+As discussed above there are various ways to assemble the local matrix into the global one.
+In particular, it was mentioned that naive indexing is very inefficient and that using an
+assembler is faster. To put some concrete numbers to these statements we compare a few
+strategies in this section. First we compare just a single assembly operation (i.e.
+assembling an already computed local matrix) and then, to relate this to a more realistic
+scenario, we compare the full matrix assembly including the integration over all the
+elements.
+
+!!! note "Pre-allocated global matrix"
+    All strategies that we compare below use a pre-allocated global matrix `K` with the
+    correct sparsity pattern. Starting with something like
+    `K = spzeros(ndofs(dh), ndofs(dh))` and then inserting entries is excruciatingly slow
+    due to the sparse data structure, so this method is not even considered.
+
+For the comparison we need a representative global matrix to assemble into. In the
+following setup code we create a grid with triangles and a `DofHandler` with two fields
+(a quadratic vector field and a linear scalar field). From this we instantiate the global
+matrix.
+
+```@example assembly-perf
+using Ferrite
+
+# Interpolations
+ip_u = Lagrange{RefTriangle, 2}()^2 # Quadratic vector field
+ip_p = Lagrange{RefTriangle, 1}()   # Linear scalar field
+
+# DofHandler with two fields
+const N = 100
+grid = generate_grid(Triangle, (N, N))
+const dh = DofHandler(grid)
+add!(dh, :u, ip_u)
+add!(dh, :p, ip_p)
+close!(dh)
+
+# Global matrix with the correct sparsity pattern
+const K = allocate_matrix(dh)
+nothing # hide
+```
+
+#### Strategy 1: matrix indexing
+
+The first strategy is to index directly, using the vector of global dofs, into the global
+matrix:
+
+```@example assembly-perf
+function assemble_v1(_, K, dofs, Ke)
+    K[dofs, dofs] += Ke
+    return
+end
+nothing # hide
+```
+
+This looks very simple, but it is very inefficient (as the numbers will show later). To
+understand why the operation `K[dofs, dofs] += Ke` (with `K` being a sparse matrix) is so
+slow we can dig into the details.
+
+In Julia there is no "`+=`"-operation: `x += y` is identical to `x = x + y`. Translating
+this to our example we have
+
+```julia
+K[dofs, dofs] = K[dofs, dofs] + Ke
+```
+
+We can break this down a bit further by introducing some temporary variables:
+
+```julia
+tmp1 = K[dofs, dofs] # 1
+tmp2 = tmp1 + Ke     # 2
+K[dofs, dofs] = tmp2 # 3
+```
+
+Now the problem with this strategy becomes a bit more obvious:
+ - In line 1 there is first an allocation of a new matrix (`tmp1`) followed by indexing
+   into `K` to copy elements from `K` to `tmp1`. Both of these operations are rather
+   costly: allocations should always be minimized in tight loops, and indexing into a
+   sparse matrix is non-trivial due to the data structure. In addition, since the `dofs`
+   vector contains the global indices (which are neither sorted nor consecutive) we have
+   random access patterns.
+ - In line 2 there is another allocation of a matrix (`tmp2`) for the result of the
+   addition of `tmp1` and `Ke`.
+ - In line 3 we again need to index into the sparse matrix to copy over the elements from
+   `tmp2` to `K`. This essentially duplicates the indexing effort from line 1 since we
+   look up the same locations in `K` again.
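+
+The allocation and copy in line 1 are easy to observe in isolation. The following snippet
+is a small standalone illustration (the matrix `S` and the indices `idx` below are
+arbitrary placeholders, not the `K` and `dofs` from the setup above):
+
+```julia
+using SparseArrays
+
+S = sprand(1000, 1000, 0.01)   # some random sparse matrix
+idx = [17, 3, 640, 512]        # some unsorted "global dof" indices
+tmp = S[idx, idx]              # allocates and fills a brand new 4×4 sparse matrix
+@show typeof(tmp)              # SparseMatrixCSC: a copy, not a view into S
+@show tmp === S[idx, idx]      # false: every such lookup allocates a fresh matrix
+```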
+
+#### Strategy 2: scalar indexing
+
+A variant of the first strategy is to explicitly loop over the indices and add the
+elements individually as scalars:
+
+```@example assembly-perf
+function assemble_v2(_, K, dofs, Ke)
+    for (i, I) in pairs(dofs)
+        for (j, J) in pairs(dofs)
+            K[I, J] += Ke[i, j]
+        end
+    end
+    return
+end
+nothing # hide
+```
+
+The core operation `K[I, J] += Ke[i, j]` can still be broken down into
+
+```julia
+tmp1 = K[I, J]
+tmp2 = tmp1 + Ke[i, j]
+K[I, J] = tmp2
+```
+
+but the key difference here is that we index using integers (`I`, `J`, `i`, and `j`),
+which means that `tmp1` and `tmp2` are scalars which don't need to be allocated on the
+heap. This strategy thus eliminates the allocations that were present in the first
+strategy. However, we still look up the same location in `K` twice, and we still have
+random access patterns.
+
+#### Strategy 3: scalar indexing with single lookup
+
+To improve on the second strategy we will get rid of the double lookup into the sparse
+matrix `K`. While Julia doesn't have a "`+=`"-operation, there is an `addindex!` function
+in Ferrite which does exactly what we want: it adds a value to a specific location in a
+sparse matrix with a single lookup.
+
+```@example assembly-perf
+function assemble_v3(_, K, dofs, Ke)
+    for (i, I) in pairs(dofs)
+        for (j, J) in pairs(dofs)
+            Ferrite.addindex!(K, Ke[i, j], I, J)
+        end
+    end
+    return
+end
+nothing # hide
+```
+
+With this method we remove the double lookup, but the issue of random access patterns
+still remains.
+
+#### Strategy 4: using an assembler
+
+Finally, the last strategy we consider uses an assembler. The assembler is a specific
+data structure that pre-allocates some workspace to make the assembly more efficient:
+
+```@example assembly-perf
+function assemble_v4(assembler, _, dofs, Ke)
+    assemble!(assembler, dofs, Ke)
+    return
+end
+nothing # hide
+```
+
+The extra workspace inside the assembler is used to sort the dofs when `assemble!` is
+called. After sorting it is possible to loop over the sparse matrix data structure in one
+go instead of having to look up locations randomly.
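+
+To get some intuition for why sorted insertion is cheaper, consider the following
+conceptual sketch. The hypothetical helper below is *not* Ferrite's actual implementation
+(the assembler and `assemble!` hide these details), but it shows the kind of access
+pattern that sorting enables for one column of a `SparseMatrixCSC`:
+
+```julia
+using SparseArrays
+
+# Conceptual sketch: with sorted row indices, all insertions into one column can be
+# found with a single forward scan over the stored entries of that column, instead
+# of an independent search per entry.
+function add_column_sorted!(K::SparseMatrixCSC, col::Int, rows::Vector{Int}, vals::Vector{Float64})
+    r = K.colptr[col]                  # first stored entry of column `col`
+    stop = K.colptr[col + 1] - 1       # last stored entry of column `col`
+    for (row, val) in zip(rows, vals)  # `rows` is assumed to be sorted
+        while r <= stop && K.rowval[r] < row
+            r += 1
+        end
+        if r > stop || K.rowval[r] != row
+            error("entry ($row, $col) is not stored in the sparsity pattern")
+        end
+        K.nzval[r] += val
+        r += 1
+    end
+    return K
+end
+```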
+
+### Single element assembly
+
+First we will compare the four functions above for a single assembly operation, i.e.
+inserting one local matrix into the global one. For this we simply create a random local
+matrix, since we are not concerned with the actual values. We also pick the "middle"
+element and extract the dofs for that element. Finally, an assembler is created with
+`start_assemble` to use with the fourth strategy.
+
+```@example assembly-perf
+dofs_per_cell = ndofs_per_cell(dh)
+const Ke = rand(dofs_per_cell, dofs_per_cell)
+const dofs = celldofs(dh, N * N ÷ 2)
+
+const assembler = start_assemble(K)
+nothing # hide
+```
+
+We use BenchmarkTools to measure the performance:
+
+```@example assembly-perf
+assemble_v1(assembler, K, dofs, Ke) # hide
+assemble_v2(assembler, K, dofs, Ke) # hide
+assemble_v3(assembler, K, dofs, Ke) # hide
+assemble_v4(assembler, K, dofs, Ke) # hide
+nothing # hide
+```
+
+```julia
+using BenchmarkTools
+
+@btime assemble_v1(assembler, K, dofs, Ke) evals = 10 setup = Ferrite.fillzero!(K)
+@btime assemble_v2(assembler, K, dofs, Ke) evals = 10 setup = Ferrite.fillzero!(K)
+@btime assemble_v3(assembler, K, dofs, Ke) evals = 10 setup = Ferrite.fillzero!(K)
+@btime assemble_v4(assembler, K, dofs, Ke) evals = 10 setup = Ferrite.fillzero!(K)
+```
+
+The results below are obtained on a MacBook Pro with an Apple M3 CPU.
+
+```
+3.627 ms (40 allocations: 42.33 MiB)
+2.362 μs (0 allocations: 0 bytes)
+1.283 μs (0 allocations: 0 bytes)
+404.100 ns (0 allocations: 0 bytes)
+```
+
+The results match what we expect based on the explanations above:
+ - Between strategy 1 and 2 we got rid of the allocations completely and decreased the
+   time by a factor of 1500(!).
+ - Between strategy 2 and 3 we got rid of the double lookup and decreased the time by a
+   factor of 2.
+ - Between strategy 3 and 4 we got rid of the random lookup order and decreased the time
+   by a factor of 3.
+
+The most important thing for this benchmark is to get rid of the allocations. By using an
+assembler instead of doing the naive thing we reduce the runtime by a factor of almost
+9000(!!).
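+
+As a quick sanity check (not part of the benchmark itself) one can also verify that the
+four variants really insert the same values into `K` for this element, for example by
+comparing the norm of the stored values after each variant has been applied to a zeroed
+matrix:
+
+```julia
+using LinearAlgebra: norm
+
+# Sanity check: all four strategies should produce the same global matrix.
+reference = map((assemble_v1, assemble_v2, assemble_v3, assemble_v4)) do f
+    Ferrite.fillzero!(K)     # reset the stored values of K
+    asm = start_assemble(K)  # assembler for strategy 4 (ignored by the other strategies)
+    f(asm, K, dofs, Ke)
+    norm(K.nzval)
+end
+@assert all(v -> v ≈ reference[1], reference)
+```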
+
+### Full system assembly
+
+We will now compare the four strategies in a more realistic scenario where we assemble all
+elements. This is to put the assembly performance in relation to other operations in the
+finite element program. Assembly performance might not matter in the end if other things
+dominate the runtime.
+
+For this comparison we use the global assembly routine from
+[Tutorial 8: Stokes flow](stokes-flow.md). The routine is modified to i) only assemble the
+global matrix and ii) take an assembly function as the first argument so that we can pass
+the four variants from above.
+
+```@example assembly-perf
+function assemble_system!(assembler_function::F, K, dh, cvu, cvp) where {F}
+    assembler = start_assemble(K)
+    ke = zeros(ndofs_per_cell(dh), ndofs_per_cell(dh))
+    range_u = dof_range(dh, :u)
+    ndofs_u = length(range_u)
+    range_p = dof_range(dh, :p)
+    ndofs_p = length(range_p)
+    ϕᵤ = Vector{Vec{2,Float64}}(undef, ndofs_u)
+    ∇ϕᵤ = Vector{Tensor{2,2,Float64,4}}(undef, ndofs_u)
+    divϕᵤ = Vector{Float64}(undef, ndofs_u)
+    ϕₚ = Vector{Float64}(undef, ndofs_p)
+    for cell in CellIterator(dh)
+        reinit!(cvu, cell)
+        reinit!(cvp, cell)
+        ke .= 0
+        for qp in 1:getnquadpoints(cvu)
+            dΩ = getdetJdV(cvu, qp)
+            for i in 1:ndofs_u
+                ϕᵤ[i] = shape_value(cvu, qp, i)
+                ∇ϕᵤ[i] = shape_gradient(cvu, qp, i)
+                divϕᵤ[i] = shape_divergence(cvu, qp, i)
+            end
+            for i in 1:ndofs_p
+                ϕₚ[i] = shape_value(cvp, qp, i)
+            end
+            for (i, I) in pairs(range_u), (j, J) in pairs(range_u)
+                ke[I, J] += ( ∇ϕᵤ[i] ⊡ ∇ϕᵤ[j] ) * dΩ
+            end
+            for (i, I) in pairs(range_u), (j, J) in pairs(range_p)
+                ke[I, J] += ( -divϕᵤ[i] * ϕₚ[j] ) * dΩ
+            end
+            for (i, I) in pairs(range_p), (j, J) in pairs(range_u)
+                ke[I, J] += ( -divϕᵤ[j] * ϕₚ[i] ) * dΩ
+            end
+        end
+        assembler_function(assembler, K, celldofs(cell), ke)
+    end
+    return
+end
+nothing # hide
+```
+
+Finally, we need cellvalues for the two fields in order to perform the integration:
+
+```@example assembly-perf
+qr = QuadratureRule{RefTriangle}(2)
+const cellvalues_u = CellValues(qr, ip_u)
+const cellvalues_p = CellValues(qr, ip_p)
+nothing # hide
+```
+
+```@example assembly-perf
+res = 1610.6222565099367 # hide
+# assemble_system!(assemble_v1, K, dh, cellvalues_u, cellvalues_p) # hide
+# @assert norm(K.nzval) ≈ res # hide
+assemble_system!(assemble_v2, K, dh, cellvalues_u, cellvalues_p) # hide
+@assert norm(K.nzval) ≈ res # hide
+assemble_system!(assemble_v3, K, dh, cellvalues_u, cellvalues_p) # hide
+@assert norm(K.nzval) ≈ res # hide
+assemble_system!(assemble_v4, K, dh, cellvalues_u, cellvalues_p) # hide
+@assert norm(K.nzval) ≈ res # hide
+nothing # hide
+```
+
+We can now time the full assembly for each of the four strategies:
+
+```julia
+@time assemble_system!(assemble_v1, K, dh, cellvalues_u, cellvalues_p)
+@time assemble_system!(assemble_v2, K, dh, cellvalues_u, cellvalues_p)
+@time assemble_system!(assemble_v3, K, dh, cellvalues_u, cellvalues_p)
+@time assemble_system!(assemble_v4, K, dh, cellvalues_u, cellvalues_p)
+```
+
+Results (running on the same machine as above):
+
+```
+75.229110 seconds (799.99 k allocations: 826.668 GiB, 8.20% gc time)
+ 0.075369 seconds (12 allocations: 3.516 KiB)
+ 0.037568 seconds (12 allocations: 3.516 KiB)
+ 0.027457 seconds (14 allocations: 3.766 KiB)
+```
+
+This follows the same trend as the benchmarks for individual cell assembly and shows that
+the efficiency of the assembly strategy is crucial for the overall performance of the
+program. In particular, this benchmark shows that the allocations in such a tight loop,
+which the first strategy produces, are very costly and put a strain on the garbage
+collector: 8% of the time is spent in GC instead of crunching numbers.
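+
+If one wants to capture these numbers programmatically instead of reading the printed
+`@time` output, something like the following sketch (not part of the comparison above)
+can be used. Note that, as written, the reported time for each variant includes the
+compilation of `assemble_system!` specialized for that variant on its first call.
+
+```julia
+# Collect elapsed time, GC time, and allocated bytes for each strategy using
+# `Base.@timed`, which returns these as fields of a named tuple.
+for f in (assemble_v1, assemble_v2, assemble_v3, assemble_v4)
+    stats = @timed assemble_system!(f, K, dh, cellvalues_u, cellvalues_p)
+    println(
+        rpad(string(f), 13),
+        round(stats.time; digits = 3), " s, ",
+        round(100 * stats.gctime / stats.time; digits = 1), "% gc time, ",
+        stats.bytes, " bytes allocated",
+    )
+end
+```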