Use LoopVectorization to maximize memory bandwidth usage when creating identity matrix #473

mzy2240 · 2023-03-08T13:14:59Z

Recently I found that creating a large identity matrix the default way (e.g. Matrix(1.0I, 20000, 20000)) can be relatively slow, presumably because a single thread cannot saturate the available memory bandwidth. I get better results with something like the following:

dp = Array{Float64}(undef, 20000, 20000);
Threads.@threads for j ∈ axes(dp,2)
    @simd for i ∈ axes(dp, 1)
        dp[i,j] = ifelse(i==j, 1.0, 0.0);
    end
end

I am wondering whether LoopVectorization could give better performance, or at least reduce my reliance on native Threads.@threads. Thanks in advance!

chriselrod (Maintainer) · 2023-03-08T15:33:45Z

Hmm, something is going wrong here. In theory, @tturbo should be all you need, but it performs much worse than single-threaded @turbo, which in turn does about the same as the default single-threaded implementation.
Finding out why would require looking into what's going wrong with the thread scheduling.

You could also consider Polyester.@batch, which has a minbatch argument you can use to limit the rate at which it adds more threads.
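
To make the effect of minbatch concrete: assuming the rule described in Polyester's README (minbatch=n means at most "number of loop iterations ÷ n" threads are used), the cap can be worked out with plain arithmetic. The sketch below does not call Polyester; max_batch_threads is a hypothetical helper written only for illustration, and the minbatch values are arbitrary examples.

```julia
# Hypothetical helper (not a Polyester function): upper bound on the number
# of threads @batch would use, assuming the rule "at most iterations ÷ minbatch".
max_batch_threads(n_iters, minbatch, n_available) =
    clamp(n_iters ÷ minbatch, 1, n_available)

n_cols = 20_000                       # columns of the matrix discussed in this thread
for minbatch in (500, 1_250, 5_000)   # illustrative values only
    println("minbatch = ", minbatch, " -> at most ",
            max_batch_threads(n_cols, minbatch, Threads.nthreads()), " thread(s)")
end
```

Under that assumption, a larger minbatch means fewer threads are started for the same loop, which is one way to keep extra threads from competing for memory bandwidth that is already saturated.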

mzy2240 (Author) · 2023-03-08T16:52:16Z

I am getting slightly different results, but the fastest implementation is still the one using Threads.@threads.

using BenchmarkTools, LinearAlgebra, LoopVectorization  # @benchmark, I, @turbo/@tturbo
@benchmark Matrix(1.0I, 20000, 20000)

BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):  843.900 ms …    1.179 s  ┊ GC (min … max):  0.10% … 9.66%
 Time  (median):        1.111 s               ┊ GC (median):    10.25%
 Time  (mean ± σ):      1.025 s ± 165.337 ms  ┊ GC (mean ± σ):   9.48% ± 8.79%

  █                                               ▁     ▁     ▁  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁█ ▁
  844 ms           Histogram: frequency by time          1.18 s <

 Memory estimate: 2.98 GiB, allocs estimate: 2.

function i_turbo()
    dp = Array{Float64}(undef, 20000, 20000);
    @turbo for j ∈ axes(dp,2)
        for i ∈ axes(dp, 1)
            dp[i,j] = ifelse(i==j, 1.0, 0.0);
        end
    end
    return dp;
end

i_turbo()
@benchmark i_turbo()

BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.156 s …    1.562 s  ┊ GC (min … max): 0.08% … 13.87%
 Time  (median):     1.333 s               ┊ GC (median):    7.36%
 Time  (mean ± σ):   1.346 s ± 181.958 ms  ┊ GC (mean ± σ):  7.68% ±  7.92%

  █           █                         █                  █  
  █▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.16 s         Histogram: frequency by time         1.56 s <

 Memory estimate: 2.98 GiB, allocs estimate: 2.

function i_tturbo()
    dp = Array{Float64}(undef, 20000, 20000);
    @tturbo for j ∈ axes(dp,2)
        for i ∈ axes(dp, 1)
            dp[i,j] = ifelse(i==j, 1.0, 0.0);
        end
    end
    return dp;
end

i_tturbo()
@benchmark i_tturbo()

BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  499.124 ms …    1.105 s  ┊ GC (min … max):  0.16% … 42.38%
 Time  (median):     963.943 ms               ┊ GC (median):    36.06%
 Time  (mean ± σ):   861.720 ms ± 258.747 ms  ┊ GC (mean ± σ):  30.91% ± 20.18%

  █      █                                    █    █      █   █  
  █▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁█▁▁▁█ ▁
  499 ms           Histogram: frequency by time          1.11 s <

 Memory estimate: 2.98 GiB, allocs estimate: 2.

function i_mt()
    dp = Array{Float64}(undef, 20000, 20000);
    Threads.@threads for j ∈ axes(dp,2)
        @simd for i ∈ axes(dp, 1)
            dp[i,j] = ifelse(i==j, 1.0, 0.0);
        end
    end
    return dp;
end

i_mt()
@benchmark i_mt()

BenchmarkTools.Trial: 11 samples with 1 evaluation.
 Range (min … max):  312.225 ms … 740.457 ms  ┊ GC (min … max):  0.31% … 51.99%
 Time  (median):     473.053 ms               ┊ GC (median):    28.27%
 Time  (mean ± σ):   476.158 ms ± 114.606 ms  ┊ GC (mean ± σ):  28.81% ± 17.80%

                        █                                        
  ▇▁▁▁▁▁▇▁▁▁▁▁▁▁▇▁▇▁▁▁▁▁█▁▇▁▇▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  312 ms           Histogram: frequency by time          740 ms <

 Memory estimate: 2.98 GiB, allocs estimate: 199.
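
Each sample in the benchmarks above allocates a fresh 2.98 GiB array, and the reported GC share reaches roughly 52% for i_mt, so the timings mix allocation and garbage collection with the actual writes. A minimal sketch of how the fill could be timed on its own, assuming an in-place variant is acceptable: i_mt! is a hypothetical name, the matrix size is reduced to keep the example light, and Base's @elapsed stands in for BenchmarkTools.

```julia
# Hypothetical in-place variant of i_mt: the caller supplies the buffer.
function i_mt!(dp::Matrix{Float64})
    Threads.@threads for j in axes(dp, 2)
        @simd for i in axes(dp, 1)
            @inbounds dp[i, j] = ifelse(i == j, 1.0, 0.0)
        end
    end
    return dp
end

n = 2_000                          # reduced from 20000 to keep the sketch light
dp = Array{Float64}(undef, n, n)   # allocated once, outside the timed region
i_mt!(dp)                          # warm-up: compiles and touches every page
t_fill = @elapsed i_mt!(dp)        # times only the writes
t_total = @elapsed i_mt!(Array{Float64}(undef, n, n))  # allocation + writes
println("fill only: ", t_fill, " s; with allocation: ", t_total, " s")
```

If the fill-only time is much smaller than the total, the gap between the variants is more likely dominated by allocation, page faults, and GC than by the loop itself.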

mzy2240 (Author) · 2023-03-08T17:02:43Z

Using the suggested method (Polyester.@batch with minbatch), I get performance similar to Threads.@threads. I am wondering whether it could be improved further.

using Polyester  # provides @batch
function i_polyester()
    dp = Array{Float64}(undef, 20000, 20000);
    @batch minbatch=1250 for j ∈ axes(dp,2)
        for i ∈ axes(dp, 1)
            dp[i,j] = ifelse(i==j, 1.0, 0.0);
        end
    end
    return dp;
end

i_polyester()
@benchmark i_polyester()

BenchmarkTools.Trial: 10 samples with 1 evaluation.
 Range (min … max):  266.641 ms … 633.969 ms  ┊ GC (min … max):  0.28% … 47.30%
 Time  (median):     508.365 ms               ┊ GC (median):    34.35%
 Time  (mean ± σ):   501.235 ms ± 116.029 ms  ┊ GC (mean ± σ):  31.46% ± 15.68%

  ▁          ▁                          █▁▁        ▁   ▁  ▁   ▁  
  █▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁███▁▁▁▁▁▁▁▁█▁▁▁█▁▁█▁▁▁█ ▁
  267 ms           Histogram: frequency by time          634 ms <

 Memory estimate: 2.98 GiB, allocs estimate: 2.
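
Since the thread is about memory bandwidth, the minimum times reported above can be converted into an effective write rate. The sketch below is only arithmetic on numbers copied from the benchmark output in this thread; because every sample also includes allocation, page faults, and possibly GC, the results are best read as lower bounds on what the hardware can sustain.

```julia
# Bytes written per call: a 20000×20000 Float64 matrix.
n = 20_000
bytes = n * n * sizeof(Float64)          # 3.2e9 bytes, about 2.98 GiB

# Minimum times (seconds) copied from the benchmark output in this thread.
min_times = [
    "Matrix(1.0I, n, n)" => 0.843900,
    "i_turbo"            => 1.156,
    "i_tturbo"           => 0.499124,
    "i_mt"               => 0.312225,
    "i_polyester"        => 0.266641,
]

# Effective write rate in GB/s (1 GB = 1e9 bytes) for each variant.
rates = [name => bytes / t / 1e9 for (name, t) in min_times]
for (name, r) in rates
    println(rpad(name, 20), round(r; digits = 2), " GB/s")
end
```

This works out to roughly 3.8, 2.8, 6.4, 10.2, and 12.0 GB/s respectively, so the two fastest variants move data about three times faster than the default single-threaded constructor in these runs.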
