[MemoryDiagnoser] inaccurate and influenced by [GlobalSetup] work #1599

Sergio0694 · 2020-11-19T12:26:59Z

Hi, I've been trying to use BenchmarkDotNet to profile the memory usage improvements in a new version of ComputeSharp, but I'm struggling to make sense of the reported memory usage. I'm wondering whether I'm doing something wrong or whether there are some issues/caveats with [MemoryDiagnoser], as the reported memory allocations seem a bit off.

All the code below and results are from the investigation/bdn branch in the ComputeSharp repo, for reference.

Repro steps

1. Add the CI nuget.config file for BenchmarkDotNet as explained in this comment
2. Clone the repo and check out investigation/bdn
3. Build ComputeSharp.Benchmark in Release, then run the benchmark as usual with dotnet ComputeSharp.Benchmark

Details

Running that benchmark gives me the following:

Benchmark results:

// * Detailed results *
DnnBenchmark.GpuWithNoTemporaryBuffers: Job-TWEPVA(Toolchain=5.0)
Runtime = .NET 5.0.0 (5.0.20.51904), X64 RyuJIT; GC = Concurrent Workstation
Mean = 24.785 ms, StdErr = 0.005 ms (0.02%), N = 13, StdDev = 0.019 ms
Min = 24.739 ms, Q1 = 24.777 ms, Median = 24.789 ms, Q3 = 24.794 ms, Max = 24.820 ms
IQR = 0.018 ms, LowerFence = 24.750 ms, UpperFence = 24.821 ms
ConfidenceInterval = [24.763 ms; 24.808 ms] (CI 99.9%), Margin = 0.023 ms (0.09% of Mean)
Skewness = -0.63, Kurtosis = 3.75, MValue = 2
-------------------- Histogram --------------------
[24.728 ms ; 24.831 ms) | @@@@@@@@@@@@@
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.12.1.1466-nightly, OS=Windows 10.0.19041.610 (2004/May2020Update/20H1)
AMD Ryzen 7 2700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=5.0.100
  [Host]     : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
  Job-TWEPVA : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT

Toolchain=5.0

|                    Method |     Mean |    Error |   StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------------------------- |---------:|---------:|---------:|------:|------:|------:|----------:|
| GpuWithNoTemporaryBuffers | 24.79 ms | 0.023 ms | 0.019 ms |     - |     - |     - |      45 B |

I was confused by those 45 B of allocations (after the initial warmup, the benchmark should, in theory, do no allocations). So I ran the VS memory profiler to have a look (just uncomment the #define PROFILER in the main file of the benchmark project).
With that, I got the following:

To double-check, I also used dotMemory:

VS reports no allocations at all while running the benchmark code in a loop, so I'm very confused about those 45 B reported by BDN. I know that the [MemoryDiagnoser] has a reported accuracy of 99.5%, but I figured the difference between no allocations at all, and 45 bytes, could be considered not within that threshold? 🤔
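
As a third cross-check alongside the two profilers, the allocation counter exposed by the runtime can be read directly around a loop. The following is only a minimal sketch: `AllocationProbe` and `MeasureAllocatedBytes` are hypothetical names introduced for illustration, and the empty lambda stands in for the real benchmark body. It relies on `GC.GetAllocatedBytesForCurrentThread()`, which only counts allocations made on the calling thread, so anything allocated on other threads during the loop would not show up here.

```csharp
using System;

public static class AllocationProbe
{
    // Hypothetical helper: returns the number of bytes allocated on the
    // current thread while invoking `action` `iterations` times.
    public static long MeasureAllocatedBytes(Action action, int iterations)
    {
        // One unmeasured call so JIT compilation and lazy initialization
        // triggered by the first invocation are not counted.
        action();

        long before = GC.GetAllocatedBytesForCurrentThread();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        long after = GC.GetAllocatedBytesForCurrentThread();

        return after - before;
    }

    public static void Main()
    {
        // Placeholder workload (assumption): a real check would invoke
        // the benchmark method here instead of an empty lambda.
        long bytes = MeasureAllocatedBytes(() => { }, 1000);
        Console.WriteLine($"Allocated on this thread: {bytes} B over 1000 iterations");
    }
}
```

If this counter stays flat while a process-wide measurement does not, that would suggest the extra bytes come from somewhere other than the benchmark thread, though this sketch alone cannot tell where.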

Additional info

If you look at my [GlobalSetup] method (specifically, these lines), you will see I'm manually doing a whole bunch of warmup iterations and GC collections from there. Turns out that:

1. If I remove those warmup iterations completely and let BenchmarkDotNet handle that, the benchmark results go completely off the rails, and I get 1 KB of reported memory allocations (?!).
2. If I only do a single benchmark invocation as warmup (without that loop and without also calling GC.Collect), the reported allocations are more in line with the results above, but still a bit worse (58 B instead of 45 B).
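
For readers without the repo at hand, the general shape of the setup described above can be sketched as follows. This is not the actual ComputeSharp benchmark: `WarmupSketchBenchmark`, its `_cache` field, the iteration count of 16, and the `Workload` body are all hypothetical stand-ins, chosen only so that the first call does one-time work and later calls reuse it. It assumes the BenchmarkDotNet package is referenced.

```csharp
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class WarmupSketchBenchmark
{
    // Hypothetical stand-in for state that is expensive to create on
    // first use (the real benchmark compiles and caches a GPU shader).
    private byte[] _cache;

    [GlobalSetup]
    public void Setup()
    {
        // Manual warmup: invoke the workload a few times so that
        // one-time costs are paid before BenchmarkDotNet starts measuring.
        for (int i = 0; i < 16; i++)
        {
            Workload();
        }

        // Collect whatever garbage the warmup left behind.
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
    }

    [Benchmark]
    public int Workload()
    {
        // Allocate the cache on the first call only, then reuse it.
        _cache ??= new byte[1024];

        int sum = 0;
        foreach (byte b in _cache)
        {
            sum += b;
        }
        return sum;
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<WarmupSketchBenchmark>();
}
```

Point 1 corresponds to deleting the loop and the GC calls from `Setup`, and point 2 to replacing them with a single `Workload()` call.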

Benchmark results for point 1:

// * Detailed results *
DnnBenchmark.GpuWithNoTemporaryBuffers: Job-HQFLUQ(Toolchain=5.0)
Runtime = .NET 5.0.0 (5.0.20.51904), X64 RyuJIT; GC = Concurrent Workstation
Mean = 18.801 ms, StdErr = 0.034 ms (0.18%), N = 15, StdDev = 0.133 ms
Min = 18.618 ms, Q1 = 18.725 ms, Median = 18.746 ms, Q3 = 18.893 ms, Max = 19.014 ms
IQR = 0.168 ms, LowerFence = 18.472 ms, UpperFence = 19.145 ms
ConfidenceInterval = [18.658 ms; 18.943 ms] (CI 99.9%), Margin = 0.143 ms (0.76% of Mean)
Skewness = 0.38, Kurtosis = 1.69, MValue = 2
-------------------- Histogram --------------------
[18.547 ms ; 19.085 ms) | @@@@@@@@@@@@@@@
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.12.1.1466-nightly, OS=Windows 10.0.19041.610 (2004/May2020Update/20H1)
AMD Ryzen 7 2700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=5.0.100
  [Host]     : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
  Job-HQFLUQ : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT

Toolchain=5.0

|                    Method |     Mean |    Error |   StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------------------------- |---------:|---------:|---------:|------:|------:|------:|----------:|
| GpuWithNoTemporaryBuffers | 18.80 ms | 0.143 ms | 0.133 ms |     - |     - |     - |      1 KB |

Benchmark results for point 2:

// * Summary *

BenchmarkDotNet=v0.12.1.1466-nightly, OS=Windows 10.0.19041.610 (2004/May2020Update/20H1)
AMD Ryzen 7 2700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=5.0.100
  [Host]     : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
  Job-OZIUOW : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT

Toolchain=5.0

|                    Method |     Mean |    Error |   StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------------------------- |---------:|---------:|---------:|------:|------:|------:|----------:|
| GpuWithNoTemporaryBuffers | 24.82 ms | 0.068 ms | 0.063 ms |     - |     - |     - |      58 B |

I should note that the initial run of the benchmark is particularly heavy, as the library needs to generate and compile a GPU shader. Further invocations use the cached data and are much faster (as in, going down from over 1 s to about 24 ms in this benchmark). So I guess running the benchmark at least once in the [GlobalSetup] helps BenchmarkDotNet ignore that first big outlier while running the actual benchmarks. There are still two things I don't understand, though:

Why is the memory reporting apparently incorrect and not in line with VS' memory profiler? That is, even when I have [GlobalSetup] do a whole bunch of warmup iterations and GC collections, for good measure. What are those 45 B?
Why does the reported memory usage seem influenced by what I do in [GlobalSetup]?

Thanks! 😄


timcassell · 2020-12-06T08:24:56Z

Ah, I see you are using .NET 5.0. See #1543 and dotnet/runtime#45446

Try running it in .NET Framework and CoreRT to see what results those runtimes report.

Sergio0694 · 2020-12-07T21:22:37Z

Hey @timcassell - thank you for chiming in!
Unfortunately, my library only targets .NET 5 as a minimum, so I won't be able to test that with .NET Framework or CoreRT 😟
I guess I can just keep doing more work manually in the benchmark setup for now and then wait for BDN 0.13.0 to be published.
Will update again once that is available! 🙂

timcassell · 2023-08-16T05:50:55Z

Closing as duplicate of #1542.

petarpetrovt mentioned this issue Nov 26, 2020

Benchmarks give inconsistent memory results petarpetrovt/sorting-networks#26

Open

timcassell closed this as not planned on Aug 16, 2023

timcassell added the duplicate label Aug 16, 2023
