[MemoryDiagnoser] inaccurate and influenced by [GlobalSetup] work #1599

Closed
Sergio0694 opened this issue Nov 19, 2020 · 3 comments

Sergio0694 commented Nov 19, 2020

Hi, I've been trying to use BenchmarkDotNet to profile the memory usage improvements in a new version of ComputeSharp, but I'm struggling to make sense of the reported memory usage. I'm wondering whether I might be doing something wrong or whether there are some issues or caveats with [MemoryDiagnoser], as the reported memory allocations seem a bit off.

All the code below and results are from the investigation/bdn branch in the ComputeSharp repo, for reference.

Repro steps

  • Add the CI nuget.config file for BenchmarkDotNet as explained in this comment
  • Clone the repo and check out investigation/bdn
  • Build ComputeSharp.Benchmark in Release, run the benchmark as usual with dotnet ComputeSharp.Benchmark

Details

Running that benchmark gives me the following:

Benchmark results:
// * Detailed results *
DnnBenchmark.GpuWithNoTemporaryBuffers: Job-TWEPVA(Toolchain=5.0)
Runtime = .NET 5.0.0 (5.0.20.51904), X64 RyuJIT; GC = Concurrent Workstation
Mean = 24.785 ms, StdErr = 0.005 ms (0.02%), N = 13, StdDev = 0.019 ms
Min = 24.739 ms, Q1 = 24.777 ms, Median = 24.789 ms, Q3 = 24.794 ms, Max = 24.820 ms
IQR = 0.018 ms, LowerFence = 24.750 ms, UpperFence = 24.821 ms
ConfidenceInterval = [24.763 ms; 24.808 ms] (CI 99.9%), Margin = 0.023 ms (0.09% of Mean)
Skewness = -0.63, Kurtosis = 3.75, MValue = 2
-------------------- Histogram --------------------
[24.728 ms ; 24.831 ms) | @@@@@@@@@@@@@
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.12.1.1466-nightly, OS=Windows 10.0.19041.610 (2004/May2020Update/20H1)
AMD Ryzen 7 2700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=5.0.100
  [Host]     : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
  Job-TWEPVA : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT

Toolchain=5.0

|                    Method |     Mean |    Error |   StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------------------------- |---------:|---------:|---------:|------:|------:|------:|----------:|
| GpuWithNoTemporaryBuffers | 24.79 ms | 0.023 ms | 0.019 ms |     - |     - |     - |      45 B |

I was confused about those 45 B of allocations (after the initial warmup, the benchmark should do no allocations, in theory). So I ran the VS memory profiler to have a look (just uncomment that #define PROFILER in the main file of the benchmark project).
With that, I got the following:

[image: Visual Studio memory profiler results]

To double-check, I also used dotMemory:

[image: dotMemory results]

VS reports no allocations at all while running the benchmark code in a loop, so I'm very confused about those 45 B reported by BDN. I know that the [MemoryDiagnoser] has a reported accuracy of 99.5%, but I figured the difference between no allocations at all, and 45 bytes, could be considered not within that threshold? 🤔

Additional info

If you look at my [GlobalSetup] method (specifically, these lines), you will see I'm manually doing a whole bunch of warmup iterations and GC collections from there. Turns out that:

  1. If I remove those warmup iterations completely and let BenchmarkDotNet handle that, the benchmark results go completely off the rails, and I get 1 KB of reported memory allocations (?!).
  2. If I only do a single benchmark invocation as warmup (without that loop and without also calling GC.Collect), the reported allocations are more in line with the results above, but still a bit worse (58 B instead of 45 B).
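
For readers without access to the linked branch, the setup pattern described above can be sketched roughly as follows. This is only an illustration, not the actual ComputeSharp benchmark: the class name, the stand-in workload, and the iteration count of 16 are all assumptions made for this example, and it requires the BenchmarkDotNet NuGet package.

```csharp
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Hypothetical benchmark class; the real one in the issue is DnnBenchmark.
[MemoryDiagnoser]
public class WarmupInSetupBenchmark
{
    // Stand-in data for the real workload, allocated once during setup.
    private byte[] buffer = Array.Empty<byte>();

    [GlobalSetup]
    public void Setup()
    {
        buffer = new byte[1024];

        // Manual warmup: the first calls pay any one-time cost
        // (in the issue, generating and compiling a GPU shader).
        // The count of 16 is an arbitrary choice for this sketch.
        for (int i = 0; i < 16; i++)
        {
            Run();
        }

        // Force collections so setup garbage is cleaned up
        // before the measured iterations start.
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
    }

    // The measured method; it performs no managed allocations itself.
    [Benchmark]
    public int Run()
    {
        int sum = 0;
        foreach (byte b in buffer)
        {
            sum += b;
        }
        return sum;
    }
}

public static class Program
{
    public static void Main(string[] args) =>
        BenchmarkRunner.Run<WarmupInSetupBenchmark>();
}
```

Point 1 corresponds to deleting the loop and the GC calls from Setup, and point 2 to replacing them with a single Run() call.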

Benchmark results for point 1:
// * Detailed results *
DnnBenchmark.GpuWithNoTemporaryBuffers: Job-HQFLUQ(Toolchain=5.0)
Runtime = .NET 5.0.0 (5.0.20.51904), X64 RyuJIT; GC = Concurrent Workstation
Mean = 18.801 ms, StdErr = 0.034 ms (0.18%), N = 15, StdDev = 0.133 ms
Min = 18.618 ms, Q1 = 18.725 ms, Median = 18.746 ms, Q3 = 18.893 ms, Max = 19.014 ms
IQR = 0.168 ms, LowerFence = 18.472 ms, UpperFence = 19.145 ms
ConfidenceInterval = [18.658 ms; 18.943 ms] (CI 99.9%), Margin = 0.143 ms (0.76% of Mean)
Skewness = 0.38, Kurtosis = 1.69, MValue = 2
-------------------- Histogram --------------------
[18.547 ms ; 19.085 ms) | @@@@@@@@@@@@@@@
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.12.1.1466-nightly, OS=Windows 10.0.19041.610 (2004/May2020Update/20H1)
AMD Ryzen 7 2700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=5.0.100
  [Host]     : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
  Job-HQFLUQ : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT

Toolchain=5.0

|                    Method |     Mean |    Error |   StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------------------------- |---------:|---------:|---------:|------:|------:|------:|----------:|
| GpuWithNoTemporaryBuffers | 18.80 ms | 0.143 ms | 0.133 ms |     - |     - |     - |      1 KB |

Benchmark results for point 2:
// * Summary *

BenchmarkDotNet=v0.12.1.1466-nightly, OS=Windows 10.0.19041.610 (2004/May2020Update/20H1)
AMD Ryzen 7 2700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=5.0.100
  [Host]     : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
  Job-OZIUOW : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT

Toolchain=5.0

|                    Method |     Mean |    Error |   StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------------------------- |---------:|---------:|---------:|------:|------:|------:|----------:|
| GpuWithNoTemporaryBuffers | 24.82 ms | 0.068 ms | 0.063 ms |     - |     - |     - |      58 B |

I should note that the initial run of my benchmark is particularly heavy, as the library needs to generate and compile a GPU shader. Further invocations use the cached data and are much faster (going down from over 1 s to about 24 ms in this benchmark). So I guess running the benchmark at least once in the [GlobalSetup] helps BenchmarkDotNet ignore that first big outlier while running the actual benchmarks. I still don't understand, though:

  • Why is the memory reporting apparently incorrect and not in line with VS' memory profiler? That is, even when I have [GlobalSetup] do a whole bunch of warmup iterations and GC collections, for good measure. What are those 45 B?
  • Why does the reported memory usage seem influenced by what I do in [GlobalSetup]?

Thanks! 😄

@timcassell (Collaborator)

Ah, I see you are using .NET 5.0. See #1543 and dotnet/runtime#45446

Try running it in .NET Framework and CoreRt to see what results those runtimes report.

@Sergio0694 (Author)

Hey @timcassell - thank you for chiming in!
Unfortunately my library only targets .NET 5 as a minimum, so I won't be able to test that with .NET Framework or CoreRT 😟
I guess I can just keep doing more work manually in the benchmark setup for now and then wait for BDN 0.13.0 to be published.
Will update again once that is available! 🙂

@timcassell (Collaborator)

Closing as duplicate of #1542.

@timcassell closed this as not planned on Aug 16, 2023