Hi, I've been trying to use BenchmarkDotNet to profile the memory usage improvements in a new version of ComputeSharp, but I'm struggling to make sense of the reported memory usage. I'm wondering whether I might be doing something wrong or whether there are some issues/caveats with [MemoryDiagnoser], as the reported memory allocations seem a bit off.
All the code below and results are from the investigation/bdn branch in the ComputeSharp repo, for reference.
Repro steps
Add the CI nuget.config file for BenchmarkDotNet as explained in this comment
Clone the repo and check out investigation/bdn
Build ComputeSharp.Benchmark in Release, run the benchmark as usual with dotnet ComputeSharp.Benchmark
Details
Running that benchmark gives me the following:
Benchmark results (click to expand):
// * Detailed results *
DnnBenchmark.GpuWithNoTemporaryBuffers: Job-TWEPVA(Toolchain=5.0)
Runtime = .NET 5.0.0 (5.0.20.51904), X64 RyuJIT; GC = Concurrent Workstation
Mean = 24.785 ms, StdErr = 0.005 ms (0.02%), N = 13, StdDev = 0.019 ms
Min = 24.739 ms, Q1 = 24.777 ms, Median = 24.789 ms, Q3 = 24.794 ms, Max = 24.820 ms
IQR = 0.018 ms, LowerFence = 24.750 ms, UpperFence = 24.821 ms
ConfidenceInterval = [24.763 ms; 24.808 ms] (CI 99.9%), Margin = 0.023 ms (0.09% of Mean)
Skewness = -0.63, Kurtosis = 3.75, MValue = 2
-------------------- Histogram --------------------
[24.728 ms ; 24.831 ms) | @@@@@@@@@@@@@
---------------------------------------------------
// * Summary *
BenchmarkDotNet=v0.12.1.1466-nightly, OS=Windows 10.0.19041.610 (2004/May2020Update/20H1)
AMD Ryzen 7 2700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=5.0.100
[Host] : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
Job-TWEPVA : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
Toolchain=5.0
| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------------------------- |---------:|---------:|---------:|------:|------:|------:|----------:|
| GpuWithNoTemporaryBuffers | 24.79 ms | 0.023 ms | 0.019 ms | - | - | - | 45 B |
I was confused about those 45 B of allocations (after the initial warmup, the benchmark should do no allocations, in theory). So I ran the VS memory profiler to have a look (just uncomment that #define PROFILER in the main file of the benchmark project).
With that, I got the following:
To double-check, I also used dotMemory (click to expand):
VS reports no allocations at all while running the benchmark code in a loop, so I'm very confused about those 45 B reported by BDN. I know that the [MemoryDiagnoser] has a reported accuracy of 99.5%, but I figured the difference between no allocations at all, and 45 bytes, could be considered not within that threshold? 🤔
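As a third cross-check that does not depend on any profiler, the allocations of the benchmark body can be counted directly with the runtime's own counters. The following is a minimal sketch, assuming the benchmark body can be wrapped in an `Action`; `AllocationCheck` and `MeasureBytesPerCall` are hypothetical names, and this is not necessarily how [MemoryDiagnoser] measures internally.

```csharp
using System;

public static class AllocationCheck
{
    // Returns the average number of managed bytes allocated on the
    // current thread per call to `action`, after one warmup call.
    public static double MeasureBytesPerCall(Action action, int iterations)
    {
        // One untimed call so first-run initialization is not counted.
        action();

        long before = GC.GetAllocatedBytesForCurrentThread();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        long after = GC.GetAllocatedBytesForCurrentThread();

        return (double)(after - before) / iterations;
    }

    public static void Main()
    {
        // Hypothetical stand-in for the real benchmark body.
        double perCall = MeasureBytesPerCall(() => { }, 1_000);
        Console.WriteLine($"{perCall} B/call");
    }
}
```

Note that `GC.GetAllocatedBytesForCurrentThread` only sees allocations made on the calling thread. If the number reported by BDN includes allocations from other threads, `GC.GetTotalAllocatedBytes(true)` would be the process-wide counter to compare against.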
Additional info
If you look at my [GlobalSetup] method (specifically, these lines), you will see I'm manually doing a whole bunch of warmup iterations and GC collections from there. Turns out that:
1. If I remove those warmup iterations completely and let BenchmarkDotNet handle that, the benchmark results go completely off the rails, and I get 1 KB of reported memory allocations (?!).
2. If I only do a single benchmark invocation as warmup (without that loop and without also calling GC.Collect), the reported allocations are more in line with the results above, but still a bit worse (58 B instead of 45).
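For readers without the branch at hand, the manual warmup described above has roughly the following shape. This is only a sketch: `ManualWarmup.Run` and the iteration count are placeholders rather than the actual code from the repo, and in the benchmark class it would be called from the method marked [GlobalSetup].

```csharp
using System;

public static class ManualWarmup
{
    // Invokes `benchmark` a fixed number of times, then forces a full
    // collection so leftovers from warmup are gone before measuring.
    public static void Run(Action benchmark, int iterations)
    {
        for (int i = 0; i < iterations; i++)
        {
            benchmark();
        }

        // Collect, let pending finalizers run, then collect whatever
        // those finalizers released.
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
    }
}
```

Point 1 above corresponds to not calling this at all, and point 2 to a single `benchmark()` call with no loop and no GC.Collect.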
Benchmark results for point 1. (click to expand):
// * Detailed results *
DnnBenchmark.GpuWithNoTemporaryBuffers: Job-HQFLUQ(Toolchain=5.0)
Runtime = .NET 5.0.0 (5.0.20.51904), X64 RyuJIT; GC = Concurrent Workstation
Mean = 18.801 ms, StdErr = 0.034 ms (0.18%), N = 15, StdDev = 0.133 ms
Min = 18.618 ms, Q1 = 18.725 ms, Median = 18.746 ms, Q3 = 18.893 ms, Max = 19.014 ms
IQR = 0.168 ms, LowerFence = 18.472 ms, UpperFence = 19.145 ms
ConfidenceInterval = [18.658 ms; 18.943 ms] (CI 99.9%), Margin = 0.143 ms (0.76% of Mean)
Skewness = 0.38, Kurtosis = 1.69, MValue = 2
-------------------- Histogram --------------------
[18.547 ms ; 19.085 ms) | @@@@@@@@@@@@@@@
---------------------------------------------------
// * Summary *
BenchmarkDotNet=v0.12.1.1466-nightly, OS=Windows 10.0.19041.610 (2004/May2020Update/20H1)
AMD Ryzen 7 2700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=5.0.100
[Host] : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
Job-HQFLUQ : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
Toolchain=5.0
| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------------------------- |---------:|---------:|---------:|------:|------:|------:|----------:|
| GpuWithNoTemporaryBuffers | 18.80 ms | 0.143 ms | 0.133 ms | - | - | - | 1 KB |
Benchmark results for point 2. (click to expand):
// * Summary *
BenchmarkDotNet=v0.12.1.1466-nightly, OS=Windows 10.0.19041.610 (2004/May2020Update/20H1)
AMD Ryzen 7 2700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=5.0.100
[Host] : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
Job-OZIUOW : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
Toolchain=5.0
| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------------------------- |---------:|---------:|---------:|------:|------:|------:|----------:|
| GpuWithNoTemporaryBuffers | 24.82 ms | 0.068 ms | 0.063 ms | - | - | - | 58 B |
I should note that my initial run of a benchmark is particularly heavy, as the library needs to generate and compile a GPU shader. Further invocations will use the cached data and will be much faster (as in, going down from over 1 s to about 24 ms in this benchmark). So I guess running the benchmark at least once in the [GlobalSetup] helps BenchmarkDotNet ignore that first big outlier while running the actual benchmarks. What I still don't get, though:
Why is the memory reporting apparently incorrect and not in line with VS' memory profiler? That is, even when I have [GlobalSetup] do a whole bunch of warmup iterations and GC collections, for good measure. What are those 45 B?
Why does the reported memory usage seem influenced by what I do in [GlobalSetup]?
Thanks! 😄
Hey @timcassell - thank you for chiming in!
Unfortunately my library only targets .NET 5 as a minimum, so I won't be able to test that with .NET Framework or CoreRT 😟
I guess I can just keep doing more work manually in the benchmark setup for now and then wait for BDN 1.3.0 to be published.
Will update again once that is available! 🙂