Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent results with MemoryDiagnoser between identical runs with Server GC #1913

Closed
cmeeren opened this issue Feb 3, 2022 · 14 comments
Closed

Comments

@cmeeren
Copy link

cmeeren commented Feb 3, 2022

The following two results are from two consecutive runs of the exact same code with the exact same BenchmarkDotNet configuration. Notice the differences in Gen 0 (1000 vs 2000 collects per 1k ops) and Gen 1 (0 vs 1000 collects per 1k ops).

I have noticed this a bit in my testing. Is this to be expected? What does it mean? How can I use MemoryDiagnoser effectively to get information (e.g. about GC pressure, I think; note that I know very little about GC) when the results seem to vary so significantly between runs?

(Note that this is from a complex project; unfortunately I don't have a minimal repro.)

It's also interesting to note the differences in the histograms between the runs, though I have no idea if that is relevant here.

Edit: Is it relevant that I am using server GC? Job.Default.WithGcServer(true)

Run 1

// * Detailed results *
Benchmark.GetCollection: Job-QFDCIT(Server=True) [NRes=10, NInc=5]
Runtime = .NET 6.0.1 (6.0.121.56705), X64 RyuJIT; GC = Concurrent Server
Mean = 396.516 μs, StdErr = 1.236 μs (0.31%), N = 14, StdDev = 4.623 μs
Min = 391.917 μs, Q1 = 393.255 μs, Median = 394.960 μs, Q3 = 397.735 μs, Max = 406.730 μs
IQR = 4.480 μs, LowerFence = 386.536 μs, UpperFence = 404.455 μs
ConfidenceInterval = [391.301 μs; 401.731 μs] (CI 99.9%), Margin = 5.215 μs (1.32% of Mean)
Skewness = 0.95, Kurtosis = 2.6, MValue = 2
-------------------- Histogram --------------------
[390.455 μs ; 409.248 μs) | @@@@@@@@@@@@@@
---------------------------------------------------

Benchmark.GetCollection: Job-QFDCIT(Server=True) [NRes=10000, NInc=5]
Runtime = .NET 6.0.1 (6.0.121.56705), X64 RyuJIT; GC = Concurrent Server
Mean = 216.954 ms, StdErr = 1.224 ms (0.56%), N = 46, StdDev = 8.301 ms
Min = 201.155 ms, Q1 = 211.229 ms, Median = 214.586 ms, Q3 = 224.154 ms, Max = 232.700 ms
IQR = 12.925 ms, LowerFence = 191.842 ms, UpperFence = 243.541 ms
ConfidenceInterval = [212.645 ms; 221.262 ms] (CI 99.9%), Margin = 4.308 ms (1.99% of Mean)
Skewness = 0.08, Kurtosis = 1.95, MValue = 2.95
-------------------- Histogram --------------------
[200.762 ms ; 206.843 ms) | @@@@@@
[206.843 ms ; 216.326 ms) | @@@@@@@@@@@@@@@@@@@
[216.326 ms ; 221.689 ms) | @@@@
[221.689 ms ; 227.770 ms) | @@@@@@@@@@@@@
[227.770 ms ; 234.152 ms) | @@@@
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1466 (21H1/May2021Update)
Intel Xeon W-2133 CPU 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=6.0.101
  [Host]     : .NET 6.0.1 (6.0.121.56705), X64 RyuJIT DEBUG
  Job-QFDCIT : .NET 6.0.1 (6.0.121.56705), X64 RyuJIT

Server=True  

|        Method |  NRes | NInc |         Mean |       Error |      StdDev |     Gen 0 |     Gen 1 |  Allocated |
|-------------- |------ |----- |-------------:|------------:|------------:|----------:|----------:|-----------:|
| GetCollection |    10 |    5 |     396.5 μs |     5.21 μs |     4.62 μs |   10.7422 |         - |     645 KB |
| GetCollection | 10000 |    5 | 216,953.5 μs | 4,308.46 μs | 8,300.94 μs | 2000.0000 | 1000.0000 | 579,511 KB |

// * Warnings *
MultimodalDistribution
  Benchmark.GetCollection: Server=True -> It seems that the distribution can have several modes (mValue = 2.95)

// * Hints *
Outliers
  Benchmark.GetCollection: Server=True -> 1 outlier  was  removed (426.98 μs)
  Benchmark.GetCollection: Server=True -> 1 outlier  was  removed (244.82 ms)

Run 2

// * Detailed results *
Benchmark.GetCollection: Job-RLRTSB(Server=True) [NRes=10, NInc=5]
Runtime = .NET 6.0.1 (6.0.121.56705), X64 RyuJIT; GC = Concurrent Server
Mean = 397.049 μs, StdErr = 0.607 μs (0.15%), N = 13, StdDev = 2.187 μs
Min = 392.965 μs, Q1 = 396.024 μs, Median = 397.393 μs, Q3 = 398.576 μs, Max = 400.836 μs
IQR = 2.552 μs, LowerFence = 392.196 μs, UpperFence = 402.404 μs
ConfidenceInterval = [394.429 μs; 399.668 μs] (CI 99.9%), Margin = 2.620 μs (0.66% of Mean)
Skewness = -0.26, Kurtosis = 2.09, MValue = 2
-------------------- Histogram --------------------
[391.744 μs ; 402.057 μs) | @@@@@@@@@@@@@
---------------------------------------------------

Benchmark.GetCollection: Job-RLRTSB(Server=True) [NRes=10000, NInc=5]
Runtime = .NET 6.0.1 (6.0.121.56705), X64 RyuJIT; GC = Concurrent Server
Mean = 216.102 ms, StdErr = 1.171 ms (0.54%), N = 35, StdDev = 6.929 ms
Min = 200.511 ms, Q1 = 210.869 ms, Median = 216.615 ms, Q3 = 219.552 ms, Max = 231.583 ms
IQR = 8.682 ms, LowerFence = 197.846 ms, UpperFence = 232.575 ms
ConfidenceInterval = [211.885 ms; 220.319 ms] (CI 99.9%), Margin = 4.217 ms (1.95% of Mean)
Skewness = -0.01, Kurtosis = 2.71, MValue = 2
-------------------- Histogram --------------------
[200.134 ms ; 208.088 ms) | @@@
[208.088 ms ; 214.659 ms) | @@@@@@@@@@@
[214.659 ms ; 220.219 ms) | @@@@@@@@@@@@@@
[220.219 ms ; 226.285 ms) | @@@@@
[226.285 ms ; 233.172 ms) | @@
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1466 (21H1/May2021Update)
Intel Xeon W-2133 CPU 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=6.0.101
  [Host]     : .NET 6.0.1 (6.0.121.56705), X64 RyuJIT DEBUG
  Job-RLRTSB : .NET 6.0.1 (6.0.121.56705), X64 RyuJIT

Server=True  

|        Method |  NRes | NInc |         Mean |       Error |      StdDev |     Gen 0 |  Allocated |
|-------------- |------ |----- |-------------:|------------:|------------:|----------:|-----------:|
| GetCollection |    10 |    5 |     397.0 μs |     2.62 μs |     2.19 μs |   10.7422 |     644 KB |
| GetCollection | 10000 |    5 | 216,102.0 μs | 4,217.06 μs | 6,928.75 μs | 1000.0000 | 579,424 KB |

// * Hints *
Outliers
  Benchmark.GetCollection: Server=True -> 2 outliers were removed (404.49 μs, 411.06 μs)
  Benchmark.GetCollection: Server=True -> 2 outliers were removed (233.20 ms, 234.78 ms)
@cmeeren
Copy link
Author

cmeeren commented Feb 3, 2022

In case it's relevant, here are some results without server GC. They are significantly different from either of the two runs above.

|        Method |  NRes | NInc |         Mean |       Error |      StdDev |      Gen 0 |      Gen 1 |     Gen 2 |  Allocated |
|-------------- |------ |----- |-------------:|------------:|------------:|-----------:|-----------:|----------:|-----------:|
| GetCollection |    10 |    5 |     487.8 us |     2.00 us |     1.56 us |   155.2734 |    43.9453 |         - |     645 KB |
| GetCollection | 10000 |    5 | 400,496.6 us | 7,225.50 us | 9,645.84 us | 93000.0000 | 22000.0000 | 2000.0000 | 579,569 KB |

Here's a second run. Seems more consistent.

|        Method |  NRes | NInc |         Mean |       Error |      StdDev |      Gen 0 |      Gen 1 |     Gen 2 |  Allocated |
|-------------- |------ |----- |-------------:|------------:|------------:|-----------:|-----------:|----------:|-----------:|
| GetCollection |    10 |    5 |     477.7 us |     5.91 us |     4.94 us |   155.2734 |    43.9453 |         - |     645 KB |
| GetCollection | 10000 |    5 | 403,061.0 us | 7,630.40 us | 7,835.86 us | 93000.0000 | 22000.0000 | 2000.0000 | 579,436 KB |

Note also that the performance is much worse, which is why I prefer measuring with server GC (to get more accurate performance results the way the project is actually used). But if that makes the GC statistics way off, I may have to reconsider.

@cmeeren cmeeren changed the title Inconsistent results with MemoryDiagnoser between identical runs Inconsistent results with MemoryDiagnoser between identical runs with Server GC Feb 3, 2022
@timcassell
Copy link
Collaborator

#1543 Issues with memory GC measurements were tracked down to tiered JIT. Can you try disabling tiered JIT and see if you get the same results?

@cmeeren
Copy link
Author

cmeeren commented Feb 3, 2022

Sure, how do I disable it? Failed to find any info.

@cmeeren
Copy link
Author

cmeeren commented Feb 3, 2022

Also, changelog for 0.13.1 (which I'm using) says that tiered JIT is disabled? https://benchmarkdotnet.org/changelog/full.html

@timcassell
Copy link
Collaborator

Hm, I just checked the PRs that are mentioned in that change, and it looks like it was only disabled for BDN's tests, not all benchmarks. https://github.com/dotnet/BenchmarkDotNet/pull/1747/files You should be able to make the same change in your project to disable it.

@adamsitnik Is this correct, or am I missing something?

@cmeeren
Copy link
Author

cmeeren commented Feb 3, 2022

Tried .WithEnvironmentVariable("COMPlus_TieredCompilation", "0"), but it didn't seem to have any impact. Still get results much closer to the OP than the follow-up comment without server GC.

@adamsitnik
Copy link
Member

Both BenchmarkDotNet and GC use some heuristics, so the results can vary over runs.

BenchmarkDotNet chooses the number of invocations per iteration and runs the benchmark until the results are stable. You can read more about it here: https://benchmarkdotnet.org/articles/guides/how-it-works.html

In the example below it chooses to run benchmark 4 times per iteration (result of the pilot stage) and to finish after 16 workload iterations:

OverheadJitting  1: 1 op, 176000.00 ns, 176.0000 us/op
WorkloadJitting  1: 1 op, 125395700.00 ns, 125.3957 ms/op

WorkloadPilot    1: 2 op, 254009600.00 ns, 127.0048 ms/op
WorkloadPilot    2: 3 op, 381563800.00 ns, 127.1879 ms/op
WorkloadPilot    3: 4 op, 528009600.00 ns, 132.0024 ms/op

WorkloadWarmup   1: 4 op, 503762300.00 ns, 125.9406 ms/op
WorkloadWarmup   2: 4 op, 513744700.00 ns, 128.4362 ms/op
WorkloadWarmup   3: 4 op, 500209300.00 ns, 125.0523 ms/op
WorkloadWarmup   4: 4 op, 501812300.00 ns, 125.4531 ms/op
WorkloadWarmup   5: 4 op, 504992900.00 ns, 126.2482 ms/op
WorkloadWarmup   6: 4 op, 511563800.00 ns, 127.8910 ms/op
WorkloadWarmup   7: 4 op, 499492900.00 ns, 124.8732 ms/op

// BeforeActualRun
WorkloadActual   1: 4 op, 516919300.00 ns, 129.2298 ms/op
WorkloadActual   2: 4 op, 498021500.00 ns, 124.5054 ms/op
WorkloadActual   3: 4 op, 491429500.00 ns, 122.8574 ms/op
WorkloadActual   4: 4 op, 517756500.00 ns, 129.4391 ms/op
WorkloadActual   5: 4 op, 493681700.00 ns, 123.4204 ms/op
WorkloadActual   6: 4 op, 495601700.00 ns, 123.9004 ms/op
WorkloadActual   7: 4 op, 518656500.00 ns, 129.6641 ms/op
WorkloadActual   8: 4 op, 521771700.00 ns, 130.4429 ms/op
WorkloadActual   9: 4 op, 504392100.00 ns, 126.0980 ms/op
WorkloadActual  10: 4 op, 510314600.00 ns, 127.5786 ms/op
WorkloadActual  11: 4 op, 503138300.00 ns, 125.7846 ms/op
WorkloadActual  12: 4 op, 500533300.00 ns, 125.1333 ms/op
WorkloadActual  13: 4 op, 504635800.00 ns, 126.1589 ms/op
WorkloadActual  14: 4 op, 506419900.00 ns, 126.6050 ms/op
WorkloadActual  15: 4 op, 504396800.00 ns, 126.0992 ms/op
WorkloadActual  16: 4 op, 499823400.00 ns, 124.9558 ms/op

After all of that, it runs an additional iteration to get the GC stats.

So between two different runs, BDN might perform a different amount of work and hence allocate more/less before it asks GC for the stats. This is important as GC has it's own set of heuristics that determine generation budget sizes and it may affect how frequently collections are performed.

It's also interesting to note the differences in the histograms between the runs, though I have no idea if that is relevant here.

I encourage you to write a simple Console app with Stopwatch and calls to GC API and see if the differences are similar.

@cmeeren
Copy link
Author

cmeeren commented Feb 3, 2022

Thanks, that is a helpful clarification. It answers the question I (somewhat implicitly) posed in the OP, which was "why is the GC stats slightly different between identical runs".

However, it does not address what I see turned out to be the main issue (which is evident by comparing the results in the OP with those of the subsequent comment): Using WithGcServer(true) completely messes up the "Gen X" results (they are nowhere near where they should be). Even when using WithEnvironmentVariable("COMPlus_TieredCompilation", "0").

Note that the "Allocated" results seem fine, even with tiered JIT. It's the "Gen X" results that are out of whack.

@adamsitnik
Copy link
Member

Using WithGcServer(true) completely messes up the "Gen X" results (they are nowhere near where they should be).

Please create a small repro app and run it without BenchmarkDotNet for both Server and Workstation GC. I expect that the statistics will have similar differences.

@cmeeren
Copy link
Author

cmeeren commented Feb 3, 2022

Ok, I'll try to do that tomorrow.

For clarification: Since you already expect there is a similar difference, are you asking me to repro it just for confirmation that this is indeed an area where BenchmarkDotNet can be improved? Or is it the other way around, that a confirmation would mean that BenchmarkDotNet is not at fault? (Which would confuse me.)

@adamsitnik
Copy link
Member

I am 99.9% sure that it's not BDN fault (it does always exactly the same thing no matter of GC mode). I just want to be 100% sure.

@cmeeren
Copy link
Author

cmeeren commented Feb 4, 2022

Here is a repro solution.

Test.zip

namespace Test

open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running
open BenchmarkDotNet.Jobs
open BenchmarkDotNet.Configs


[<MemoryDiagnoser>]
type Benchmark () =

  [<Benchmark>]
  member _.Bench() =
    Array.init 10000 string


module Program =

  [<EntryPoint>]
  let main _argv =
    BenchmarkRunner.Run<Benchmark>(
      DefaultConfig.Instance.AddJob(Job.Default.WithGcServer(false))
    )
    |> ignore
    0

Results when using WithGcServer(true):

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1466 (21H1/May2021Update)
Intel Xeon W-2133 CPU 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=6.0.101
  [Host]     : .NET 6.0.1 (6.0.121.56705), X64 RyuJIT DEBUG
  Job-SSPFSH : .NET 6.0.1 (6.0.121.56705), X64 RyuJIT

Server=True  

| Method |     Mean |   Error |  StdDev |  Gen 0 |  Gen 1 | Allocated |
|------- |---------:|--------:|--------:|-------:|-------:|----------:|
|  Bench | 138.3 us | 1.25 us | 1.05 us | 7.3242 | 3.6621 |    390 KB |

Results when using WithGcServer(false):

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1466 (21H1/May2021Update)
Intel Xeon W-2133 CPU 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=6.0.101
  [Host]     : .NET 6.0.1 (6.0.121.56705), X64 RyuJIT DEBUG
  Job-IQWASW : .NET 6.0.1 (6.0.121.56705), X64 RyuJIT

Server=False  

| Method |     Mean |   Error |  StdDev |   Gen 0 |   Gen 1 | Allocated |
|------- |---------:|--------:|--------:|--------:|--------:|----------:|
|  Bench | 164.9 us | 1.47 us | 1.23 us | 63.4766 | 31.7383 |    390 KB |

As you can see, the Gen X results are very different.

Some questions:

  1. Are the Gen X results generally useless when using Server GC?
  2. Is there any way BDN could support "correct" Gen X results when using Server GC?
  3. Until point 2 is supported (if ever), should I work around this by consistently disabling Server GC for GC benchmarking (the Gen X results) , and enabling it for performance benchmarking, for most accurate results?
  4. Does disabling Server GC (in BDN) have any other impacts than changing the performance results?
  5. Help me understand: Are the Gen X results at all useful? As long as I have time measurements and allocated memory, why should I even care about GC? If I optimize the code in a way that reduces GC collections, but this does not lead to a statistically significant reduction in either the time measurements or the allocated memory amount, have I really made an optimization that matters to anyone?

@adamsitnik
Copy link
Member

Are the Gen X results generally useless when using Server GC?

The Workstation and Server GC modes are just different. They use different generation budgets, different amount of GC threads etc. One of the effects of that is that they perform cleanup at different rates.

The "Pro .NET Memory Management" book has entire chapter dedicated to this subject: https://learning.oreilly.com/library/view/pro-net-memory/9781484240274/html/430794_1_En_11_Chapter.xhtml

Is there any way BDN could support "correct" Gen X results when using Server GC?

The results are correct. This is how GC works. BDN just reports the numbers provided by the GC.CollectionCount(x) API. I asked you to create a simple console app that does not use BDN at all, but just calls the benchmarked method in a loop with stopwatch and calls GC.CollectionCount(x) APIs at the end so you could observe it on your own.

Something like this:

public static void Main(string[] args)
{
    int iterationCount = int.Parse(args[0]);
    int invocationCount = int.Parse(args[1]);
    bool enforceGC = bool.Parse(args[2]); // true by default in BDN

    for (int i = 0; i < iterationCount; i++)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int j = 0; j < invocationCount; j++)
        {
            Benchmark();
        }
        stopwatch.Stop();
        Console.WriteLine(stopwatch.ElapsedMilliseconds / invocationCount);

        if (enforceGC)
        {
            GC.Collect();
            GC.WaitForPendingFinalizers();
            GC.Collect();
        }
    }

    Console.WriteLine();
    GC.Collect(); // cleanup the memory

    int gen0before = GC.CollectionCount(0);
    int gen1before = GC.CollectionCount(1);
    int gen2before = GC.CollectionCount(2);
    for (int j = 0; j < invocationCount; j++)
    {
        Benchmark();
    }
    int gen0after = GC.CollectionCount(0);
    int gen1after = GC.CollectionCount(1);
    int gen2after = GC.CollectionCount(2);

    Console.WriteLine($"GC stats: {(gen0after - gen0before) / 1000.0} {(gen1after - gen1before) / 1000.0} {(gen2after - gen2before) / 1000.0}");
}

static void Benchmark() { } // what you want to benchmark

consistently disabling Server GC for GC benchmarking (the Gen X results) , and enabling it for performance benchmarking, for most accurate results?

You should learn more about the GC flavors and choose the best for your production environment and then just keep using it for benchmarking.

Does disabling Server GC (in BDN) have any other impacts than changing the performance results?

It does in the same way as enabling Server GC for your app.

As long as I have time measurements and allocated memory, why should I even care about GC?

.NET GC uses the "stop the world" approach which means that for some parts of it's job (usually for memory compaction) it pauses all the active threads, performs some work and then re-enables the threads. This takes time which might be observed by the end user. In a desktop app it can be an unresponsive app, in a web app it can be increased latency.

Personally I always look at the Allocated column and Gen 2. If some Gen 2 collections happened, it might mean that the code is allocating large objects (> 85k) and it might be issuing Full GC.

@cmeeren
Copy link
Author

cmeeren commented Feb 4, 2022

Thank you, I understand more now.

I use Server GC because the web API framework I'm developing (and benchmarking) uses the Hopac library for lightweight parallelization and Task-like behavior, and Hopac instructs to turn this on. Benchmarking confirms that performance (time elapsed) is notably better with Server GC.

As I understand your explanation and the actual results, in my case, Server GC simply collects a lot less (as returned by CollectionCount), and BDN is reporting this correctly.

Thanks for the tips regarding usage of the memory results!

@cmeeren cmeeren closed this as completed Feb 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants