
JMH Benchmarks added. #178

Merged
merged 5 commits into aws:main on Sep 12, 2024

Conversation

svladykin
Contributor

Description of changes:

Added JMH benchmarks to get more reliable performance data. Since the performance numbers are not stable across runs, I had to do 5 forks per benchmark. Total run time should be under 4 hours (GitHub Actions allows jobs to run for up to 6 hours: https://docs.github.com/en/actions/administering-github-actions/usage-limits-billing-and-administration#usage-limits).

Benchmark / Performance (for source code changes):

<replace this with output from /src/test/software/amazon/event/ruler/Benchmarks.java here.

The benchmark results can be fetched from "Pull request checks -> Java build -> build (ubuntu-X.Y, 8) -> Run benchmarks".>

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@timbray
Collaborator

timbray commented Aug 27, 2024

Benchmarks that run for hours to characterize an operation that typically takes much less than 1 millisecond seem surprising. When I'm doing optimization work I want benchmarks that complete in a few seconds to facilitate iterative development. Also, these runs will be expensive, no?

Is the source of the reported benchmarking instability well understood? I'm just wondering if there's a cheaper approach.

@svladykin
Contributor Author

@timbray
Here we actually benchmark against the citylots2 data set (it is good to use real-world data), so a single run through the whole data set (200k+ events) is not 1 millisecond, it is around 1+ second. Maybe it makes sense to reduce this data set (we would need a way to do that without reducing data quality much) to get smaller individual runs, but I don't think this will help reduce variability across forks.

Not sure what you mean by "expensive"; GitHub Actions are free for open source projects.

These benchmarks are more about having a stable baseline for overall library performance; they are not supposed to be run after every code change in your editor. If you work on a specific improvement you should have a separate, smaller benchmark for it so you can iterate quickly. Or just pick one from citylots2, no need to run all of them during development.

The source of the instability is not known as of now (I only started diving deep into this code). Fixing it should allow reducing the run time noticeably, but it will not be seconds anyway.

@timbray
Collaborator

timbray commented Aug 27, 2024

Anything that takes 4hrs to run is "expensive" by some measure. But OK.

The fact that the benchmarking results aren't stable is worrying, because it suggests that Ruler's users could experience that lack of stability in production, which could be a real problem in some applications. My first guess would be some JVM memory issue? Anyhow, I think it's probably important to understand.

Welcome to the Ruler world!

@svladykin
Contributor Author

This instability is worrying to me too: from what I have seen, the numbers are sometimes more stable within a fork than between forks, but the difference is usually not high enough (maybe below 5%) to consider it dangerous. And that 5% is what I see in my JMH benchmarks with proper warmup and measurement time, while the current single-shot-style Benchmarks run can easily show a 10% difference for a given benchmark (probably when it hits an unlucky GC pause?).

I agree that GC could probably be the issue here since the code does lots of allocations, but maybe there are other sources of nondeterminism as well.

@baldawar
Collaborator

@svladykin Is it possible to check what the long pole is for the test runs? Wondering if most of the latency right now is reading/parsing the citylots JSON into memory independently for each test run.

@svladykin
Contributor Author

@baldawar
It runs exactly as configured:

14 simple tests * 5 forks * (1 minute warmup + 1 minute measurement) = 140 minutes
2 combo tests * 5 forks * (2 minutes warmup + 5 minutes measurement) = 70 minutes
Total 210 minutes.

@State(Scope.Benchmark) is used for CityLots2State, which means it is created only once, so the citylots data is read only once per fork.

Also, I checked the throughput numbers for sanity and they are better than in the Benchmarks test because of warmup. This looks correct to me.

This long run time is not a bug, it is a feature to get statistically meaningful results. Btw, if you open the JMH logs now you can see how different the numbers are between forks of the same benchmark, while they are much closer within a fork. I think this should be investigated. If we fix it, we can probably reduce the number of forks.
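For context, here is a minimal sketch of the JMH state pattern described above (it assumes the JMH dependency on the classpath; the class name, field, setup body, and file path are illustrative, not the actual CityLots2State code):

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

// Scope.Benchmark: JMH creates this object once per fork and shares it across
// all benchmark iterations in that fork, so the data set is parsed only once.
@State(Scope.Benchmark)
public class CityLots2StateSketch {
    List<String> events;

    @Setup(Level.Trial) // runs once per fork, before any measurement iteration
    public void load() throws Exception {
        events = Files.readAllLines(Paths.get("citylots2.json")); // path is illustrative
    }
}
```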

@svladykin
Contributor Author

This is what the JMH report looks like. The :events metric shows the same throughput as the original Benchmarks, while the general metric shows how many citylots data sets per second we process:

Benchmark                                                   (benchmark)   Mode  Cnt       Score      Error  Units
CityLots2Benchmarks.run01Simple                                   EXACT  thrpt   30       1.368 ±    0.019  ops/s
CityLots2Benchmarks.run01Simple:events                            EXACT  thrpt   30  291478.546 ± 4091.946  ops/s
CityLots2Benchmarks.run01Simple                                WILDCARD  thrpt   30       1.092 ±    0.009  ops/s
CityLots2Benchmarks.run01Simple:events                         WILDCARD  thrpt   30  232660.693 ± 1890.315  ops/s
CityLots2Benchmarks.run01Simple                                  PREFIX  thrpt   30       1.348 ±    0.017  ops/s
CityLots2Benchmarks.run01Simple:events                           PREFIX  thrpt   30  287140.384 ± 3622.276  ops/s
CityLots2Benchmarks.run01Simple         PREFIX_EQUALS_IGNORE_CASE_RULES  thrpt   30       1.367 ±    0.014  ops/s
CityLots2Benchmarks.run01Simple:events  PREFIX_EQUALS_IGNORE_CASE_RULES  thrpt   30  291182.920 ± 2926.339  ops/s
CityLots2Benchmarks.run01Simple                                  SUFFIX  thrpt   30       1.324 ±    0.018  ops/s
CityLots2Benchmarks.run01Simple:events                           SUFFIX  thrpt   30  282113.057 ± 3856.940  ops/s
CityLots2Benchmarks.run01Simple         SUFFIX_EQUALS_IGNORE_CASE_RULES  thrpt   30       1.340 ±    0.015  ops/s
CityLots2Benchmarks.run01Simple:events  SUFFIX_EQUALS_IGNORE_CASE_RULES  thrpt   30  285544.405 ± 3240.952  ops/s
CityLots2Benchmarks.run01Simple                      EQUALS_IGNORE_CASE  thrpt   30       1.183 ±    0.018  ops/s
CityLots2Benchmarks.run01Simple:events               EQUALS_IGNORE_CASE  thrpt   30  252047.025 ± 3882.145  ops/s
CityLots2Benchmarks.run01Simple                                 NUMERIC  thrpt   30       0.806 ±    0.005  ops/s
CityLots2Benchmarks.run01Simple:events                          NUMERIC  thrpt   30  171728.258 ± 1147.665  ops/s
CityLots2Benchmarks.run01Simple                            ANYTHING-BUT  thrpt   30       0.781 ±    0.007  ops/s
CityLots2Benchmarks.run01Simple:events                     ANYTHING-BUT  thrpt   30  166389.360 ± 1430.806  ops/s
CityLots2Benchmarks.run01Simple                ANYTHING-BUT-IGNORE-CASE  thrpt   30       0.783 ±    0.010  ops/s
CityLots2Benchmarks.run01Simple:events         ANYTHING-BUT-IGNORE-CASE  thrpt   30  166867.515 ± 2077.023  ops/s
CityLots2Benchmarks.run01Simple                     ANYTHING-BUT-PREFIX  thrpt   30       0.777 ±    0.004  ops/s
CityLots2Benchmarks.run01Simple:events              ANYTHING-BUT-PREFIX  thrpt   30  165644.385 ±  860.477  ops/s
CityLots2Benchmarks.run01Simple                     ANYTHING-BUT-SUFFIX  thrpt   30       0.779 ±    0.003  ops/s
CityLots2Benchmarks.run01Simple:events              ANYTHING-BUT-SUFFIX  thrpt   30  166043.650 ±  729.597  ops/s
CityLots2Benchmarks.run01Simple                   ANYTHING-BUT-WILDCARD  thrpt   30       0.920 ±    0.007  ops/s
CityLots2Benchmarks.run01Simple:events            ANYTHING-BUT-WILDCARD  thrpt   30  196123.968 ± 1476.610  ops/s
CityLots2Benchmarks.run01Simple                          COMPLEX_ARRAYS  thrpt   30       0.189 ±    0.001  ops/s
CityLots2Benchmarks.run01Simple:events                   COMPLEX_ARRAYS  thrpt   30   40340.070 ±  262.628  ops/s
CityLots2Benchmarks.run02Combo                            PARTIAL_COMBO  thrpt   25       0.271 ±    0.003  ops/s
CityLots2Benchmarks.run02Combo:events                     PARTIAL_COMBO  thrpt   25   57690.763 ±  604.684  ops/s
CityLots2Benchmarks.run02Combo                                    COMBO  thrpt   25       0.106 ±    0.001  ops/s
CityLots2Benchmarks.run02Combo:events                             COMBO  thrpt   25   22570.994 ±  178.067  ops/s

@svladykin
Contributor Author

Here is a good example of in-fork vs. across-fork numbers for the same benchmark (EXACT on Java 8).
Iterations within a fork are pretty stable, but there is a 5% difference between forks:

# Fork: 1 of 5
Iteration   1: 1.406 ops/s
                 events: 299529.567 ops/s

Iteration   2: 1.408 ops/s
                 events: 299953.293 ops/s

Iteration   3: 1.405 ops/s
                 events: 299258.836 ops/s

Iteration   4: 1.407 ops/s
                 events: 299701.398 ops/s

Iteration   5: 1.405 ops/s
                 events: 299436.825 ops/s

Iteration   6: 1.406 ops/s
                 events: 299521.963 ops/s

# Fork: 5 of 5
Iteration   1: 1.325 ops/s
                 events: 282346.065 ops/s

Iteration   2: 1.335 ops/s
                 events: 284377.792 ops/s

Iteration   3: 1.335 ops/s
                 events: 284347.654 ops/s

Iteration   4: 1.335 ops/s
                 events: 284399.564 ops/s

Iteration   5: 1.337 ops/s
                 events: 284815.593 ops/s

Iteration   6: 1.330 ops/s
                 events: 283277.466 ops/s

@timbray
Collaborator

timbray commented Aug 28, 2024

@svladykin What's that running on, your desktop? EC2 instance? (Trying to think of reasons why forks should differ in performance, coming up empty so far.)

@svladykin
Contributor Author

@timbray It is from the GitHub Actions run for this PR: https://github.com/aws/event-ruler/pull/178/checks#step:6:1019
I can see similar behavior on my own Linux desktop as well. It makes sense to try reproducing it on arm64 macOS to make sure it is not a platform-specific issue.

It is hard to say why this happens without investigation. I'm not yet an expert in this library, and maybe there is some actually non-deterministic code in there. But maybe it is different memory layout, inconsistent GC activity, or JIT compilation. At least we can see that it happens on all tested Java versions.

@baldawar
Collaborator

You can also add -prof or -lprof to see what's happening underneath: https://github.com/openjdk/jmh/blob/master/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_35_Profilers.java#L125
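For instance, profilers can be attached from the JMH command line like so (the jar path and benchmark filter here are assumptions for illustration, not from this PR):

```shell
# List the profilers usable on this machine
java -jar target/benchmarks.jar -lprof

# Attach the GC profiler: per-benchmark allocation rate and GC pause stats
java -jar target/benchmarks.jar CityLots2 -prof gc

# Simple Java stack sampling to see where time is spent
java -jar target/benchmarks.jar CityLots2 -prof stack
```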

Also, in case it helps: a long time ago when I looked into JMH benchmarks for Ruler, I bookmarked this paper to watch out for caveats with using JMH: http://asgaard.ece.ualberta.ca/papers/Journal/TSE_2019_Costa_Whats_Wrong_With_My_Benchmark_Results_Studying_Bad_Practices_in_JMH_Benchmarks.pdf

@svladykin
Contributor Author

@baldawar Thank you for the pointers, I'm aware of the profilers and these common pitfalls.

@svladykin
Contributor Author

Did some JVM tuning to reduce variability: the major source was JIT compilation of the Jackson code. I did not have time to dive deeper and find out which exact place (or places?) in the Jackson code causes this, so I disabled inlining for Jackson to keep the JIT from performing overly aggressive optimizations on it. This lowers overall performance a little, but we should not care: the goal of these benchmarks is to measure event-ruler throughput, not Jackson.

Reduced the benchmark time given the improved stability: it should now finish in under 80 minutes.
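A sketch of the kind of JVM option involved; the exact flag and Jackson pattern the PR uses are not shown in this thread, so this is an assumption for illustration:

```shell
# -XX:CompileCommand=dontinline forbids the JIT from inlining matching methods,
# so Jackson's inlining decisions stop shifting the measured Ruler numbers.
java -XX:CompileCommand=dontinline,com.fasterxml.jackson.*::* -jar target/benchmarks.jar
```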

Also, improved the output format:

Benchmark                                                  (benchmark)   Mode  Cnt       Score      Error  Units
CityLots2JmhBenchmarks.group01Simple                             EXACT  thrpt   10  215826.056 ± 1685.682  ops/s
CityLots2JmhBenchmarks.group01Simple                          WILDCARD  thrpt   10  174364.846 ± 1590.776  ops/s
CityLots2JmhBenchmarks.group01Simple                            PREFIX  thrpt   10  215057.911 ±  764.390  ops/s
CityLots2JmhBenchmarks.group01Simple   PREFIX_EQUALS_IGNORE_CASE_RULES  thrpt   10  218052.163 ± 1815.789  ops/s
CityLots2JmhBenchmarks.group01Simple                            SUFFIX  thrpt   10  218279.258 ± 2020.594  ops/s
CityLots2JmhBenchmarks.group01Simple   SUFFIX_EQUALS_IGNORE_CASE_RULES  thrpt   10  217354.997 ± 2662.748  ops/s
CityLots2JmhBenchmarks.group01Simple                EQUALS_IGNORE_CASE  thrpt   10  198260.913 ± 2161.272  ops/s
CityLots2JmhBenchmarks.group01Simple                           NUMERIC  thrpt   10  143255.971 ± 2079.436  ops/s
CityLots2JmhBenchmarks.group01Simple                      ANYTHING-BUT  thrpt   10  143463.276 ± 1444.423  ops/s
CityLots2JmhBenchmarks.group01Simple          ANYTHING-BUT-IGNORE-CASE  thrpt   10  138094.242 ±  748.715  ops/s
CityLots2JmhBenchmarks.group01Simple               ANYTHING-BUT-PREFIX  thrpt   10  139181.706 ± 1249.204  ops/s
CityLots2JmhBenchmarks.group01Simple               ANYTHING-BUT-SUFFIX  thrpt   10  142143.993 ± 1257.279  ops/s
CityLots2JmhBenchmarks.group01Simple             ANYTHING-BUT-WILDCARD  thrpt   10  156440.792 ± 2748.445  ops/s
CityLots2JmhBenchmarks.group02Complex                   COMPLEX_ARRAYS  thrpt    6   37028.629 ± 1152.132  ops/s
CityLots2JmhBenchmarks.group02Complex                    PARTIAL_COMBO  thrpt    6   52485.597 ± 1468.224  ops/s
CityLots2JmhBenchmarks.group02Complex                            COMBO  thrpt    6   20551.057 ±   65.786  ops/s

@svladykin
Contributor Author

Also, in CI I switched to Corretto and added Java 21.

@baldawar
Collaborator

baldawar commented Sep 9, 2024

There's a fair bit of difference in performance between your last results and #178 (comment). Is that because both are on two different machines?

@svladykin
Contributor Author

Yes, these numbers are from a different box, but we should expect them to be different anyway because of disabling inlining for Jackson. I think that is OK: we should not focus on the absolute numbers, since they will differ across hardware anyway; instead we should focus on being able to improve, or at least not lose, the relative throughput numbers before and after any code change in event-ruler.

The new numbers from GitHub Actions for Java 8 are also lower:

Benchmark                                                  (benchmark)   Mode  Cnt       Score      Error  Units
CityLots2JmhBenchmarks.group01Simple                             EXACT  thrpt   10  216105.483 ± 1130.202  ops/s
CityLots2JmhBenchmarks.group01Simple                          WILDCARD  thrpt   10  184305.512 ± 2634.162  ops/s
CityLots2JmhBenchmarks.group01Simple                            PREFIX  thrpt   10  224162.293 ± 5989.259  ops/s
CityLots2JmhBenchmarks.group01Simple   PREFIX_EQUALS_IGNORE_CASE_RULES  thrpt   10  222526.705 ±  915.481  ops/s
CityLots2JmhBenchmarks.group01Simple                            SUFFIX  thrpt   10  204310.465 ± 4760.335  ops/s
CityLots2JmhBenchmarks.group01Simple   SUFFIX_EQUALS_IGNORE_CASE_RULES  thrpt   10  221633.446 ±  495.026  ops/s
CityLots2JmhBenchmarks.group01Simple                EQUALS_IGNORE_CASE  thrpt   10  200930.707 ±  988.268  ops/s
CityLots2JmhBenchmarks.group01Simple                           NUMERIC  thrpt   10  147397.069 ±  817.688  ops/s
CityLots2JmhBenchmarks.group01Simple                      ANYTHING-BUT  thrpt   10  143008.115 ± 5017.983  ops/s
CityLots2JmhBenchmarks.group01Simple          ANYTHING-BUT-IGNORE-CASE  thrpt   10  143384.276 ±  706.014  ops/s
CityLots2JmhBenchmarks.group01Simple               ANYTHING-BUT-PREFIX  thrpt   10  149053.190 ± 2013.433  ops/s
CityLots2JmhBenchmarks.group01Simple               ANYTHING-BUT-SUFFIX  thrpt   10  139416.635 ± 3799.945  ops/s
CityLots2JmhBenchmarks.group01Simple             ANYTHING-BUT-WILDCARD  thrpt   10  159613.253 ± 1039.436  ops/s
CityLots2JmhBenchmarks.group02Complex                   COMPLEX_ARRAYS  thrpt    6   37657.915 ±  700.029  ops/s
CityLots2JmhBenchmarks.group02Complex                    PARTIAL_COMBO  thrpt    6   53417.981 ±  237.175  ops/s
CityLots2JmhBenchmarks.group02Complex                            COMBO  thrpt    6   21093.577 ±  474.944  ops/s

@baldawar
Collaborator

baldawar commented Sep 10, 2024

Interesting, so inlining for Jackson makes a 10-20% performance difference. I'd take consistent numbers for the tests, but still good to know. Skimming the PR, everything looks good. Will do a second review before merging.

Sidenote: if it's easy for you to set up, it would be great to publish the results as a comment on the PR. I was looking at https://github.com/benchmark-action/github-action-benchmark but there's a non-zero amount of work to set it up, so I planned on doing it later (unless you get to it first).

runs-on: ${{ matrix.os }}
steps:
  - uses: actions/checkout@v4
  - name: Set up JDK ${{ matrix.java }}
    uses: actions/setup-java@v4
    with:
      distribution: 'corretto'   # was 'temurin'
      java-version: ${{ matrix.java }}
      cache: 'maven'
  - name: Verify with Maven
    run: mvn --batch-mode --errors --update-snapshots verify
  - name: Run benchmarks
Collaborator

This is kept because we haven't moved everything yet?

Contributor Author

I have ported everything from CL2Benchmark but not CL2NoCompileBenchmark. Not sure if it is actually an important use case.

Collaborator

They are important, so I'm happy to keep the "Run Benchmarks" step until we have tests that check rule building, memory usage, et al.


private void addRules(String... rules) throws Exception {
    for (String rule : rules) {
        String rname = String.format("r%d", ruleCount++);
        // ...
    }
}
Collaborator

From my experience, "r" + ruleCount would be much faster and allow tests to wrap up sooner.

Contributor Author

I copy-pasted this code. This advice makes sense for cases where we generate rule names all the time, but here we do that just a few times during setup at the beginning and then run a benchmark for at least 90 seconds. Such an optimization will make zero difference.
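To illustrate the point in plain Java (class and method names are mine, not from the PR): both naming schemes produce identical rule names, so with only a handful of calls during setup the String.format overhead cannot matter.

```java
public class RuleNames {
    // Name generation as in the benchmark setup code
    static String withFormat(int i) {
        return String.format("r%d", i);
    }

    // The suggested cheaper alternative
    static String withConcat(int i) {
        return "r" + i;
    }

    public static void main(String[] args) {
        // Both schemes agree on every name
        for (int i = 0; i < 10; i++) {
            if (!withFormat(i).equals(withConcat(i))) {
                throw new AssertionError("mismatch at " + i);
            }
        }
        System.out.println(withFormat(0)); // prints "r0"
    }
}
```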

Collaborator

I guess it depends on how long setup takes. If it's a few seconds, then not an issue. If it's minutes, then it would be worth the improvement.

Contributor Author

Yes, exactly. Here for the simple benchmarks we have under 10 rules; for the complex ones more, but still a negligible number.

Comment on lines +28 to +41
@Benchmark
@Warmup(iterations = 4, batchSize = 1, time = 10, timeUnit = SECONDS)
@Measurement(iterations = 5, batchSize = 1, time = 10, timeUnit = SECONDS)
public void group01Simple(MachineStateSimple machineState, CityLots2State cityLots2State, Blackhole blackhole) throws Exception {
run(machineState, cityLots2State, blackhole);
}

@Benchmark
@Warmup(iterations = 2, batchSize = 1, time = 60, timeUnit = SECONDS)
@Measurement(iterations = 3, batchSize = 1, time = 60, timeUnit = SECONDS)
public void group02Complex(MachineStateComplex machineState, CityLots2State cityLots2State, Blackhole blackhole) throws Exception {
run(machineState, cityLots2State, blackhole);
}

Collaborator

Is it worth splitting this by each matcher to parallelize the runs and reduce the runtime even more?

Collaborator

If you go this route, you would be able to get away with building a HashMap of rule machines during setup() and then passing the machine during load tests. It'll reduce the number of scaffolding classes for benchmarks.

Contributor Author

This grouping is not done for parallelism; parallelizing these benchmarks on the same box is a bad idea because the results will be noisy.

The reason for this grouping is to have different setups for different benchmarks: for the simple ones it is 40 seconds warmup + 50 seconds measurement, for the complex ones it is 2 minutes warmup + 3 minutes measurement, just because every call to a complex benchmark takes much longer.

We could split across multiple boxes for parallelism, but I don't think it is worth the effort and probably would be harder to collect results.

What we actually want to do regarding parallelism is to run every benchmark with 2 threads (and check that throughput is ~2x of 1 thread) to make sure that we don't have any contention or false sharing. Right now these benchmarks are single-threaded, and thus we don't actually know whether we have an issue. I will push this change.
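A sketch of what that could look like (assumes JMH; the method name is illustrative and this is not code from the PR):

```java
// @Threads(2) runs the same benchmark body on two threads against the shared
// Scope.Benchmark state; throughput near 2x the single-threaded number would
// suggest no contention or false sharing on the hot path.
@Benchmark
@Threads(2)
public void group01SimpleTwoThreads(MachineStateSimple machineState,
                                    CityLots2State cityLots2State,
                                    Blackhole blackhole) throws Exception {
    run(machineState, cityLots2State, blackhole);
}
```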

@svladykin
Contributor Author

I can look into benchmark actions, but I guess it makes sense to do this separately.

@baldawar
Collaborator

It's a nice-to-have. If it's a lot of work, then it's not worth clubbing it with this PR. I'd favor merging this in and getting to that sometime in the future.

@svladykin
Contributor Author

It should not be too much work, but I see that it requires creating a separate branch to store the results, so it is going to require multiple PRs anyway.

@baldawar baldawar merged commit c831e62 into aws:main Sep 12, 2024
4 checks passed