JMH Benchmarks added. #178
Conversation
Benchmarks that run for hours to characterize the performance of an operation that typically takes much less than 1 millisecond seem surprising? When I'm doing optimization work I want benchmarks that complete in a few seconds to facilitate iterative development. Also, these runs will be expensive, no? Is the source of the reported benchmarking instability well understood? I'm just wondering if there's a cheaper approach.
@timbray Not sure what you mean by "expensive", GitHub Actions are free for open source projects. These benchmarks are more about having a stable baseline for overall library performance; they are not supposed to be run after every code change in your editor. If you work on a specific improvement you should have a separate smaller benchmark for it to be able to iterate quickly, or just pick one from citylots2, no need to run all of them during development. The source of instability is not known as of now (I only started diving deep into this code). Fixing it should allow reducing the time noticeably, but it will not be seconds anyway.
Anything that takes 4hrs to run is "expensive" by some measure. But OK. The fact that the benchmarking results aren't stable is worrying, because it suggests that Ruler's users could experience that lack of stability in production, which could be a real problem in some applications. My first guess would be some JVM memory issue? Anyhow, I think it's probably important to understand. Welcome to the Ruler world!
This instability is worrying because from what I have seen the numbers are sometimes more stable within a fork than between forks, but the difference is usually not high enough (below maybe 5%) to consider it dangerous. And that ~5% is what I see in my JMH benchmarks with proper warmup and measurement time, while the current single-shot style varies more. I agree that GC could probably be the issue here since the code does lots of allocations, but maybe there are other sources of nondeterminism as well.
@svladykin Is it possible to check what's the long pole for the test runs? Wondering if most of the latency right now is reading/parsing the citylots JSON into memory for each test run independently.
@baldawar 14 simple tests * 5 forks * (1 minute warmup + 1 minute measurement) = 140 minutes
Also, I checked the throughput numbers for sanity and they are better than in the existing benchmarks. This long run time is not a bug, it is a feature to get statistically meaningful results. Btw, if you open the JMH logs now you can see how different the numbers are between forks of the same benchmark, while they are much closer within a fork. I think this should be investigated. If we fix this we can probably reduce the number of forks.
This is how the JMH report looks.
Here is a good example of in-fork numbers vs across-forks numbers for the same benchmark (EXACT on Java 8).
@svladykin What's that running on, your desktop? EC2 instance? (Trying to think of reasons why forks should differ in performance, coming up empty so far.)
@timbray It is from the GitHub Actions run for this PR: https://github.com/aws/event-ruler/pull/178/checks#step:6:1019 It is hard to say why this happens without investigation. I'm not yet an expert in this library and maybe there is some actual non-deterministic code in there. But maybe it is different memory layout, inconsistent GC activity or JIT compilation. At least we can see that it happens in all tested Java versions.
You can also add -prof or -lprof to see what's happening underneath: https://github.com/openjdk/jmh/blob/master/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_35_Profilers.java#L125 Also, in case it helps, a long time ago when I looked into JMH benchmarks for Ruler, I bookmarked this paper to watch out for caveats with using JMH: http://asgaard.ece.ualberta.ca/papers/Journal/TSE_2019_Costa_Whats_Wrong_With_My_Benchmark_Results_Studying_Bad_Practices_in_JMH_Benchmarks.pdf
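For example, the GC profiler can also be attached programmatically through JMH's Options API; a minimal sketch, where the "CityLots2" include pattern is just an illustration and not necessarily the actual benchmark class name:

    import org.openjdk.jmh.profile.GCProfiler;
    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.options.Options;
    import org.openjdk.jmh.runner.options.OptionsBuilder;

    public class ProfiledBenchmarkRun {
        public static void main(String[] args) throws Exception {
            Options opts = new OptionsBuilder()
                    .include("CityLots2")          // illustrative benchmark name pattern
                    .addProfiler(GCProfiler.class) // report allocation rate and GC counts per benchmark
                    .build();
            new Runner(opts).run();
        }
    }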
@baldawar Thank you for the pointers, I'm aware of the profilers and these common pitfalls.
Did some JVM tuning to reduce variability: the major source was JIT compilation of Jackson code. I did not have time to dive deeper and find out what exact place (or places?) in the Jackson code causes this, so I disabled inlining for Jackson to keep the JIT from performing overly aggressive optimizations on it. This lowers the overall performance a little bit, but we should not care about that: the goal of these benchmarks is to measure event-ruler throughput, not Jackson. Reduced benchmark time given the improved stability: now it should finish in under 80 minutes. Also, improved the output format.
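For reference, disabling inlining for a whole package can be done with HotSpot's -XX:CompileCommand flag passed to the forked benchmark JVMs via JMH's @Fork annotation; a minimal sketch, where the class name is hypothetical and the exact flags used in this PR may differ:

    import org.openjdk.jmh.annotations.Fork;

    // Hypothetical sketch: tell HotSpot not to inline any Jackson method in the
    // forked benchmark JVMs, so Jackson's inlining decisions stop contributing
    // to run-to-run variance. The actual flags in the PR may differ.
    @Fork(value = 5, jvmArgsAppend = "-XX:CompileCommand=dontinline,com.fasterxml.jackson.*::*")
    public class CityLots2Benchmark {
        // ... @Benchmark methods as before ...
    }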
Also, in CI switched the JDK distribution from temurin to corretto.
There's a fair bit of difference in performance between your last results and #178 (comment). Is that because both are on two different machines?
Yes, these numbers are from a different box, but we should expect the numbers to be different anyway because of disabling inlining for Jackson. I think it is ok, because we should not focus on the absolute numbers since they will differ across hardware anyway; instead we should focus on being able to improve, or at least not lose, relative throughput numbers before and after any code change in event-ruler. The new numbers from GitHub Actions for Java 8 are also lower.
Interesting, so inlining for Jackson leads to a 10~20% performance difference. I'd take consistent numbers for the tests, but still good to know. Skimming the PR, everything looks good. Will do a second review before merging. Sidenote: if it's easy for you to set up, it would be great to publish the results as a comment on the PR. I was looking at https://github.com/benchmark-action/github-action-benchmark but there's a non-zero amount of work to set it up, so I planned on doing it later (unless you get to it first).
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - name: Set up JDK ${{ matrix.java }}
        uses: actions/setup-java@v4
        with:
-         distribution: 'temurin'
+         distribution: 'corretto'
          java-version: ${{ matrix.java }}
          cache: 'maven'
      - name: Verify with Maven
        run: mvn --batch-mode --errors --update-snapshots verify
      - name: Run benchmarks
This is kept because we haven't moved everything yet?
I have ported everything from CL2Benchmark but not CL2NoCompileBenchmark. Not sure if it is actually an important use case.
They are important, so I'm happy to keep the "Run Benchmarks" step until the JMH benchmarks also cover the tests that check rule building, memory usage, et al.
private void addRules(String... rules) throws Exception {
    for (String rule : rules) {
        String rname = String.format("r%d", ruleCount++);
From my experience "r" + ruleCount would be much faster and allow tests to wrap up sooner.
I copy-pasted this code. This advice makes sense for cases where we generate rule names all the time, but here we do that just a few times on setup at the beginning and then run a benchmark for at least 90 seconds. Such an optimization will make zero difference.
I guess it depends on how long setup takes. If it's a few seconds, then not an issue. If it's minutes, then it would be worth the improvement.
Yes, exactly. Here for simple benchmarks we have under 10 rules; for complex ones more, but still a negligible number.
@Benchmark
@Warmup(iterations = 4, batchSize = 1, time = 10, timeUnit = SECONDS)
@Measurement(iterations = 5, batchSize = 1, time = 10, timeUnit = SECONDS)
public void group01Simple(MachineStateSimple machineState, CityLots2State cityLots2State, Blackhole blackhole) throws Exception {
    run(machineState, cityLots2State, blackhole);
}
@Benchmark
@Warmup(iterations = 2, batchSize = 1, time = 60, timeUnit = SECONDS)
@Measurement(iterations = 3, batchSize = 1, time = 60, timeUnit = SECONDS)
public void group02Complex(MachineStateComplex machineState, CityLots2State cityLots2State, Blackhole blackhole) throws Exception {
    run(machineState, cityLots2State, blackhole);
}
Is it worth splitting this by each matcher to parallelize the runs and reduce the runtime even more?
If you go this route, you would be able to get away with building a HashMap of rule machines during setup() and then pass the machine during load tests. It'll reduce the number of scaffolding classes for benchmarks.
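A minimal sketch of that idea, assuming event-ruler's Machine API; the state class name and the sample rule here are hypothetical:

    import java.util.HashMap;
    import java.util.Map;

    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.Setup;
    import org.openjdk.jmh.annotations.State;

    import software.amazon.event.ruler.Machine;

    // Hypothetical sketch: one @State class holding all pre-built machines keyed
    // by matcher name, instead of a separate scaffolding class per matcher.
    @State(Scope.Benchmark)
    public class MachinesState {
        final Map<String, Machine> machines = new HashMap<>();

        @Setup
        public void setup() throws Exception {
            Machine exact = new Machine();
            exact.addRule("r0", "{\"properties\": {\"STREET\": [\"FOLSOM\"]}}"); // illustrative rule
            machines.put("EXACT", exact);
            // ... build and register the machines for the other matchers ...
        }
    }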
This grouping is not done for parallelism; parallelizing these benchmarks on the same box is a bad idea because the results will be noisy.
The reason for this grouping is to have a different setup for different benchmarks: for simple ones it is 40 seconds warmup + 50 seconds measurement, for complex ones it is 2 minutes warmup + 3 minutes measurement, just because every call for complex benchmarks takes much longer.
We could split across multiple boxes for parallelism, but I don't think it is worth the effort, and it would probably be harder to collect the results.
What we actually want to do regarding parallelism is to run every benchmark with 2 threads (and check that throughput is ~2x of 1 thread) to make sure that we don't have any contention or false sharing. Right now these benchmarks are single-threaded and thus we don't actually know if we have an issue. I will push this change.
I can look into benchmark actions, but I guess it makes sense to do this separately.
It's a nice-to-have. If it's a lot of work, then it's not worth the effort of clubbing it with this PR. I'd favor merging this in and getting to that sometime in the future.
It should not be too much work, but I see that it requires creating a separate branch to store the results, so it is going to require multiple PRs anyway.
Description of changes:
Added JMH benchmarks to have more reliable performance data. Since the performance numbers are not stable across runs, I had to use 5 forks per benchmark. Total run time should be under 4 hours (GitHub Actions allows jobs to run for up to 6 hours: https://docs.github.com/en/actions/administering-github-actions/usage-limits-billing-and-administration#usage-limits).
Benchmark / Performance (for source code changes):
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.