Install https://github.com/async-profiler/async-profiler/releases/tag/v3.0 and configure it as described as per https://github.com/async-profiler/async-profiler?tab=readme-ov-file#basic-usage.
Build a proper JMH benchmark jar to have more options to run the benchmark.
$ ./gradlew jmhJar
On x86 Linux, run QueryEntityLazyInitCollectionLoop
attaching wall/cpu and alloc profiling, gc and perfnorm
with 2 forks and 100 loop iterations (it's a specific benchmark parameter):
$ java -jar basic/target/libs/hibernate-orm-benchmark-basic-1.0-SNAPSHOT-jmh.jar QueryEntityLazyInitCollectionLoop -f 2 -prof gc -prof perfnorm -prof "async:rawCommand=alloc,wall;event=cpu;output=jfr;dir=/tmp;libPath=${ASYNC_PROFILER_HOME}/lib/libasyncProfiler.so" -pcount=100
The rawCommand
parameter has been used to additionally set secondary events which are not yet included into the JMH async-profiler plugin
i.e. wall-clock profiling and allocation profiling (using the default alloc
event thresholds, see JVMTI_EVENT_SAMPLED_OBJECT_ALLOC
at https://docs.oracle.com/en/java/javase/17/docs/specs/jvmti.html i.e. 512KB).
To change the allocation threshold replace alloc
with alloc=128
for 128 bytes threshold.
Running the previous command should print something like (assuming Linux perf to be installed):
Secondary result "org.hibernate.benchmark.queryl1hit.QueryEntityLazyInitCollectionLoop.test:·async":
Async profiler results:
/tmp/org.hibernate.benchmark.queryl1hit.QueryEntityLazyInitCollectionLoop.single-Throughput/jfr-cpu.jfr
Async profiler results:
/tmp/org.hibernate.benchmark.queryl1hit.QueryEntityLazyInitCollectionLoop.single-Throughput/jfr-cpu.jfr
and finally the results:
Benchmark (count) Mode Cnt Score Error Units
QueryEntityLazyInitCollectionLoop.test 100 thrpt 6 40.391 ± 1.919 ops/s
QueryEntityLazyInitCollectionLoop.test:CPI 100 thrpt 2 0.303 clks/insn
QueryEntityLazyInitCollectionLoop.test:IPC 100 thrpt 2 3.300 insns/clk
QueryEntityLazyInitCollectionLoop.test:L1-dcache-load-misses 100 thrpt 2 5329572.054 #/op
QueryEntityLazyInitCollectionLoop.test:L1-dcache-loads 100 thrpt 2 173216246.787 #/op
QueryEntityLazyInitCollectionLoop.test:L1-icache-load-misses 100 thrpt 2 192632.126 #/op
QueryEntityLazyInitCollectionLoop.test:L1-icache-loads 100 thrpt 2 14780212.318 #/op
QueryEntityLazyInitCollectionLoop.test:branch-misses 100 thrpt 2 187771.566 #/op
QueryEntityLazyInitCollectionLoop.test:branches 100 thrpt 2 92994927.715 #/op
QueryEntityLazyInitCollectionLoop.test:cycles 100 thrpt 2 140706458.155 #/op
QueryEntityLazyInitCollectionLoop.test:dTLB-load-misses 100 thrpt 2 38567.276 #/op
QueryEntityLazyInitCollectionLoop.test:dTLB-loads 100 thrpt 2 3969019.541 #/op
QueryEntityLazyInitCollectionLoop.test:iTLB-load-misses 100 thrpt 2 16704.402 #/op
QueryEntityLazyInitCollectionLoop.test:iTLB-loads 100 thrpt 2 414076.369 #/op
QueryEntityLazyInitCollectionLoop.test:instructions 100 thrpt 2 464215980.492 #/op
QueryEntityLazyInitCollectionLoop.test:stalled-cycles-frontend 100 thrpt 2 13056272.972 #/op
QueryEntityLazyInitCollectionLoop.test:·async 100 thrpt NaN ---
QueryEntityLazyInitCollectionLoop.test:·gc.alloc.rate 100 thrpt 6 1155.831 ± 53.937 MB/sec
QueryEntityLazyInitCollectionLoop.test:·gc.alloc.rate.norm 100 thrpt 6 30431129.790 ± 11700.499 B/op
QueryEntityLazyInitCollectionLoop.test:·gc.count 100 thrpt 6 22.000 counts
QueryEntityLazyInitCollectionLoop.test:·gc.time 100 thrpt 6 25.000 ms
The position of the jfr
files can be used along with the async-profiler converter, to produce the 3 types of flamegraphs we're interested in.
For alloc flamegraphs:
$ java -cp ${ASYNC_PROFILER_HOME}/lib/converter.jar jfr2flame --alloc --total /tmp/org.hibernate.benchmark.queryl1hit.QueryEntityLazyInitCollectionLoop.single-Throughput/jfr-cpu.jfr alloc.html
The --total
flag is used to produce flamegraph samples based on the configred allocation threshold, aiming to show the total size of allocations.
Instead, for wall clock and cpu flamegraphs:
$ java -cp ${ASYNC_PROFILER_HOME}/lib/converter.jar jfr2flame --state default /tmp/org.hibernate.benchmark.queryl1hit.QueryEntityLazyInitCollectionLoop.single-Throughput/jfr-cpu.jfr cpu.html
$ java -cp ${ASYNC_PROFILER_HOME}/lib/converter.jar jfr2flame --state runnable,sleeping /tmp/org.hibernate.benchmark.queryl1hit.QueryEntityLazyInitCollectionLoop.single-Throughput/jfr-cpu.jfr wall.html
Why we're using --state
here?
see async-profiler/async-profiler#740 (comment) for more info: TLDR both wall and cpu events are collected
using jdk.ExecutionSample
, hence to distinguish between them, async-profiler uses the state
field to differentiate between them.
To better understand this, we can type:
$ jfr print --events jdk.ExecutionSample /tmp/org.hibernate.benchmark.queryl1hit.QueryEntityLazyInitCollectionLoop.test-Throughput-count-100/jfr-cpu.jfr | grep -c STATE_DEFAULT
302
$ jfr summary /tmp/org.hibernate.benchmark.queryl1hit.QueryEntityLazyInitCollectionLoop.test-Throughput-count-100/jfr-cpu.jfr
Version: 2.0
Chunks: 1
Start: 2024-05-17 12:32:19 (UTC)
Duration: 3 s
Event Type Count Size (bytes)
=========================================================
jdk.ObjectAllocationInNewTLAB 7087 141954
jdk.ExecutionSample 3509 55622
jdk.NativeLibrary 28 2157
jdk.ActiveSetting 23 705
jdk.InitialSystemProperty 17 1066
jdk.CPULoad 3 63
jdk.GCHeapSummary 3 123
jdk.Metadata 1 6952
jdk.Checkpoint 1 294850
jdk.ActiveRecording 1 75
jdk.OSInformation 1 108
jdk.CPUInformation 1 365
jdk.JVMInformation 1 348
jdk.ObjectAllocationOutsideTLAB 0 0
jdk.JavaMonitorEnter 0 0
jdk.ThreadPark 0 0
profiler.Log 0 0
profiler.Window 0 0
profiler.LiveObject 0 0
Which shows 2 interesting things:
- as expected, the cpu-related
jdk.ExecutionSample
events are just a part of the overall ones e.g. 302 vs 3509 - there are other events collected by flight-recorder while
async-profiler
agent has been attached