pa is an efficient tool for analyzing performance data based on statistical simulation, i.e., bootstrap. The performance data can come from measurements such as performance tests or benchmarks.
There are two main ways to analyze performance data with pa:
- Single version analysis: statistic (e.g., mean) + variability (confidence interval of the statistic)
- Analysis between two versions: confidence interval ratio of a statistic (e.g., mean)
Inspired by [1], pa employs a Monte Carlo technique called bootstrap [2] to estimate the population confidence interval from a given sample. It uses hierarchical random re-sampling with replacement [3]. In the context of performance analysis, these hierarchical levels correspond to the levels at which measurements are repeated in order to obtain reliable results. The hierarchical levels are:
- invocation (level 1)
- iteration (level 2)
- fork (level 3)
- trial (level 4)
- instance (level 5)
Higher levels are composed of multiple occurrences of lower levels, i.e., an iteration (level 2) consists of many invocations (level 1), and so on.
pa requires only Go version 1.14 or higher (see the Go install page).
Install pa by running `go get github.com/chrstphlbr/pa`.
pa comes with a simple command line interface (optional flags in [...] with their defaults):

```
pa [-bs 10000] [-is 0] [-sl 0.01] [-st mean] [-os] [-m 1] [-tra id:id] \
  file_1 \
  [file_2 ... file_n]
```
If one file (`file_1`) is provided, the single version analysis is performed, i.e., the confidence intervals of a single performance experiment are computed.
If multiple files are provided, the two version analysis is performed: the confidence intervals for both versions and the confidence interval ratio between the two versions are computed.
In the simplest case, two files are provided: `file_1` for version 1 and `file_2` for version 2.
It is also possible to provide multiple files per version (of equal number) by setting the flag `-m`.
Note that the files MUST be sorted alphabetically by their benchmarks (see section "Input Files").
Flags:
- `-bs` defines the number of bootstrap simulations, i.e., how many random samples are taken to estimate the population distribution.
- `-is` defines how many invocation samples are used (0 takes the mean across all invocations of an iteration, -1 takes all invocations, and > 0 takes that number of samples).
- `-sl` defines the significance level. The confidence level for the confidence intervals is then `1 - sl`. The default is 0.01, which corresponds to a 99% confidence level.
- `-st` defines the statistic for which a confidence interval is computed. The default is `mean`; another option is `median`.
- `-os` defines whether the statistic, as set by `-st`, is included in the output file.
- `-m` sets the number of files per version (control and test group). For example, with `-m 3`, pa expects 6 files, where `file_1`, `file_2`, and `file_3` belong to version 1, and `file_4`, `file_5`, and `file_6` belong to version 2.
- `-tra` defines the transformation(s) applied to the benchmark results (i.e., the file(s)), in the form `transformer1:transformer2`, where `transformer1` is applied to the first (control) group and `transformer2` is applied to the second (test) group (if it exists). A transformer is either `id` (identity, no transformation) or `f0.0` ('f' for factor, followed by a user-specified float64 value).
pa expects CSV input files of the following form. For JMH benchmark results, the tool bencher can transform JMH JSON output to this CSV file format.

```
project;commit;benchmark;params;instance;trial;fork;iteration;mode;unit;value_count;value
```
The columns represent the following values:
- `project` is the project name
- `commit` is the project version, e.g., a commit hash
- `benchmark` is the fully-qualified name of the benchmark method
- `params` are the performance parameters (not the function/method parameters) of the benchmark in comma-separated form. Every parameter consists of a name and a value, separated by an equals sign (`name=value`). For example, JMH supports performance parameters through its `@Param` annotation
- `instance` is the name of the instance or machine (level 5)
- `trial` is the number of the trial (level 4)
- `fork` is the fork number (level 3). For example, JMH supports forks through its `@Fork` annotation
- `iteration` is the iteration number within a fork (level 2)
- `mode` is the benchmark mode. For example, JMH supports average time (`avgt`), throughput (`thrpt`), and sample time (`sample`)
- `unit` is the measurement unit of the benchmark value. Depending on the `mode`, the unit can be `ns/op` for average time or `op/s` for throughput
- `value_count` is the number of invocations (level 1) in which `value` occurred in this iteration. Every iteration can have multiple values (i.e., invocations), which are represented as a histogram. Each histogram value corresponds to one CSV row, and the number of occurrences of that value is given by `value_count`
- `value` is the performance metric in the given `unit`
IMPORTANT: the input files must be sorted by `benchmark` and `params`, otherwise the tool will not work correctly.
This is because input files can be large, and pa therefore works on file input streams.
pa writes the results in CSV form to stdout. The output can contain three types of rows:
- rows starting with `#` are comments
- empty rows
- all other rows are CSV data rows
The columns are:
- `benchmark` is the name of the benchmark
- `params` are the function/method parameters of the benchmark. pa does not populate this column, because the input format does not provide the function/method parameters
- `perf_params` is a comma-separated list of performance parameters. See column `params` of the input files for comparison
- `st` is the statistic the confidence interval is for. Can be `mean` or `median`
- `ci_l` is the lower bound of the confidence interval
- `ci_u` is the upper bound of the confidence interval
- `cl` is the confidence level of the confidence interval
For the single version analysis, the output is a CSV with the following columns (without `-os`):

```
benchmark;params;perf_params;ci_l;ci_u;cl
```

And with the statistic, as set by `-os`, it has the following columns:

```
benchmark;params;perf_params;st;ci_l;ci_u;cl
```
For the two version analysis, the output is a CSV with the following columns (without `-os`):

```
benchmark;params;perf_params;v1_ci_l;v1_ci_u;v1_cl;v2_ci_l;v2_ci_u;v2_cl;ratio_st;ratio_ci_l;ratio_ci_u;ratio_cl
```

And with the statistic, as set by `-os`, it has the following columns:

```
benchmark;params;perf_params;v1_st;v1_ci_l;v1_ci_u;v1_cl;v2_st;v2_ci_l;v2_ci_u;v2_cl;ratio_st;ratio_ci_l;ratio_ci_u;ratio_cl
```

Compared to the single version analysis, the two version analysis has three or four (without or with `-os`) columns for each of the two versions (`v1` and `v2`), plus the confidence interval for the ratio between the two versions (`ratio`).
[1] T. Kalibera and R. Jones, “Quantifying performance changes with effect size confidence intervals”, University of Kent, Technical Report 4–12, June 2012. Available: URL
[2] A. C. Davison and D. V. Hinkley, "Bootstrap Methods and their Application", Cambridge University Press, 1997.
[3] S. Ren, H. Lai, W. Tong, M. Aminzadeh, X. Hou, and S. Lai, “Nonparametric bootstrapping for hierarchical data”, Journal of Applied Statistics, vol. 37, no. 9, pp. 1487–1498, 2010. Available: DOI