PAPI Interface
Vftrace incorporates the PAPI interface, which allows reading hardware counters on many x86 platforms.
Add the --with-papi argument to your configure call. Its value is the PAPI installation directory, in which Vftrace expects an include directory containing papi.h and a lib directory containing libpapi.so.
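Before running configure, it can help to check that the chosen directory has the layout described above. The following is a minimal sketch, not part of Vftrace: the helper names are hypothetical, and the `--with-papi=DIR` spelling assumes the usual autoconf convention for passing a value to a `--with-` option.

```python
from pathlib import Path


def missing_papi_files(prefix):
    """Return the files Vftrace expects under `prefix` that are not there.

    The expected layout (include/papi.h and lib/libpapi.so) is taken from
    the description above; this helper itself is hypothetical.
    """
    root = Path(prefix)
    expected = [root / "include" / "papi.h", root / "lib" / "libpapi.so"]
    return [str(path) for path in expected if not path.exists()]


def configure_argument(prefix):
    """Build the configure argument, assuming the autoconf --with-X=DIR form."""
    return f"--with-papi={prefix}"
```

An empty list from `missing_papi_files` means both files were found, and the string from `configure_argument` can then be appended to the configure call.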
Hardware counters are specific to individual CPUs and may differ between platforms. Moreover, which counters you can access also depends on your permissions and on the settings of the operating system.
To see which counters are available via the PAPI interface, use papi_native_avail. This displays a list of hardware counters and their PAPI names, e.g.
[...]
===============================================================================
Native Events in Component: perf_event
===============================================================================
| ix86arch::UNHALTED_CORE_CYCLES |
| count core clock cycles whenever the clock signal on the specific |
| core is running (not halted) |
| :e=0 |
| edge level (may require counter-mask >= 1) |
[...]
In addition to these "native" events, PAPI has a list of "preset" events. These are hardware counters which are available on most platforms. You can use papi_avail to see the list of preset events and which of them are supported on your system, e.g.
[...]
================================================================================
PAPI Preset Events
================================================================================
Name         Code        Avail  Deriv  Description (Note)
PAPI_L1_DCM  0x80000000  Yes    No     Level 1 data cache misses
PAPI_L1_ICM  0x80000001  Yes    No     Level 1 instruction cache misses
PAPI_L2_DCM  0x80000002  Yes    Yes    Level 2 data cache misses
[...]
PAPI_L3_STM  0x8000000f  No     No     Level 3 store misses
[...]
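The preset listing can also be filtered programmatically, for example to collect only the events marked as available. This is a rough sketch that assumes the whitespace-separated column layout of the excerpt above (Name, Code, Avail, Deriv, Description); real papi_avail output may contain additional header and footer lines, which the parser skips. The function names are hypothetical.

```python
# Sample rows copied from the papi_avail excerpt above.
SAMPLE = """\
Name        Code        Avail  Deriv  Description (Note)
PAPI_L1_DCM 0x80000000  Yes    No     Level 1 data cache misses
PAPI_L1_ICM 0x80000001  Yes    No     Level 1 instruction cache misses
PAPI_L2_DCM 0x80000002  Yes    Yes    Level 2 data cache misses
PAPI_L3_STM 0x8000000f  No     No     Level 3 store misses
"""


def parse_presets(text):
    """Parse preset rows into a dict keyed by event name."""
    presets = {}
    for line in text.splitlines():
        fields = line.split(None, 4)
        # Skip headers, separators and anything that is not a preset row.
        if len(fields) < 4 or not fields[0].startswith("PAPI_"):
            continue
        name, code, avail, deriv = fields[:4]
        presets[name] = {
            "code": int(code, 16),
            "available": avail == "Yes",
            "derived": deriv == "Yes",
            "description": fields[4] if len(fields) > 4 else "",
        }
    return presets


def available_presets(text):
    """Return the names of all presets whose Avail column is 'Yes'."""
    return [name for name, info in parse_presets(text).items() if info["available"]]
```

To apply it to a real system, the text could be obtained by capturing the standard output of papi_avail, e.g. with `subprocess.run`.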
Raw hardware counters are rarely meaningful on their own for performance evaluation; most observables are derived from them. As a simple example, the clock frequency is the number of cycles executed in a time interval divided by the length of that interval. In practice, arbitrary functions of hardware counters can be of interest. For this reason, Vftrace lets you define hardware observables with mathematical expressions. They are specified in the config file as follows:
"papi": {
  "disable": false,
  "show_counters": true,
  "sort_by_column": 0,
  "counters": [
    {
      "native": "perf::CYCLES",
      "symbol": "f1"
    },
    {
      "native": "FP_ARITH:SCALAR_DOUBLE",
      "symbol": "fpdouble"
    },
    {
      "preset": "PAPI_DP_OPS",
      "symbol": "fp_preset"
    }
  ],
  "observables": [
    {
      "name": "f",
      "formula": "f1 / T * 1e-6",
      "unit": "MHz"
    },
    {
      "name": "perf",
      "formula": "fpdouble / T * 1e-6",
      "unit": "MFlop/s"
    },
    {
      "name": "perf_preset",
      "formula": "fp_preset / T * 1e-6",
      "unit": "MFlop/s"
    }
  ]
}
There are two sections. The first specifies PAPI events and associates a symbol with each of them; you can use either native or preset events. In the second section, these symbols are used in formulas that define hardware observables. Each observable requires a name and can optionally have a unit.
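Because the observables refer to counters only through their symbols, a misspelled or missing symbol is easy to overlook. The sketch below checks that every identifier used in a formula is either a declared counter symbol or one of the builtin variables T and CALLS. It is not part of Vftrace: it assumes the papi section sits at the top level of a JSON config file, the function name is hypothetical, and function names that the formula evaluator may support are not recognised and would be reported as unknown.

```python
import json
import re

# Builtin variables documented on this page.
BUILTINS = {"T", "CALLS"}

# Identifiers, excluding the exponent marker inside numbers such as 1e-6.
IDENTIFIER = re.compile(r"(?<![\w.])[A-Za-z_]\w*")

# Shortened config in which "fpdouble" is used but never declared.
CONFIG = """
{
  "papi": {
    "counters": [
      {"native": "perf::CYCLES", "symbol": "f1"},
      {"preset": "PAPI_DP_OPS", "symbol": "fp_preset"}
    ],
    "observables": [
      {"name": "f", "formula": "f1 / T * 1e-6", "unit": "MHz"},
      {"name": "perf", "formula": "fpdouble / T * 1e-6", "unit": "MFlop/s"}
    ]
  }
}
"""


def undefined_symbols(papi):
    """Map each observable name to the formula symbols that are not declared."""
    defined = BUILTINS | {counter["symbol"] for counter in papi.get("counters", [])}
    problems = {}
    for observable in papi.get("observables", []):
        used = set(IDENTIFIER.findall(observable["formula"]))
        unknown = sorted(used - defined)
        if unknown:
            problems[observable["name"]] = unknown
    return problems
```

For the shortened config above, `undefined_symbols(json.loads(CONFIG)["papi"])` reports that the observable perf uses the undeclared symbol fpdouble.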
Note that the formulas have to be understood region-wise. Hardware counters are sampled upon function entry and exit, and their difference is accumulated. The formulas specified in the papi section are evaluated at the end of the profiling run. The variable T used above is a builtin variable and specifies the time spent in a region in seconds. See below for the list of builtin variables.
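The region-wise evaluation can be illustrated with a small model. The sketch below is not Vftrace's implementation: the class and function names are hypothetical, and Python's `eval` merely stands in for the actual formula evaluator. It records counter values at entry and exit, accumulates the differences per region, and evaluates a formula once at the end. Python raises an exception on division by zero, so the sketch maps that case to nan to mirror the behaviour described on this page.

```python
import math


class Region:
    """Accumulated measurements for one region (hypothetical model)."""

    def __init__(self):
        self.calls = 0
        self.time = 0.0
        self.counters = {}
        self._entry = None

    def enter(self, now, sample):
        # Remember the clock and the counter values at function entry.
        self._entry = (now, dict(sample))

    def leave(self, now, sample):
        # Accumulate the entry/exit differences for this call.
        start, at_entry = self._entry
        self.calls += 1
        self.time += now - start
        for symbol, value in sample.items():
            delta = value - at_entry[symbol]
            self.counters[symbol] = self.counters.get(symbol, 0) + delta


def evaluate(formula, region):
    """Evaluate a formula on the accumulated values of one region."""
    variables = dict(region.counters, T=region.time, CALLS=region.calls)
    try:
        # eval is only a stand-in for the real expression evaluator.
        return float(eval(formula, {"__builtins__": {}}, variables))
    except ZeroDivisionError:
        return math.nan


region = Region()
# Two calls of 0.5 s each, with 1.4e9 cycles counted during each call.
region.enter(0.0, {"f1": 1_000})
region.leave(0.5, {"f1": 1_400_001_000})
region.enter(2.0, {"f1": 2_000_000_000})
region.leave(2.5, {"f1": 3_400_000_000})
frequency_mhz = evaluate("f1 / T * 1e-6", region)  # roughly 2800 MHz
```

The cycles counted between the two calls (while the region is not active) do not enter the result, which is what "region-wise" means here.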
By default, Vftrace prints a dedicated PAPI table, showing each stack entry and its observable measurements. Note that the _all logfile contains the sum over all MPI ranks, so these values might look artificially high. This is an example of a PAPI measurement on one rank:
Runtime PAPI profile - Observables
+----------+------------+-------------+----------------+-----------------------+
| #Calls | Func | f [MHz] | perf [MFlop/s] | perf_preset [MFlop/s] |
+----------+------------+-------------+----------------+-----------------------+
| 6 | function_1 | 2866.714646 | 2.662828 | 2.590860 |
| 1 | mpi_bcast | 2798.994213 | 0.007382 | 0.005111 |
| 1 | function_2 | 2797.122556 | 963.943687 | 963.767157 |
[...]
You can choose the column by which the table is sorted with the sort_by_column option in the papi section of the config file.
In addition to the table of derived observables, you can also print a table of the raw hardware counters by setting show_counters: true.
Builtin variables:
- T: Time spent in a region, in seconds
- CALLS: Number of calls of a region

- We do not check if the formulas are well-defined. For example, if a division by zero occurs, the resulting observable value will be displayed as nan.
- The evaluation of the formulas uses tinyexpr. It is included as a git submodule and downloaded automatically.