PAPI Interface

MeisterEule edited this page Dec 13, 2022 · 5 revisions

Vftrace incorporates the PAPI interface, which provides access to hardware counters on many x86 platforms.

Compiling Vftrace

Add the --with-papi argument to your configure call and give it the PAPI installation directory as its value. Vftrace expects this directory to contain an include directory with papi.h and a lib directory with libpapi.so.
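Before running configure, it can help to confirm that the directory you intend to pass has the layout described above. The following Python sketch checks for the two files; the helper name check_papi_prefix is hypothetical and not part of Vftrace.

```python
from pathlib import Path


def check_papi_prefix(prefix):
    """Return the expected PAPI files that are missing under `prefix`.

    Hypothetical helper, not part of Vftrace. It only checks the layout
    described above: include/papi.h and lib/libpapi.so.
    """
    root = Path(prefix)
    expected = [root / "include" / "papi.h", root / "lib" / "libpapi.so"]
    return [str(path) for path in expected if not path.exists()]
```

An empty list means both files were found, so the directory is a plausible value for --with-papi. A non-empty list names the files that configure would likely fail to locate.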

Available PAPI events

Hardware counters are specific to individual CPUs and can differ between platforms. Which counters you can access also depends on your permissions and on the settings of the operating system. To see which counters are available via the PAPI interface, run papi_native_avail. It displays a list of hardware counters and their PAPI names, e.g.

[...]
===============================================================================
 Native Events in Component: perf_event
===============================================================================
| ix86arch::UNHALTED_CORE_CYCLES                                               |
|            count core clock cycles whenever the clock signal on the specific |
|            core is running (not halted)                                      |
|     :e=0                                                                     |
|            edge level (may require counter-mask >= 1)                        |
[...]

In addition to these "native" events, PAPI has a list of "preset" events. These are hardware counters that are available on most platforms. Run papi_avail to see the list of preset events and which of them are supported on your system, e.g.

[...]
================================================================================
  PAPI Preset Events
================================================================================
    Name        Code    Avail Deriv Description (Note)
PAPI_L1_DCM  0x80000000  Yes   No   Level 1 data cache misses
PAPI_L1_ICM  0x80000001  Yes   No   Level 1 instruction cache misses
PAPI_L2_DCM  0x80000002  Yes   Yes  Level 2 data cache misses
[...]
PAPI_L3_STM  0x8000000f  No    No   Level 3 store misses
[...]
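If you want to filter this listing programmatically, a small parser is enough. The sketch below assumes the column layout shown above (Name, Code, Avail, Deriv, Description); the function name available_presets is illustrative and the sample text is copied from the excerpt.

```python
def available_presets(papi_avail_output):
    """Return names of preset events whose Avail column reads 'Yes'.

    Assumes the column layout shown above; other layouts are not handled.
    """
    names = []
    for line in papi_avail_output.splitlines():
        fields = line.split()
        # Event rows start with a PAPI_ name followed by a hex code.
        is_event_row = (
            len(fields) >= 4
            and fields[0].startswith("PAPI_")
            and fields[1].startswith("0x")
        )
        if is_event_row and fields[2] == "Yes":
            names.append(fields[0])
    return names


sample = """\
    Name        Code    Avail Deriv Description (Note)
PAPI_L1_DCM  0x80000000  Yes   No   Level 1 data cache misses
PAPI_L1_ICM  0x80000001  Yes   No   Level 1 instruction cache misses
PAPI_L2_DCM  0x80000002  Yes   Yes  Level 2 data cache misses
PAPI_L3_STM  0x8000000f  No    No   Level 3 store misses
"""
print(available_presets(sample))  # ['PAPI_L1_DCM', 'PAPI_L1_ICM', 'PAPI_L2_DCM']
```

In practice the input could come from running papi_avail through subprocess.run with capture_output=True and text=True, and passing the captured stdout to the function.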

Vftrace hardware observables

Defining hardware observables

Raw hardware counters are rarely meaningful on their own for performance evaluation; most observables are derived from them. As a simple example, the clock frequency is the number of cycles executed in a time interval divided by the length of that interval. In practice, arbitrary functions of hardware counters can be of interest. For this reason, Vftrace lets you define hardware observables with mathematical expressions. They are specified in the config file as follows:

  "papi": {
      "disable": false,
      "show_counters": true,
      "sort_by_column": 0,
      "counters": [
          {
              "native": "perf::CYCLES",
              "symbol": "f1"
          },
          {
              "native": "FP_ARITH:SCALAR_DOUBLE",
              "symbol": "fpdouble"
          },
          {
              "preset": "PAPI_DP_OPS",
              "symbol": "fp_preset"
          }
      ],
      "observables": [
          {
              "name": "f",
              "formula": "f1 / T * 1e-6",
              "unit": "MHz"
          },
          {
              "name": "perf",
              "formula": "fpdouble / T * 1e-6",
              "unit": "MFlop/s"
          },
          {
              "name": "perf_preset",
              "formula": "fp_preset / T * 1e-6",
              "unit": "MFlop/s"
          }
      ]
  }

There are two sections. The first, counters, specifies PAPI events and associates a symbol with each of them. You can use either native events or preset events. The second, observables, uses these symbols in formulas to define hardware observables. Each observable requires a name and can optionally have a unit.

Note that the formulas apply per region. Hardware counters are sampled at function entry and exit, and the differences are accumulated. The formulas specified in the papi section are evaluated at the end of the profiling run. The variable T used above is a builtin variable that holds the time spent in a region in seconds. See below for the list of builtin variables.
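The region-wise accumulation can be illustrated with a toy model. This is a minimal Python sketch and not Vftrace's implementation: the class RegionProfile, its method names, and the counter readings are made up for illustration, and Python's eval stands in for the expression evaluator.

```python
from collections import defaultdict


class RegionProfile:
    """Toy model of per-region accumulation; not Vftrace's implementation."""

    def __init__(self):
        self.counters = defaultdict(float)  # symbol -> accumulated count
        self.time = 0.0                     # accumulated seconds (T)
        self.calls = 0                      # number of calls (CALLS)

    def record_call(self, entry, exit_, t_entry, t_exit):
        # Accumulate the exit-minus-entry difference of every counter.
        for symbol, start in entry.items():
            self.counters[symbol] += exit_[symbol] - start
        self.time += t_exit - t_entry
        self.calls += 1

    def evaluate(self, formula):
        # Stand-in evaluator: only suitable for trusted, hand-written formulas.
        variables = dict(self.counters, T=self.time, CALLS=self.calls)
        return eval(formula, {"__builtins__": {}}, variables)


region = RegionProfile()
# Two calls with made-up readings of the cycle counter "f1".
region.record_call({"f1": 1_000}, {"f1": 2_801_000}, t_entry=0.000, t_exit=0.001)
region.record_call({"f1": 5_000_000}, {"f1": 7_800_000}, t_entry=0.010, t_exit=0.011)

# Observable "f" from the config above, evaluated once after both calls.
print(region.evaluate("f1 / T * 1e-6"))  # roughly 2800 (MHz)
```

The formula sees the accumulated totals (5.6 million cycles over about 2 ms), not the values of an individual call, which is why the result describes the region as a whole.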

Logfile tables

By default, Vftrace prints a dedicated PAPI table that shows each stack entry and its observable values. Note that the _all logfile contains the sum over all MPI ranks, so its values can look artificially high. The following is an example of a PAPI measurement on one rank:

Runtime PAPI profile - Observables

+----------+------------+-------------+----------------+-----------------------+
|  #Calls  |   Func     |   f [MHz]   | perf [MFlop/s] | perf_preset [MFlop/s] |
+----------+------------+-------------+----------------+-----------------------+
|        6 | function_1 | 2866.714646 |       2.662828 |              2.590860 |
|        1 |  mpi_bcast | 2798.994213 |       0.007382 |              0.005111 |
|        1 | function_2 | 2797.122556 |     963.943687 |            963.767157 |
[...]

Use the sort_by_column option in the papi section of the config file to choose the column by which the table is sorted. In addition to the table of derived observables, you can print a table of the raw hardware counters by setting show_counters to true.

Builtin Variables

  • T: Time spent in region in seconds
  • CALLS: Number of calls of a region
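Because formulas may only refer to counter symbols and these builtin variables, a quick pre-flight check can catch typos before a profiling run. The sketch below is a hypothetical helper, not a Vftrace feature. It treats every other identifier as unknown, so any function names supported by the expression evaluator would have to be allow-listed separately.

```python
import json
import re

BUILTIN_VARIABLES = {"T", "CALLS"}


def undefined_symbols(papi_section):
    """Map observable name -> identifiers in its formula with no definition.

    Hypothetical check, not part of Vftrace.
    """
    defined = {c["symbol"] for c in papi_section["counters"]} | BUILTIN_VARIABLES
    problems = {}
    for obs in papi_section["observables"]:
        # \b keeps the exponent marker in literals such as 1e-6 from matching.
        used = set(re.findall(r"\b[A-Za-z_]\w*\b", obs["formula"]))
        unknown = sorted(used - defined)
        if unknown:
            problems[obs["name"]] = unknown
    return problems


papi = json.loads("""
{
  "counters": [
    {"native": "perf::CYCLES", "symbol": "f1"}
  ],
  "observables": [
    {"name": "f", "formula": "f1 / T * 1e-6", "unit": "MHz"},
    {"name": "perf", "formula": "fpdouble / T * 1e-6", "unit": "MFlop/s"}
  ]
}
""")
print(undefined_symbols(papi))  # {'perf': ['fpdouble']}
```

Here the observable perf is flagged because its formula uses fpdouble, which this shortened counters list does not define.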

Further remarks

  • Vftrace does not check whether the formulas are well-defined. For example, if a division by zero occurs, the resulting observable value is displayed as nan.
  • The formulas are evaluated with tinyexpr, which is included as a git submodule and downloaded automatically.