Skip to content

Runtime Control

MeisterEule edited this page Jan 23, 2023 · 13 revisions

The runtime behavior and output can be controlled using two environment variables and a configuration file.

Environment Variables

Vftrace checks for two environment variables upon launch of the application.

VFTR_OFF

Setting VFTR_OFF deactivates profiling by Vftrace. This option takes precedence over any other settings. Possible values to switch Vftrace off are: “Yes”, “True”, “1”. Possible values to switch Vftrace on are: “No”, “False”, “0”. Default is: “False”

export VFTR_OFF="yes"

VFTR_CONFIG

VFTR_CONFIG specifies the path to a config file in JSON format. The used configuration in JSON format can be found appended to the logfiles (see Application Profiling#Configuration). A config file with default settings can be found in the manpages for vftrace, or generated with the vftrace_generate_default_config tool (see Tools#vftrace_generate_default_config). If no config file is given the default config is used. A complete explanation of the options in the config file is given in the manpages or in the wiki (see Runtime Control#Config File). To check a config file for correctness and compatibility with vftrace the vftrace_check_config tool is provided (see Tools#vftrace_check_config).

export VFTR_CONFIG=config.json

Config File

The config file defines the runtime behavior of Vftrace and finetunes the contents of the log files. In order to use a file as a config it needs to be exported with the VFTR_CONFIG environment variable (see Runtime Control#VFTR_CONFIG). The used configuration in JSON format, including default values, can be found appended to the logfiles (see Application Profiling#Configuration). A config file with default settings can be found in the manpages for Vftrace, or generated with the vftrace_generate_default_config tool (see Tools#vftrace_generate_default_config). If no config file is given the default config is used. To check a config file for correctness and compatibility with Vftrace the vftrace_check_config tool is provided (see Tools#vftrace_check_config). The default config is:

{
   "off": false,
   "output_directory": ".",
   "outfile_basename": null,
   "logfile_for_ranks": "none",
   "print_config": true,
   "strip_module_names": false,
   "demangle_cxx": false,
   "include_cxx_prelude": false,
   "profile_table": {
      "show_table": true,
      "show_calltime_imbalances": false,
      "show_callpath": false,
      "show_overhead": false,
      "sort_table": {
         "column": "time_excl",
         "ascending": false
      }
   },
   "name_grouped_profile_table": {
      "show_table": false,
      "max_stack_ids": 8,
      "sort_table": {
         "column": "time_excl",
         "ascending": false
      }
   },
   "sampling": {
      "active": false,
      "sample_interval": 0.005000,
      "outbuffer_size": 8,
      "precise_functions": null
   },
   "mpi": {
      "show_table": true,
      "log_messages": true,
      "only_for_ranks": "all",
      "show_sync_time": false,
      "show_callpath": false,
      "sort_table": {
         "column": "none",
         "ascending": false
      }
   },
   "cuda": {
      "show_table": true,
      "sort_table": {
         "column": "time",
         "ascending": false
      }
   },
   "hwprof": {
      "type": "dummy",
      "disable": false,
      "show_observables": true,
      "show_counters": false,
      "show_summary": false,
      "sort_by_column": 0,
      "default_scenario": "",
      "counters": [
      ],
      "observables": [
      ]
   }
}
  • off (Boolean): turn on/off vftrace profiling.
  • output_directory (String): Directory for log- and vfd-files.
  • outfile_basename (String): Basename for log- and vfd-files (_all.log). Setting it to null results in the executable name as basename.
  • logfile_for_ranks (String): Specifies which ranks get local log-files. Valid values are "none", "all", "a-b", "a,b", or a mix of the latter two. E.g. "2,3,5-8,11,13" will generate local log-files for ranks 2,3,5,6,7,8,11,13.
  • print_config (Boolean): Whether to append the used config to the logfile.
  • strip_module_names (Boolean): Whether to strip module names from Fortran symbols to make them more readable. Example: my_lengthy_module_MOD_my_function will be stripped to my_function. The separator of module and function name depends on the compiler. We check for the most common ones, but in rare cases, when a different separator is used, this option might not work.
  • demangle_cxx (Boolean): Whether to demangle C++ symbols to make them more readable. This requires Vftrace to be build with libiberty support and the application to be linked to -liberty. Example: _Z9quicksortIiEviPT and _Z9quicksortIdEviPT will both be demangled to quicksort.
  • include_cxx_prelude (Boolean): Whether to profile instrumented function that are called before main in C++. If global static variables are initialized with function calls, or if constructors are used code can be executed before main was called. In rare cases this can break the stacktree construction in vftrace. Omitting the prelude leads to cleaner profiles. Example:
    void before_main(void) {
       return 1;
    }
    static int dummy = before_main();
    int main() {
       return 0;
    }
    
    before_main is executed before main is called. The default is to delay profiling until main was called, leading to a profile like:
    Runtime profile
    +-------+-----------+-----------+-----------+----------+--------+------+
    | Calls | t_excl[s] | t_excl[%] | t_incl[s] | Function | Caller | STID |
    +-------+-----------+-----------+-----------+----------+--------+------+
    |     1 |     0.000 |     100.0 |     0.000 |     main |   init |    0 |
    |     0 |     0.000 |       0.0 |     0.000 |     init |   ---- |    1 |
    +-------+-----------+-----------+-----------+----------+--------+------+
    
    Excluding the prelude from the profile can lead to a mismatch in the total runtime and the sum of exclusive times. If the function in the prelude are to be profiled this option can be set to true leading to inclusion of before_main (as well as other prelude related symbols) into the profile:
    Runtime profile
    +-------+-----------+-----------+-----------+-------------------------------------------------+-------------------------------------------------+------+
    | Calls | t_excl[s] | t_excl[%] | t_incl[s] |                    Function                     |                     Caller                      | STID |
    +-------+-----------+-----------+-----------+-------------------------------------------------+-------------------------------------------------+------+
    |     1 |     0.000 |      33.1 |     0.000 | _Z41__static_initialization_and_destruction_0ii |                 _GLOBAL__sub_I__Z11before_mainv |    2 | 
    |     1 |     0.000 |      30.7 |     0.000 |                 _GLOBAL__sub_I__Z11before_mainv |                                            init |    4 | 
    |     1 |     0.000 |      22.0 |     0.000 |                                            main |                                            init |    1 | 
    |     1 |     0.000 |      14.3 |     0.000 |                                _Z11before_mainv | _Z41__static_initialization_and_destruction_0ii |    0 | 
    |     0 |     0.000 |       0.0 |     0.000 |                                            init |                                            ---- |    3 | 
    +-------+-----------+-----------+-----------+-------------------------------------------------+-------------------------------------------------+------+
    
  • profile_table (Section): This section controls how the runtime profile table is printed in the logfile.
    • show_table (Boolean): Whether to print the runtime profile table.
    • show_calltime_imbalances (Boolean): Whether to include columns on the calltime imbalances across ranks. If a routine is executed on multiple ranks and they show vastly different runtimes it hints as badly distributed load. The Imbalances[%] column shows the largest deviation (positive value = more than average; negative value = less than average) from the average time among all ranks in percent. The on rank shows the rank number on which this largest deviation occurred. In the following example the routine work2 might need optimization and better load balancing, as it has to much work, compared to the average rank in this routine.
      +-------+-----------+-----------+-----------+---------------+---------+--------------+--------+------+
      | Calls | t_excl[s] | t_excl[%] | t_incl[s] | Imbalances[%] | on rank |   Function   | Caller | STID |
      +-------+-----------+-----------+-----------+---------------+---------+--------------+--------+------+
      |     4 |    52.002 |      55.8 |    52.002 |         23.07 |       3 |        work2 |   main |    1 |
      |     4 |    40.003 |      42.9 |    40.003 |          0.00 |       1 |        work1 |   main |    3 |
      ...
      
    • show_callpath (Boolean): Whether to include a column containing the complete callstack as shown Application Profiling#Global Call Sacks. This can make the tables extremely wide and unreadable, but unnecessitates the need to lookup the StackID in the global call stack list.
      +-------+-----------+-----------+-----------+--------------+--------+------+------------------------+
      | Calls | t_excl[s] | t_excl[%] | t_incl[s] |   Function   | Caller | STID |        Callpath        |
      +-------+-----------+-----------+-----------+--------------+--------+------+------------------------+
      ...
      |     1 |     0.000 |       0.0 |     0.000 |          foo |   main |    3 |          foo<main<init |
      |     1 |     0.000 |       0.0 |     0.000 |          bar |   main |    6 |          bar<main<init |
      |     1 |     0.000 |       0.0 |     0.000 |          foo |    bar |    5 |      foo<bar<main<init |
      ...
      
    • show_overhead (Boolean): Whether to include a column containing the function specific Vftrace overhead. This helps to identify functions where the runtime is small compared to the overhead introduced by Vftrace. If a function has a high overhead compared to its exclusive time and has a high callcount, the chances are good, that the pure call overhead (without a profiler) will be significant as well. To reduce the functions impact on the profile it could be excluded from the instrumentation.
      +----------+-----------+-----------+-----------+-------------+--------------+--------+------+
      |  Calls   | t_excl[s] | t_excl[%] | t_incl[s] | overhead[s] |   Function   | Caller | STID |
      +----------+-----------+-----------+-----------+-------------+--------------+--------+------+
      | 29860703 |    19.763 |      98.1 |    19.763 |      12.897 |      fib_rec |   main |    1 |
      ...
      
    • sort_table (Section): This section determines how the runtime profile table should be sorted.
      • column (String): Specifies which column of the profile table should be used to sort the table. Possible values are: "time_exlc" (exclusive time), "time_incl" (inclusive time), "calls" (number of calls), "stack_id", "overhead", and "none".
      • ascending (Boolean): Whether the table will be sorted in ascending, or descending order regarding the selected column.
  • name_grouped_profile_table (Section): This section controls how the name grouped runtime profile table is printed in the logfile. The runtime profile table has an entry for every unique call stack.
    Runtime profile
    +-------+-----------+-----------+-----------+--------------+--------+------+
    | Calls | t_excl[s] | t_excl[%] | t_incl[s] |   Function   | Caller | STID |
    +-------+-----------+-----------+-----------+--------------+--------+------+
    ...
    |     1 |     0.000 |       0.0 |     0.000 |          foo |   bar1 |    8 | 
    |     1 |     0.000 |       0.0 |     0.000 |          foo |   bar4 |    4 | 
    |     1 |     0.000 |       0.0 |     0.000 |          foo |   bar2 |    0 | 
    |     1 |     0.000 |       0.0 |     0.000 |          foo |   bar3 |    2 | 
    ...
    
    The name grouped profile table has an entry for every unique function name.
    Runtime profile (Grouped by function name)
    +-------+-----------+-----------+-----------+--------------+---------+
    | Calls | t_excl[s] | t_excl[%] | t_incl[s] |   Function   |  STIDs  |
    +-------+-----------+-----------+-----------+--------------+---------+
    ...
    |     4 |     0.000 |       0.0 |     0.000 |          foo | 0,2,4,8 |
    ...
    
    • show_table (Boolean): Whether to print the name grouped runtime profile table.
    • max_stack_ids (Number): Limits the number of stack IDs to be included in the STIDs column. If the list is truncated it is indicated by three dots.
      +-------+-----------+-----------+-----------+--------------+---------+
      | Calls | t_excl[s] | t_excl[%] | t_incl[s] |   Function   |  STIDs  |
      +-------+-----------+-----------+-----------+--------------+---------+
      ...
      |     4 |     0.000 |       0.0 |     0.000 |          foo | 0,2,... |
      ...
      
    • sort_table (Section): This section determines how the name grouped runtime profile table should be sorted.
      • column (String): Specifies which column of the name grouped profile table should be used to sort the table. Possible values are: "time_exlc" (exclusive time), "time_incl" (inclusive time), "calls" (number of calls), "stack_id", "overhead", and "none".
      • ascending (Boolean): Whether the table will be sorted in ascending, or descending order regarding the selected column.
  • sampling (Section): This section controls the sampling of function events, messages, and writing of vfd-files (see Application Profiling#Sampling).
    • active (Boolean): Whether to activate sampling and writing of vfd-files.
    • sample_interval (Number): Shortest duration between two consecutive samples. A larger interval helps keeping the file size small, but might miss important information. Setting this to 0 or negative values enables sampling of every event. Function entry/exit events are ignored until the sample_interval timer expired since the last sample was taken. An exception are precise functions (see precise_functions). This example shows that the exit of the foo function is skipped, because the events were too close together.
      Stack and message samples:
              0.004959 call main<init
              1.016018 call foo<main<init
              2.016375 call foo<main<init
              2.016381 exit init
      
    • outbuffer_size (Number): Size of vfd output buffer in MiB.
    • precise_functions (String): Regular expression defining for which functions the sample_interval should be ignored. User regions and MPI routines are always precise. Precisely sampled functions are marked with an asterisk after the function name. The example (precise_functions="foo") shows how the entry and exit of the foo function are sampled, although the last sample was less than the sample time of 0.005s ago. In contrast to that only the entry event of the bar function was sampled:
      Stack and message samples:
              0.005213 call main<init
              1.014496 call bar<main<init
              1.014504 call foo*<main<init
              1.014505 exit foo*<main<init
              1.014506 exit init
      
  • mpi (Section): This section controls the collection of MPI information during runtime and the printing of the communication profile table.
    • show_table (Boolean): Whether to show the communication profile in the logfile (see Application Profiling#Communication Profile for more information)..
    • log_messages (Boolean): Whether to gather message information for the communication profile (see Application Profiling#Communication Profile) and take samples for the vfd-files (see Application Profiling#Message Samples).
    • only_for_ranks (String): Specifies for which ranks messages should be logged. Only if both of the participating ranks appear in the range list the message is logged. The title of the communication profile table in the logfiles contains the included ranks. Valid values are "none", "all", "a-b", "a,b", or a mix of the latter two. E.g. "0,3" will only log messages involving both ranks 0 and 3:
              1.367125 call MPI_Bcast*<main<init
              1.367127 send in stackID 5 from threadID 0
                       count=1 type=MPI_INT(4Bytes) rate=   0.1238 MiB/s peer=0 tag=-1
              1.367158 send end
              1.367127 send in stackID 5 from threadID 0
                       count=1 type=MPI_INT(4Bytes) rate=   0.1238 MiB/s peer=3 tag=-1
              1.367158 send end
              1.367162 exit MPI_Bcast*<main<init
      
    • show_sync_time (Boolean): If active collective blocking communication is preceeded by a barrier with the same communicator as the communication routine. The time until all ranks arrive at the barrier is measured and included into the runtime profile as MPI_<function_name>_sync. This enables insight into load imbalances which lead to waiting periods due to waiting for communication.
      Runtime profile
      +-------+-----------+-----------+-----------+----------------+-----------+------+
      | Calls | t_excl[s] | t_excl[%] | t_incl[s] |    Function    |  Caller   | STID |
      +-------+-----------+-----------+-----------+----------------+-----------+------+
      |     4 |     2.009 |      36.7 |     2.009 | MPI_Bcast_sync | MPI_Bcast |    1 | 
      |     4 |     0.008 |       0.1 |     2.017 |      MPI_Bcast |      main |    2 |
      ... 
      
    • show_callpath (Boolean): Whether to include a column containing the complete callstack as shown Application Profiling#Global Call Sacks. This can make the tables extremely wide and unreadable, but unnecessitates the need to lookup the StackID in the global call stack list.
      +----------+------------------+------------------+------------------+------------------+------------------+-----------+--------+------+---------------------+
      | Messages | avg send size[B] | avg recv size[B] | avg send BW[B/s] | avg recv BW[B/s] | avg comm time[s] | Function  | Caller | STID |      Callpath       |
      +----------+------------------+------------------+------------------+------------------+------------------+-----------+--------+------+---------------------+
      |        7 |        4.000e+00 |        4.000e+00 |        3.219e+05 |       9.903e-315 |        5.853e-04 | MPI_Bcast |   main |    1 | MPI_Bcast<main<init |
      +----------+------------------+------------------+------------------+------------------+------------------+-----------+--------+------+---------------------+
      
    • sort_table (Section): This section determines how the communication profile table should be sorted.
      • column (String): Specifies which column of the communication profile table should be used to sort the table. Possible values are: "messages" (number of messages), "send_size" (average send message size), "recv_size" (average recv message size), "send_bw" (average send bandwidth), "recv_bw" (average recv bandwidth), "calls" (number of calls), "comm_time" (time spend in communication), "stack_id", and "none".
      • ascending (Boolean): Whether the table will be sorted in ascending, or descending order regarding the selected column.
  • cuda
    • show_table
    • sort_table
      • column
      • ascending
  • hwprof (Section): This section controls the profiling of hardware performance counters. For details see (...)
    • type (String): Specifies one of the three possible types of hardware counter interfaces. Possible values are "Dummy", "PAPI" and "VEPERF". The "Dummy" interface is used for testing purposes. Note that providing a value for this variable is obligatory.
    • active (Boolean): The switch to turn hardware profiling on or off. Turning off hardware profiling can significantly reduce the Vftrace overhead.
    • show_observables (Boolean): Show a table in the logfiles which contains, for each stack ID, the value of the observables defined below.
    • show_counters (Boolean): Show a table in the logfiles which contains, for each stack ID, the value of the raw hardware counter defined below.
    • show_summary (Boolean): Show a table with the summed values of observables and counters, if active.
    • sort_by_column (Integer): The column of the observables table by whose values to sort.
    • default_scenario (String): The full path to another json file which contains hardware counters and observables.
    • counters (List of sections): A list of sections, each one specifying a hardware counter. Each section consists of two values:
      • hwc_name (String): The name of the hardware counter. In case of PAPI, this must be one of the counters identified by "papi_avail" or "papi_native_avail".
      • symbol (String): A symbol which represents this counter in the formulas for the observables. There are pre-defined symbols which cannot be chosen (...)
    • observables (List of sections): A list of sections, each one specifying an observable, i.e. a measurement inferred from the hardware counters. Each section has three values:
      • name (String): The observable's name.
      • formula (String): A formula which computes the observable's value from the hardware counters defined in the corresponding section.
      • unit (String): The observable's unit, to be shown in the logfile table. This value is optional.
Clone this wiki locally