-
Notifications
You must be signed in to change notification settings - Fork 2
Runtime Control
The runtime behavior and output can be controlled using two environment variables and a configuration file.
Vftrace checks for two environment variables upon launch of the application.
Setting VFTR_OFF
deactivates profiling by Vftrace. This option takes precedence over any other settings.
Possible values to switch Vftrace off are: “Yes”, “True”, “1”.
Possible values to switch Vftrace on are: “No”, “False”, “0”.
Default is: “False”
export VFTR_OFF="yes"
VFTR_CONFIG
specifies the path to a config file in JSON format.
The used configuration in JSON format can be found appended to the logfiles (see Application Profiling#Configuration).
A config file with default settings can be found in the manpages for vftrace, or generated with the vftrace_generate_default_config tool (see Tools#vftrace_generate_default_config).
If no config file is given the default config is used.
A complete explanation of the options in the config file is given in the manpages or in the wiki (see Runtime Control#Config File).
To check a config file for correctness and compatibility with vftrace the vftrace_check_config
tool is provided (see Tools#vftrace_check_config).
export VFTR_CONFIG=config.json
The config file defines the runtime behavior of Vftrace and finetunes the contents of the log files.
In order to use a file as a config it needs to be exported with the VFTR_CONFIG
environment variable (see Runtime Control#VFTR_CONFIG).
The used configuration in JSON format, including default values, can be found appended to the logfiles (see Application Profiling#Configuration).
A config file with default settings can be found in the manpages for Vftrace, or generated with the vftrace_generate_default_config tool (see Tools#vftrace_generate_default_config).
If no config file is given the default config is used.
To check a config file for correctness and compatibility with Vftrace the vftrace_check_config
tool is provided (see Tools#vftrace_check_config).
The default config is:
{
"off": false,
"output_directory": ".",
"outfile_basename": null,
"logfile_for_ranks": "none",
"print_config": true,
"strip_module_names": false,
"demangle_cxx": false,
"include_cxx_prelude": false,
"profile_table": {
"show_table": true,
"show_calltime_imbalances": false,
"show_callpath": false,
"show_overhead": false,
"sort_table": {
"column": "time_excl",
"ascending": false
}
},
"name_grouped_profile_table": {
"show_table": false,
"max_stack_ids": 8,
"sort_table": {
"column": "time_excl",
"ascending": false
}
},
"sampling": {
"active": false,
"sample_interval": 0.005000,
"outbuffer_size": 8,
"precise_functions": null
},
"mpi": {
"show_table": true,
"log_messages": true,
"only_for_ranks": "all",
"show_sync_time": false,
"show_callpath": false,
"sort_table": {
"column": "none",
"ascending": false
}
},
"cuda": {
"show_table": true,
"sort_table": {
"column": "time",
"ascending": false
}
},
"hwprof": {
"type": "dummy",
"disable": false,
"show_observables": true,
"show_counters": false,
"show_summary": false,
"sort_by_column": 0,
"default_scenario": "",
"counters": [
],
"observables": [
]
}
}
- off (Boolean): turn on/off vftrace profiling.
- output_directory (String): Directory for log- and vfd-files.
- outfile_basename (String): Basename for log- and vfd-files (_all.log).
Setting it to
null
results in the executable name as basename. - logfile_for_ranks (String): Specifies which ranks get local log-files. Valid values are "none", "all", "a-b", "a,b", or a mix of the latter two. E.g. "2,3,5-8,11,13" will generate local log-files for ranks 2,3,5,6,7,8,11,13.
- print_config (Boolean): Whether to append the used config to the logfile.
- strip_module_names (Boolean): Whether to strip module names from Fortran symbols to make them more readable.
Example:
my_lengthy_module_MOD_my_function
will be stripped tomy_function
. The separator of module and function name depends on the compiler. We check for the most common ones, but in rare cases, when a different separator is used, this option might not work. - demangle_cxx (Boolean): Whether to demangle C++ symbols to make them more readable.
This requires Vftrace to be build with
libiberty
support and the application to be linked to-liberty
. Example:_Z9quicksortIiEviPT
and_Z9quicksortIdEviPT
will both be demangled toquicksort
. - include_cxx_prelude (Boolean): Whether to profile instrumented function that are called before main in C++.
If global static variables are initialized with function calls, or if constructors are used code can be
executed before main was called. In rare cases this can break the stacktree construction in vftrace.
Omitting the prelude leads to cleaner profiles.
Example:
void before_main(void) { return 1; } static int dummy = before_main(); int main() { return 0; }
before_main
is executed beforemain
is called. The default is to delay profiling until main was called, leading to a profile like:Excluding the prelude from the profile can lead to a mismatch in the total runtime and the sum of exclusive times. If the function in the prelude are to be profiled this option can be set toRuntime profile +-------+-----------+-----------+-----------+----------+--------+------+ | Calls | t_excl[s] | t_excl[%] | t_incl[s] | Function | Caller | STID | +-------+-----------+-----------+-----------+----------+--------+------+ | 1 | 0.000 | 100.0 | 0.000 | main | init | 0 | | 0 | 0.000 | 0.0 | 0.000 | init | ---- | 1 | +-------+-----------+-----------+-----------+----------+--------+------+
true
leading to inclusion ofbefore_main
(as well as other prelude related symbols) into the profile:Runtime profile +-------+-----------+-----------+-----------+-------------------------------------------------+-------------------------------------------------+------+ | Calls | t_excl[s] | t_excl[%] | t_incl[s] | Function | Caller | STID | +-------+-----------+-----------+-----------+-------------------------------------------------+-------------------------------------------------+------+ | 1 | 0.000 | 33.1 | 0.000 | _Z41__static_initialization_and_destruction_0ii | _GLOBAL__sub_I__Z11before_mainv | 2 | | 1 | 0.000 | 30.7 | 0.000 | _GLOBAL__sub_I__Z11before_mainv | init | 4 | | 1 | 0.000 | 22.0 | 0.000 | main | init | 1 | | 1 | 0.000 | 14.3 | 0.000 | _Z11before_mainv | _Z41__static_initialization_and_destruction_0ii | 0 | | 0 | 0.000 | 0.0 | 0.000 | init | ---- | 3 | +-------+-----------+-----------+-----------+-------------------------------------------------+-------------------------------------------------+------+
- profile_table (Section): This section controls how the runtime profile table is printed in the logfile.
- show_table (Boolean): Whether to print the runtime profile table.
- show_calltime_imbalances (Boolean): Whether to include columns on the calltime imbalances across ranks.
If a routine is executed on multiple ranks and they show vastly different runtimes it hints as badly distributed load.
The
Imbalances[%]
column shows the largest deviation (positive value = more than average; negative value = less than average) from the average time among all ranks in percent. Theon rank
shows the rank number on which this largest deviation occurred. In the following example the routinework2
might need optimization and better load balancing, as it has to much work, compared to the average rank in this routine.+-------+-----------+-----------+-----------+---------------+---------+--------------+--------+------+ | Calls | t_excl[s] | t_excl[%] | t_incl[s] | Imbalances[%] | on rank | Function | Caller | STID | +-------+-----------+-----------+-----------+---------------+---------+--------------+--------+------+ | 4 | 52.002 | 55.8 | 52.002 | 23.07 | 3 | work2 | main | 1 | | 4 | 40.003 | 42.9 | 40.003 | 0.00 | 1 | work1 | main | 3 | ...
- show_callpath (Boolean): Whether to include a column containing the complete callstack as shown Application Profiling#Global Call Sacks.
This can make the tables extremely wide and unreadable, but unnecessitates the need to lookup the StackID in the global call stack list.
+-------+-----------+-----------+-----------+--------------+--------+------+------------------------+ | Calls | t_excl[s] | t_excl[%] | t_incl[s] | Function | Caller | STID | Callpath | +-------+-----------+-----------+-----------+--------------+--------+------+------------------------+ ... | 1 | 0.000 | 0.0 | 0.000 | foo | main | 3 | foo<main<init | | 1 | 0.000 | 0.0 | 0.000 | bar | main | 6 | bar<main<init | | 1 | 0.000 | 0.0 | 0.000 | foo | bar | 5 | foo<bar<main<init | ...
- show_overhead (Boolean): Whether to include a column containing the function specific Vftrace overhead.
This helps to identify functions where the runtime is small compared to the overhead introduced by Vftrace.
If a function has a high overhead compared to its exclusive time and has a high callcount,
the chances are good, that the pure call overhead (without a profiler) will be significant as well.
To reduce the functions impact on the profile it could be excluded from the instrumentation.
+----------+-----------+-----------+-----------+-------------+--------------+--------+------+ | Calls | t_excl[s] | t_excl[%] | t_incl[s] | overhead[s] | Function | Caller | STID | +----------+-----------+-----------+-----------+-------------+--------------+--------+------+ | 29860703 | 19.763 | 98.1 | 19.763 | 12.897 | fib_rec | main | 1 | ...
- sort_table (Section): This section determines how the runtime profile table should be sorted.
- column (String): Specifies which column of the profile table should be used to sort the table. Possible values are: "time_exlc" (exclusive time), "time_incl" (inclusive time), "calls" (number of calls), "stack_id", "overhead", and "none".
- ascending (Boolean): Whether the table will be sorted in ascending, or descending order regarding the selected column.
- name_grouped_profile_table (Section): This section controls how the name grouped runtime profile table is printed in the logfile.
The runtime profile table has an entry for every unique call stack.
The name grouped profile table has an entry for every unique function name.
Runtime profile +-------+-----------+-----------+-----------+--------------+--------+------+ | Calls | t_excl[s] | t_excl[%] | t_incl[s] | Function | Caller | STID | +-------+-----------+-----------+-----------+--------------+--------+------+ ... | 1 | 0.000 | 0.0 | 0.000 | foo | bar1 | 8 | | 1 | 0.000 | 0.0 | 0.000 | foo | bar4 | 4 | | 1 | 0.000 | 0.0 | 0.000 | foo | bar2 | 0 | | 1 | 0.000 | 0.0 | 0.000 | foo | bar3 | 2 | ...
Runtime profile (Grouped by function name) +-------+-----------+-----------+-----------+--------------+---------+ | Calls | t_excl[s] | t_excl[%] | t_incl[s] | Function | STIDs | +-------+-----------+-----------+-----------+--------------+---------+ ... | 4 | 0.000 | 0.0 | 0.000 | foo | 0,2,4,8 | ...
- show_table (Boolean): Whether to print the name grouped runtime profile table.
- max_stack_ids (Number): Limits the number of stack IDs to be included in the STIDs column.
If the list is truncated it is indicated by three dots.
+-------+-----------+-----------+-----------+--------------+---------+ | Calls | t_excl[s] | t_excl[%] | t_incl[s] | Function | STIDs | +-------+-----------+-----------+-----------+--------------+---------+ ... | 4 | 0.000 | 0.0 | 0.000 | foo | 0,2,... | ...
- sort_table (Section): This section determines how the name grouped runtime profile table should be sorted.
- column (String): Specifies which column of the name grouped profile table should be used to sort the table. Possible values are: "time_exlc" (exclusive time), "time_incl" (inclusive time), "calls" (number of calls), "stack_id", "overhead", and "none".
- ascending (Boolean): Whether the table will be sorted in ascending, or descending order regarding the selected column.
- sampling (Section): This section controls the sampling of function events,
messages, and writing of vfd-files (see Application Profiling#Sampling).
- active (Boolean): Whether to activate sampling and writing of vfd-files.
- sample_interval (Number): Shortest duration between two consecutive samples.
A larger interval helps keeping the file size small, but might miss important information.
Setting this to
0
or negative values enables sampling of every event. Function entry/exit events are ignored until thesample_interval
timer expired since the last sample was taken. An exception are precise functions (seeprecise_functions
). This example shows that the exit of the foo function is skipped, because the events were too close together.Stack and message samples: 0.004959 call main<init 1.016018 call foo<main<init 2.016375 call foo<main<init 2.016381 exit init
- outbuffer_size (Number): Size of vfd output buffer in MiB.
- precise_functions (String): Regular expression defining for which functions
the sample_interval should be ignored.
User regions and MPI routines are always precise.
Precisely sampled functions are marked with an asterisk after the function name.
The example (
precise_functions="foo"
) shows how the entry and exit of thefoo
function are sampled, although the last sample was less than the sample time of 0.005s ago. In contrast to that only the entry event of the bar function was sampled:Stack and message samples: 0.005213 call main<init 1.014496 call bar<main<init 1.014504 call foo*<main<init 1.014505 exit foo*<main<init 1.014506 exit init
- mpi (Section): This section controls the collection of MPI information during runtime
and the printing of the communication profile table.
- show_table (Boolean): Whether to show the communication profile in the logfile (see Application Profiling#Communication Profile for more information)..
- log_messages (Boolean): Whether to gather message information for the communication profile (see Application Profiling#Communication Profile) and take samples for the vfd-files (see Application Profiling#Message Samples).
- only_for_ranks (String): Specifies for which ranks messages should be logged.
Only if both of the participating ranks appear in the range list the message
is logged.
The title of the communication profile table in the logfiles contains the included ranks.
Valid values are "none", "all", "a-b", "a,b", or a mix of the latter two.
E.g. "0,3" will only log messages involving both ranks 0 and 3:
1.367125 call MPI_Bcast*<main<init 1.367127 send in stackID 5 from threadID 0 count=1 type=MPI_INT(4Bytes) rate= 0.1238 MiB/s peer=0 tag=-1 1.367158 send end 1.367127 send in stackID 5 from threadID 0 count=1 type=MPI_INT(4Bytes) rate= 0.1238 MiB/s peer=3 tag=-1 1.367158 send end 1.367162 exit MPI_Bcast*<main<init
- show_sync_time (Boolean): If active collective blocking communication is preceeded
by a barrier with the same communicator as the communication routine.
The time until all ranks arrive at the barrier is measured
and included into the runtime profile as
MPI_<function_name>_sync
. This enables insight into load imbalances which lead to waiting periods due to waiting for communication.Runtime profile +-------+-----------+-----------+-----------+----------------+-----------+------+ | Calls | t_excl[s] | t_excl[%] | t_incl[s] | Function | Caller | STID | +-------+-----------+-----------+-----------+----------------+-----------+------+ | 4 | 2.009 | 36.7 | 2.009 | MPI_Bcast_sync | MPI_Bcast | 1 | | 4 | 0.008 | 0.1 | 2.017 | MPI_Bcast | main | 2 | ...
- show_callpath (Boolean): Whether to include a column containing the complete
callstack as shown Application Profiling#Global Call Sacks.
This can make the tables extremely wide and unreadable, but
unnecessitates the need to lookup the StackID in the global call stack list.
+----------+------------------+------------------+------------------+------------------+------------------+-----------+--------+------+---------------------+ | Messages | avg send size[B] | avg recv size[B] | avg send BW[B/s] | avg recv BW[B/s] | avg comm time[s] | Function | Caller | STID | Callpath | +----------+------------------+------------------+------------------+------------------+------------------+-----------+--------+------+---------------------+ | 7 | 4.000e+00 | 4.000e+00 | 3.219e+05 | 9.903e-315 | 5.853e-04 | MPI_Bcast | main | 1 | MPI_Bcast<main<init | +----------+------------------+------------------+------------------+------------------+------------------+-----------+--------+------+---------------------+
- sort_table (Section): This section determines how the communication profile table should be sorted.
- column (String): Specifies which column of the communication profile table should be used to sort the table. Possible values are: "messages" (number of messages), "send_size" (average send message size), "recv_size" (average recv message size), "send_bw" (average send bandwidth), "recv_bw" (average recv bandwidth), "calls" (number of calls), "comm_time" (time spend in communication), "stack_id", and "none".
- ascending (Boolean): Whether the table will be sorted in ascending, or descending order regarding the selected column.
- cuda
- show_table
- sort_table
- column
- ascending
- hwprof (Section): This section controls the profiling of hardware performance counters. For details see (...)
- active