Finalization
The vftr_finalize routine was registered with atexit during the initialization of Vftrace, so that it is called right before program termination (see Initialization#Preparing Finalization).
The first action of the vftr_finalize function is to check whether vftrace is off or uninitialized.
The state off means that vftr_finalize was called before, which can happen in MPI-parallel codes.
Vftrace requires exchange of data between the ranks in the finalization, which is done within the MPI-framework.
Therefore, the finalization has to happen before MPI is finalized.
int vftr_MPI_Finalize() {
// it is necessary to finalize vftrace here, in order to properly communicate stack ids
// between processes. After MPI_Finalize communication between processes is prohibited
vftr_finalize();
return PMPI_Finalize();
}
However, during initialization vftr_finalize was registered with atexit, which causes it to be executed at program termination.
This is intended for serial codes, but in MPI-parallel ones vftr_finalize would be executed a second time.
To prevent this, vftr_finalize sets the state to off on its first call, so that a second call returns immediately.
void vftr_finalize() {
if (vftrace.state == off || vftrace.state == uninitialized) {
// was already finalized
// Maybe by MPI_Finalize
// vftr_finalize was already registered by atexit
// before vftrace knew that this was an MPI-program
return;
}
// update the vftrace state
vftrace.state = off;
...
If vftr_finalize is called from within MPI_Finalize, the local threadstack (see Threadsafe Stack Tracking) is not empty yet.
Therefore, the stack is popped until it contains only the init function.
// in case finalize was not called from the threadstack's root
// the threadstack needs to be popped completely
thread_t *my_thread = vftr_get_my_thread(&(vftrace.process.threadtree));
threadstack_t *my_threadstack = vftr_get_my_threadstack(my_thread);
while (my_threadstack->stackID > 0) {
vftr_function_exit(NULL, NULL);
my_thread = vftr_get_my_thread(&(vftrace.process.threadtree));
my_threadstack = vftr_get_my_threadstack(my_thread);
}
In the stack finalization, the exclusive times of the stacks are computed. During runtime only inclusive times are measured. The exclusive time of a stack is obtained by subtracting the inclusive times of all its callees from its own inclusive time.
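The subtraction can be sketched as follows. This is a minimal illustration, not Vftrace's actual code: the struct `stack_entry_t`, its fields, and the function `compute_exclusive_times` are hypothetical names, and the sketch assumes that "callees" means direct callees, since their inclusive times already contain the time of deeper calls.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical, simplified stack entry (not Vftrace's real data layout).
typedef struct {
    int caller;              // index of the calling stack, -1 for the root
    long long time_incl_ns;  // inclusive time, measured at runtime
    long long time_excl_ns;  // exclusive time, filled in at finalization
} stack_entry_t;

// exclusive time = own inclusive time - sum of direct callees' inclusive times
void compute_exclusive_times(stack_entry_t *stacks, size_t nstacks) {
    assert(stacks != NULL || nstacks == 0);
    // start from the inclusive time ...
    for (size_t i = 0; i < nstacks; i++) {
        stacks[i].time_excl_ns = stacks[i].time_incl_ns;
    }
    // ... and let every stack subtract its inclusive time from its caller
    for (size_t i = 0; i < nstacks; i++) {
        int caller = stacks[i].caller;
        if (caller >= 0) {
            stacks[caller].time_excl_ns -= stacks[i].time_incl_ns;
        }
    }
}
```

For a chain init → main → foo with inclusive times of 100, 60, and 25, this yields exclusive times of 40, 35, and 25.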
In order to write a summarizing logfile for an MPI-parallel program, one rank needs to collect all the stack and profiling information. One challenge is that the stacktree, and thus the stackIDs, can differ between ranks. Simply sending the profiling information together with the stackID to a master rank does not work if the master rank uses different IDs. Therefore, the first step is to make sure that all ranks agree on which stack gets which ID.
In Vftrace this starts with computing a hash for every stack, because hashes are easier to handle, sort, and compare than strings of different lengths.
The function vftr_compute_stack_hashes goes through all stacks, constructs their stack strings (e.g. "foo<bar<main<init"), and computes a 64-bit hash for each (e.g. 7039dea2f67cce8e) as the concatenation of two 32-bit hashes (jenkins-32 (7039dea2) and murmur3-32 (f67cce8e)).
Using two fundamentally different hash algorithms reduces the probability of hash collisions.
Identical stacks on different ranks will produce the same hash.
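The concatenation can be illustrated with the sketch below. It is an assumption-laden example, not Vftrace's implementation: it uses Jenkins' one-at-a-time hash and MurmurHash3 (x86, 32 bit) with seed 0, while the exact Jenkins variant and seed used by Vftrace are not stated here, so the sketch is not expected to reproduce the example value above. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Jenkins one-at-a-time hash (assumed variant).
uint32_t jenkins_oaat32(const char *key, size_t len) {
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++) {
        h += (unsigned char)key[i];
        h += h << 10;
        h ^= h >> 6;
    }
    h += h << 3;
    h ^= h >> 11;
    h += h << 15;
    return h;
}

uint32_t rotl32(uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

// MurmurHash3 x86_32; blocks are read in native byte order.
uint32_t murmur3_32(const char *key, size_t len, uint32_t seed) {
    const uint32_t c1 = 0xcc9e2d51u, c2 = 0x1b873593u;
    uint32_t h = seed;
    size_t nblocks = len / 4;
    for (size_t i = 0; i < nblocks; i++) {
        uint32_t k;
        memcpy(&k, key + 4 * i, 4);
        k *= c1; k = rotl32(k, 15); k *= c2;
        h ^= k; h = rotl32(h, 13); h = h * 5 + 0xe6546b64u;
    }
    const unsigned char *tail = (const unsigned char *)key + 4 * nblocks;
    uint32_t k = 0;
    switch (len & 3) {
        case 3: k ^= (uint32_t)tail[2] << 16; /* fall through */
        case 2: k ^= (uint32_t)tail[1] << 8;  /* fall through */
        case 1: k ^= tail[0];
                k *= c1; k = rotl32(k, 15); k *= c2; h ^= k;
    }
    h ^= (uint32_t)len;
    h ^= h >> 16; h *= 0x85ebca6bu;
    h ^= h >> 13; h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

// First hash goes into the upper 32 bits, second into the lower 32 bits.
uint64_t combine_hashes(uint32_t high, uint32_t low) {
    return ((uint64_t)high << 32) | (uint64_t)low;
}

uint64_t stack_string_hash(const char *stackstr) {
    assert(stackstr != NULL);
    size_t len = strlen(stackstr);
    return combine_hashes(jenkins_oaat32(stackstr, len),
                          murmur3_32(stackstr, len, 0));
}
```

Combining the two example halves 7039dea2 and f67cce8e in this way gives 7039dea2f67cce8e, which matches the layout of the example hash in the text.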
The resulting hash lists are gathered on the master rank in the routine vftr_collate_hashes.
The gathered list of all stacks contains many duplicate entries.
To make every entry unique, the list is sorted first, which places duplicates next to each other and makes them easy to remove.
The sorted and deduplicated list is then distributed to all ranks.
Now each stack has a shared ID, given by the position of its stack-string hash in the sorted hash list.
This shared ID is called the global ID.
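The sort-then-deduplicate step and the lookup of a global ID can be sketched with the C standard library. The function names below are hypothetical and the MPI gather and broadcast are left out; the sketch only shows what happens to the hash list itself.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

// Three-way comparison of two 64-bit hashes for qsort and bsearch.
int compare_hashes(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

// Sorts the list in place, removes duplicates,
// and returns the number of unique hashes that remain at the front.
size_t sort_and_uniq_hashes(uint64_t *hashes, size_t n) {
    assert(hashes != NULL || n == 0);
    if (n == 0) return 0;
    qsort(hashes, n, sizeof(uint64_t), compare_hashes);
    size_t nuniq = 1;
    for (size_t i = 1; i < n; i++) {
        // after sorting, duplicates are adjacent
        if (hashes[i] != hashes[nuniq - 1]) {
            hashes[nuniq++] = hashes[i];
        }
    }
    return nuniq;
}

// The global ID is the position in the sorted unique list; -1 if not found.
int global_id_of_hash(const uint64_t *sorted, size_t n, uint64_t hash) {
    const uint64_t *hit = bsearch(&hash, sorted, n, sizeof(uint64_t),
                                  compare_hashes);
    return hit ? (int)(hit - sorted) : -1;
}
```

Because every rank receives the same sorted list, the same hash maps to the same position, and therefore to the same global ID, on every rank.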
Each rank goes through all of its stacks and assigns each one its global ID based on the position in the sorted hash list. Simultaneously, two conversion tables are created, which convert from local to global stack IDs and vice versa. The master rank (0) creates a collated stacktree and transfers all stack and profile information of its locally known stacks into it. In doing so, it creates a list of the global stack IDs for which it has no stack information. This list of missing global stack IDs is subsequently sent to each rank in turn; the rank compiles the stack information it has available (name and global callerID) and sends it to the master rank. The master then updates the list of missing stacks and repeats the process with the next rank until all stacks are collated on the master rank.
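A possible shape of the two conversion tables is sketched below. The function `build_id_tables` and the table names are hypothetical; the sketch assumes the global hash list is sorted and unique, and it marks global stacks unknown to this rank with -1, which is how a rank (or the master) could derive its list of missing stacks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Fills local2global[lid] and global2local[gid] for one rank.
// global2local[gid] is -1 for stacks this rank does not know.
// Returns the number of global stacks missing on this rank.
size_t build_id_tables(const uint64_t *local_hashes, size_t nlocal,
                       const uint64_t *global_hashes, size_t nglobal,
                       int *local2global, int *global2local) {
    for (size_t gid = 0; gid < nglobal; gid++) {
        global2local[gid] = -1;
    }
    for (size_t lid = 0; lid < nlocal; lid++) {
        // binary search for the first global hash >= the local hash
        size_t lo = 0, hi = nglobal;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (global_hashes[mid] < local_hashes[lid]) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        // every local stack must appear in the collated global list
        assert(lo < nglobal && global_hashes[lo] == local_hashes[lid]);
        local2global[lid] = (int)lo;
        global2local[lo] = (int)lid;
    }
    size_t nmissing = 0;
    for (size_t gid = 0; gid < nglobal; gid++) {
        if (global2local[gid] < 0) nmissing++;
    }
    return nmissing;
}
```

With global hashes {10, 20, 30, 40} and local hashes {30, 10}, the local stacks get the global IDs 2 and 0, and the global IDs 1 and 3 are reported as missing.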
For each profile type a vftr_collate_<type>profiles routine exists.
These routines pack their local profiling information into a custom MPI datatype and send it to the master rank, which sums the profiles by global ID.
The master rank now has all the stack and profile information it needs.
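The summation on the master can be pictured as follows. This is a serial sketch under stated assumptions: `profile_msg_t` is a hypothetical stand-in for the custom MPI datatype, its fields are invented for illustration, and the actual MPI transfer is omitted. Only the accumulation by global ID is shown.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical message layout; the real profile types carry other fields.
typedef struct {
    int gid;                 // global stack ID the values belong to
    long long calls;         // number of calls on the sending rank
    long long time_excl_ns;  // exclusive time on the sending rank
} profile_msg_t;

// Adds the entries received from one rank onto the collated profiles,
// which are indexed by global ID on the master rank.
void accumulate_profiles(profile_msg_t *collated, size_t nglobal,
                         const profile_msg_t *received, size_t nreceived) {
    for (size_t i = 0; i < nreceived; i++) {
        int gid = received[i].gid;
        assert(gid >= 0 && (size_t)gid < nglobal);
        collated[gid].calls += received[i].calls;
        collated[gid].time_excl_ns += received[i].time_excl_ns;
    }
}
```

Since the entries are keyed by global ID, the order in which a rank sends its stacks does not matter.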
Vftrace writes two types of logfiles:
- Summary-logfile, which contains information that was collected from all ranks on the master.
- Rank-logfiles, which contain only rank-specific profiling information.

Which information is written to the logfiles is controlled by the supplied config file (see Runtime Control#Config File). An explanation of the individual parts of a logfile is given in Application Profiling#The Logfile.
In the sampling file finalization, the stacktree and the threadtree are written to the vfd-file.
Then the header, which has contained mostly zeroes so far, is filled in with the information gathered during the application profiling (see VFD File Format).
After the file is closed, it is renamed to <basename>_<rank>.vfd.
Until then it was named <basename>_<pid>.tmpvfd, because for parallel applications the rank is not known until MPI_Init or MPI_Init_thread has been called, so the final filename was not known either.
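The renaming step can be sketched with snprintf and the standard rename function. The helper names `format_vfd_names` and `rename_tmp_vfd` are hypothetical, as is the fixed buffer size; only the two filename patterns are taken from the text above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

// Builds "<basename>_<pid>.tmpvfd" and "<basename>_<rank>.vfd".
// Returns 0 on success, -1 if a name does not fit into len bytes.
int format_vfd_names(const char *basename, long pid, int rank,
                     char *tmpname, char *finalname, size_t len) {
    assert(basename != NULL && strlen(basename) > 0);
    assert(tmpname != NULL && finalname != NULL);
    int n1 = snprintf(tmpname, len, "%s_%ld.tmpvfd", basename, pid);
    int n2 = snprintf(finalname, len, "%s_%d.vfd", basename, rank);
    if (n1 < 0 || n2 < 0 || (size_t)n1 >= len || (size_t)n2 >= len) {
        return -1;
    }
    return 0;
}

// Renames the already closed temporary file to its final name.
// Returns 0 on success, -1 on failure.
int rename_tmp_vfd(const char *basename, long pid, int rank) {
    char tmpname[256];
    char finalname[256];
    if (format_vfd_names(basename, pid, rank, tmpname, finalname,
                         sizeof tmpname) != 0) {
        return -1;
    }
    return rename(tmpname, finalname) == 0 ? 0 : -1;
}
```

For a basename of app, a pid of 4242, and rank 3, the file app_4242.tmpvfd would become app_3.vfd.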