
Finalization

Felix Uhl edited this page Nov 30, 2022 · 3 revisions

The vftr_finalize routine is registered with atexit during the initialization of Vftrace, so that it is called right before program termination (see Initialization#Preparing Finalization).

MPI-Finalize and Stack Popping

The first action of the vftr_finalize function is to check whether Vftrace is off or uninitialized. The state off means that vftr_finalize was already called, which can happen in MPI-parallel codes. During finalization Vftrace needs to exchange data between the ranks, which is done through MPI. Therefore, the finalization has to happen before MPI is finalized.

int vftr_MPI_Finalize() {
   // it is necessary to finalize vftrace here, in order to properly communicate stack ids
   // between processes. After MPI_Finalize communication between processes is prohibited
   vftr_finalize();
   return PMPI_Finalize();
}

However, during initialization vftr_finalize was registered with atexit, so it is also executed at program termination. This is intended for serial codes, but in MPI-parallel ones it would execute vftr_finalize a second time. To prevent this, vftr_finalize sets the state to off, which makes the second call return immediately.

void vftr_finalize() {
   if (vftrace.state == off || vftrace.state == uninitialized) {
      // was already finalized
      // Maybe by MPI_Finalize
      // vftr_finalize was already registered by atexit
      // before vftrace knew that this was an MPI-program
      return;
   }
   // update the vftrace state
   vftrace.state = off;
...
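
The interplay of atexit and the state check can be tried out in isolation. The following is a minimal sketch, not Vftrace's code: the demo_ names, the reduced state enum, and the run counter are assumptions made for illustration.

```c
#include <stdlib.h>

// Hypothetical, reduced set of states; Vftrace's real state handling may differ.
typedef enum { demo_uninitialized, demo_on, demo_off } demo_state_t;

static demo_state_t demo_state = demo_uninitialized;
static int demo_finalize_runs = 0; // counts how often the finalization body ran

void demo_finalize(void) {
   if (demo_state == demo_off || demo_state == demo_uninitialized) {
      // already finalized (or never initialized): nothing to do
      return;
   }
   // switch the state first, so any later call returns early
   demo_state = demo_off;
   demo_finalize_runs++;
}

void demo_initialize(void) {
   demo_state = demo_on;
   // run the finalization at program termination as well
   atexit(demo_finalize);
}
```

Calling demo_initialize() once and demo_finalize() twice leaves demo_finalize_runs at 1. The additional call issued by atexit at program exit returns immediately as well.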

If vftr_finalize is called from within MPI_Finalize, the local threadstack (see Threadsafe Stack Tracking) is not empty yet. Therefore, the stack is popped until it contains only the init function.

   // in case finalize was not called from the threadstacks root
   // the threadstack needs to be popped completely
   thread_t *my_thread = vftr_get_my_thread(&(vftrace.process.threadtree));
   threadstack_t *my_threadstack = vftr_get_my_threadstack(my_thread);
   while (my_threadstack->stackID > 0) {
      vftr_function_exit(NULL, NULL);
      my_thread = vftr_get_my_thread(&(vftrace.process.threadtree));
      my_threadstack = vftr_get_my_threadstack(my_thread);
   }

Stack Finalization

In the stack finalization the exclusive times of the stacks are computed. During runtime only inclusive times are measured. The exclusive time of a stack is computed by subtracting the inclusive times of all its callees from its own inclusive time.
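
The subtraction can be sketched as follows. This is an illustration under assumptions, not Vftrace's implementation: demo_stack_t and its fields are hypothetical, and each stack is assumed to store the index of its caller, with -1 marking the root.

```c
#include <stddef.h>

// Hypothetical, simplified stack entry; not Vftrace's actual data layout.
typedef struct {
   int caller;               // index of the calling stack, -1 for the root
   long long time_incl_nsec; // inclusive time, measured at runtime
   long long time_excl_nsec; // exclusive time, filled in at finalization
} demo_stack_t;

// exclusive time = inclusive time - sum of the inclusive times of the direct callees
void demo_compute_exclusive_times(demo_stack_t *stacks, size_t nstacks) {
   // start from the inclusive time of every stack ...
   for (size_t i = 0; i < nstacks; i++) {
      stacks[i].time_excl_nsec = stacks[i].time_incl_nsec;
   }
   // ... and let every callee subtract its inclusive time from its caller
   for (size_t i = 0; i < nstacks; i++) {
      int caller = stacks[i].caller;
      if (caller >= 0) {
         stacks[caller].time_excl_nsec -= stacks[i].time_incl_nsec;
      }
   }
}
```

Only direct callees are subtracted, because the inclusive time of a direct callee already contains the time of everything it calls.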

Rank Collation

In order to write a summarizing logfile for an MPI-parallel program, one rank needs to collect all the stack and profiling information. One challenge is that the stacktree, and thus the stackIDs, can differ between ranks. Simply sending the profiling information with the stackID to a master rank does not work if the master rank has different IDs. Therefore, the first step is to make sure that all ranks agree on which stack gets which ID.

Stack Hashing

In Vftrace the first step is to compute a hash for every stack, because hashes are easier to handle, sort, and compare than strings of different lengths. The function vftr_compute_stack_hashes goes through all stacks, constructs their stack string (e.g. "foo<bar<main<init"), and computes its 64-bit hash (e.g. 7039dea2f67cce8e) as the concatenation of two 32-bit hashes (jenkins-32 (7039dea2) and murmur3-32 (f67cce8e)). Using two fundamentally different hash algorithms reduces the probability of hash collisions. Identical stacks on different ranks produce the same hash. The resulting hash lists are gathered on the master rank in the routine vftr_collate_hashes. The gathered list contains many recurring entries. To make every entry unique, the list is sorted first, which makes it easy to remove duplicates. The sorted and cleaned list is then distributed to all ranks. Now each stack has a shared ID given by the position of its stack-string hash in the sorted hash list. This shared ID is called the global ID.
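
The concatenation and the sort-then-deduplicate step can be sketched as below. The sketch takes the two 32-bit hashes as given and does not reimplement the hash algorithms; the demo_ function names are hypothetical, and the gathering and distribution over MPI are left out.

```c
#include <stdint.h>
#include <stdlib.h>

// Concatenate two 32-bit hashes into one 64-bit hash.
// The first hash ends up in the upper 32 bits, as in the example above.
uint64_t demo_combine_hashes(uint32_t hash_a, uint32_t hash_b) {
   return ((uint64_t)hash_a << 32) | (uint64_t)hash_b;
}

// Three-way comparison of two 64-bit hashes for qsort
static int demo_compare_hashes(const void *a, const void *b) {
   uint64_t ha = *(const uint64_t *)a;
   uint64_t hb = *(const uint64_t *)b;
   return (ha > hb) - (ha < hb);
}

// Sort the list in place and remove duplicates.
// Returns the number of unique hashes, which are stored at the front of the list.
// The index of a hash in the resulting list plays the role of the global ID.
size_t demo_sort_unique_hashes(uint64_t *hashes, size_t nhashes) {
   if (nhashes == 0) return 0;
   qsort(hashes, nhashes, sizeof(uint64_t), demo_compare_hashes);
   size_t nunique = 1;
   for (size_t i = 1; i < nhashes; i++) {
      // after sorting, duplicates are adjacent
      if (hashes[i] != hashes[nunique - 1]) {
         hashes[nunique++] = hashes[i];
      }
   }
   return nunique;
}
```

Because every rank receives the same sorted list, a given hash sits at the same index on every rank, which is what makes that index usable as a shared ID.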

Collate Stacks

Each rank goes through all of its stacks and assigns each one its global ID based on the position in the sorted hash list. Simultaneously, two conversion tables are created, which convert from local to global stack IDs and vice versa. The master rank (0) creates a collated stacktree and transfers all stack and profile information of its locally known stacks into it. In doing so, it creates a list of the global stack IDs for which it has no information. This list of missing global stack IDs is subsequently sent to each rank in turn; the rank compiles the stack information it has available (name and global caller ID) and sends it to the master rank. The master then updates the list of missing stacks and repeats the process with the next rank until all stacks are collated on the master rank.
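
Building the two conversion tables on one rank might look like the following sketch. It assumes that the local hashes are stored per local stack ID and that the global hash list is sorted and free of duplicates; the function name, the use of -1 for "not present", and the return value are assumptions made for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

// Three-way comparison of two 64-bit hashes for bsearch
static int demo_cmp_u64(const void *a, const void *b) {
   uint64_t ha = *(const uint64_t *)a;
   uint64_t hb = *(const uint64_t *)b;
   return (ha > hb) - (ha < hb);
}

// Fill local2global[nlocal] and global2local[nglobal].
// An entry of -1 means that there is no counterpart.
// Returns the number of global IDs that are unknown on this rank.
size_t demo_build_conversion_tables(const uint64_t *local_hashes, size_t nlocal,
                                    const uint64_t *global_hashes, size_t nglobal,
                                    int *local2global, int *global2local) {
   for (size_t gid = 0; gid < nglobal; gid++) {
      global2local[gid] = -1;
   }
   for (size_t lid = 0; lid < nlocal; lid++) {
      local2global[lid] = -1;
      // the global ID is the position of the hash in the sorted list
      const uint64_t *hit = bsearch(&local_hashes[lid], global_hashes, nglobal,
                                    sizeof(uint64_t), demo_cmp_u64);
      if (hit != NULL) {
         int gid = (int)(hit - global_hashes);
         local2global[lid] = gid;
         global2local[gid] = (int)lid;
      }
   }
   size_t nmissing = 0;
   for (size_t gid = 0; gid < nglobal; gid++) {
      if (global2local[gid] < 0) nmissing++;
   }
   return nmissing;
}
```

On the master rank, the entries of global2local that remain -1 correspond to the list of missing global stack IDs described above.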

Collate Profiles

For each profile type a vftr_collate_<type>profiles routine exists. These routines pack their local profiling information into a custom MPI datatype and send it to the master rank, which sums the profiles by global ID. After this step the master rank has all the stack and profile information it needs.
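
The summation on the master rank can be illustrated without the MPI transfer. In this sketch demo_profile_t is a hypothetical record with only two counters, and the received array stands in for the buffer that arrived from one rank; the real profiles and message layout contain more fields.

```c
#include <stddef.h>

// Hypothetical per-stack profile record; the real profiles hold more fields.
typedef struct {
   int gid;                  // global stack ID
   long long calls;          // number of calls
   long long time_excl_nsec; // exclusive time
} demo_profile_t;

// Add the records received from one rank onto the collated profiles.
// collated[] is indexed by global ID; records with an out-of-range ID are skipped.
// Returns the number of records that were accumulated.
size_t demo_accumulate_profiles(demo_profile_t *collated, size_t ncollated,
                                const demo_profile_t *received, size_t nreceived) {
   size_t nused = 0;
   for (size_t i = 0; i < nreceived; i++) {
      int gid = received[i].gid;
      if (gid < 0 || (size_t)gid >= ncollated) continue;
      collated[gid].gid = gid;
      collated[gid].calls += received[i].calls;
      collated[gid].time_excl_nsec += received[i].time_excl_nsec;
      nused++;
   }
   return nused;
}
```

Indexing by global ID is what allows records from different ranks to be added together even though their local stack IDs differ.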

Logfiles

Vftrace writes two types of logfiles:

  1. Summary-logfile, which contains information that was collected from all ranks on the master.
  2. Rank-logfiles, which contain only rank-specific profiling information.

Which information is written to the logfiles is controlled by the supplied config file (see Runtime Control#Config File). An explanation of the individual parts of a logfile is given in Application Profiling#The Logfile.

Sampling Files

In the sampling file finalization the stacktree and the threadtree are written to the vfd-file. Then the header, which contained mostly zeroes so far, is filled in with the information gathered during the application profiling (see VFD File Format). After the file is closed, it is renamed to <basename>_<rank>.vfd. Until then it is named <basename>_<pid>.tmpvfd, because for parallel applications the rank is not known until MPI_Init or MPI_Init_thread has been called, so the final filename is not known when the file is created.
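
The final renaming step can be sketched with the standard C rename function. The helper names are hypothetical, and printing the rank as a plain decimal number is an assumption; the actual formatting of the rank in the filename may differ.

```c
#include <stdio.h>

// Build "<basename>_<rank>.vfd" in buf.
// Returns 0 on success and -1 if the name does not fit into the buffer.
int demo_final_vfd_name(char *buf, size_t bufsize, const char *basename, int rank) {
   int n = snprintf(buf, bufsize, "%s_%d.vfd", basename, rank);
   return (n >= 0 && (size_t)n < bufsize) ? 0 : -1;
}

// Rename the temporary file once the rank is known.
// Returns 0 on success and a nonzero value on failure.
int demo_rename_vfd(const char *tmpname, const char *basename, int rank) {
   char finalname[4096];
   if (demo_final_vfd_name(finalname, sizeof(finalname), basename, rank) != 0) {
      return -1;
   }
   return rename(tmpname, finalname);
}
```

Naming the temporary file after the process ID keeps the files of different processes apart before any rank has been assigned.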