forked from ARM-software/perf-libs-tools
-
Notifications
You must be signed in to change notification settings - Fork 0
License
jmather-sesi/perf-libs-tools
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
================================================================================ perf-libs-tools Copyright 2017-21 Arm Limited All rights reserved ================================================================================ Tools to support Arm Performance Libraries ------------------------------------------ This project provides tools to enable users of HPC applications to understand which routines from Arm Performance Libraries are being called. The main component of this suite is the logging library that produces output to summarize the key information about routines called. A selection of Python scripts for visualizating some of this data is also included. Licensing --------- This project is distributed under an Apache 2.0 license, available in the file LICENSE. All inbound contributions will also be under this same licence. Compiling --------- 1) Ensure that you are using the latest version of the logging library to match your build of Arm Performance Libraries. This version matches 21.0. Note you can build using either GCC or Arm Compiler, however an Arm Performance Libraries or Arm Compiler module must have been loaded to add the correct directory containing 'armpl.h' to your path. 2) From this top level directory type "make". You should now have a "lib/libarmpl-summarylog.so" created. Usage ----- 1) Use LD_PRELOAD to pick-up the newly created logging library: export LD_PRELOAD=$PWD/lib/libarmpl-summarylog.so If your application is itself threaded you should use the library libarmpl_mp-summarylog.so instead as this will enable correct logging from the different threads all calling Arm Performance Libraries functions concurrently. 2) When you are building your application ensure that you are linking in the shared library (e.g. libarmpl_lp64_mp.so) rather than the static library (e.g. libarmpl_lp64_mp.a). 3) Run your application as normal. Output files will be produced in the /tmp/ directory when the program completes. These will be called /tmp/armplsummary_<pid>.apl where <pid> is the process ID. Note an extra line is added to the output of your program reading Arm Performance Libraries output summary stored by default in files named /tmp/armplsummary_16776.apl with the appropriate PID included. The root of this can be modified using the environment variable ARMPL_SUMMARY_FILEROOT. 4) For certain applications we have seen situations where programs do not terminate as expected, for instance if they are killed, crash, or are forked from Python. In these cases it can be useful to have intermediate output files written during the execution. In order to use this functionality use the environment variable ARMPL_SUMMARY_FREQ to set an initial frequency of library calls between each output. By default this frequency grows by 10% each output in order to limit the overhead in cases where the user has little knowledge of the number of cases expected. If using this frequency output, these extra output files will have an additional subscript in the name written, e.g. /tmp/armplsummary_16776_01.apl This is because a terminating program may die mid-write of the output file, hence allowing multiple file names to be iterated around means you can check which was the latest output. Note we have also included the library "lib/libarmpl-logger.so" which is used in the same way, however this gives much more verbose output producing an output line for every single library call. The file root for output here is set using ARMPL_LOGGING_FILEROOT. Output ------ The output files generated list the following information: Column 1 : "Routine:" Column 2 : <routine name> Column 3 : "nCalls:" Column 4 : <number of calls> Column 5 : "Mean_time" Column 6 : <mean time of this call> Column 7 : "nUserCalls:" Column 8 : <number of top level calls> Column 9 : "Mean_user_time:" Column 10 : <mean time of this top-level call> Column 11 : "Inputs:" Column 12 onwards : input parameters to routine, integers then characters That is, at the end of the execution of a program a summary list is produced itemizing each of the routines called, detailing the input cases that it has been called with, and then calculating the mean time taken for those inputs. Only the integer and character parameters are recorded in this output. This is because the input cases of matrix contents should not normally be relevant to understanding the cases used. It does mean certain input parameters, such as ALPHA and BETA for a DGEMM case, are missed, however this is an acceptable situation for the automatic creation of a logging library. Note that we now provide two sets of timing results: inclusive and top-level. Typically users may call a routine such as DPOTRF, but internally they may in turn call many other LAPACK or BLAS functions. Tools ----- In the "tools/" directory are some example scripts to produce graphical summaries of the output produced from the logging library. They are written using the Python package MatPlotLib, and we recommend running the visualization part of that script on a local machine rather than on a remote box. At present we are seeing these as a two-stage process: (1) process the /tmp/armplsummary_<pid>.apl output (2) visualize this output There will be many more useful visualizations that can be done with this data, and we look forward to receiving additional contributions. All tools, at present, are written to be run from the "tools" directory so you need to cd tools before following the instructions below. ---Overall library usage--- ./process_summary.py <input files> This produces information about all library calls made. A high-level summary is returned, folllowed by break-downs of all the calls to each library component (BLAS, LAPACK, FFT). This data is split into output summarising both the total of calls and time spent in the function, and also the number and time spent in this function from just user-called invocations. For FFT calls this data also matches planning and execution stages enabling detailed profliing of fftw_execute() calls with only a pointer. An additional file generated is the BLAS summary data, previously produced by "./process-blas.sh", which can be fed into "./blas_usage.py", see below. Input: One or more /tmp/armplsummary_<pid>.apl files. Output: <text> and /tmp/armpl.blas ---DGEMM calls--- ./process-dgemm.sh <input files> This produces summary information about GEMM calls made in an application. Input: One or more /tmp/armplsummary_<pid>.apl files. Output: /tmp/armpl.dgemm ./heat_dgemm.py This visualizes the data from an armpl.dgemm file. Usage: heat_dgemm.py [-h] [-i INFILE] [-l] Optional arguments: -h, --help Show help message and exit -i INFILE, --input INFILE Input file -l, --legend Show graph legend What does it show? Two rows, each with two heat maps and a pie chart The top row shows the number of calls made. The bottom row shows the time spent. The heat maps show the proportion of calls/time at each problem dimension for matix A (left) and B (right). The pie charts show the proportion of calls/time for the given transpose options for matrices A and B. Input: /tmp/armpl.dgemm Output: A matplotlib window showing usage. ---BLAS calls--- ./process-blas.sh DEPRECATED -- use "./process-summary.py" instead This produces summary information about BLAS calls made in an application. Input: One or more /tmp/armplsummary_<pid>.apl files. Output: /tmp/armpl.blas ./blas_usage.py This visualizes the data from an armpl.blas file generated by 'process_summary.py'. Usage: blas_usage.py [-h] [-l] [-n] [-x] [-i INFILE] Optional arguments: -h, --help Show help message and exit -i INFILE, --input INFILE Input file -l, --legend Show graph legend -n, --normalize Normalize bars to percentages of max -x, --exclude Exclude unrepresented functions What does it show? Two sets of three rows of bar charts: Top set: Number of calls Bottom set: Time spent in these calls For each set: Top row: BLAS level 1 routines Middle row: BLAS level 2 routines Bottom row: BLAS level 3 routines On each row different families of functions are shown, for example GEMM. For each family different bars show the number of routines of: single precision real (red) double precision real (blue) single precision complex (yellow) double precision complex (green) Useful additional option include "-x" to exclude routines that are never called and "-n" to normalise time taken against the most expensive routine. Input: /tmp/armpl.blas. Output: A matplotlib window showing usage. Known issues ------------ * If being used with GCC it is necessary to remove the functions cdotc_, cdotu_, zdotc_ and zdotu_ from src/PROTOTYPES. For convenience these are the first four lines of that file. * The top level call timings do not work when called from within a nested OpenMP parallel regions. Set your code to use only one level of OpenMP. * For certain codes that create enormous numbers of FFTW plans then it may be necessary to prevent trying to match plans with executes. This situation would result in a significant, and worsening, run-time performance of the application. This is not detailed here, but instructions can be made available upon request.
About
No description, website, or topics provided.
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published
Languages
- C 93.9%
- Python 5.8%
- Other 0.3%