\documentclass[a4paper]{jpconf}
\usepackage{graphicx}
\usepackage{float}
%======================================================================================================================
% DELETE FROM RELEASE:
%======================================================================================================================
\usepackage{hyperref}
\hypersetup{
colorlinks = true,
urlcolor = blue,
pdfauthor = {Alexander Mazurov},
pdfkeywords = {scientific computing, system and distributing programming},
pdftitle = {Advanced Modular Software Performance Monitoring},
pdfsubject = {Advanced Modular Software Performance Monitoring},
pdfpagemode = UseNone
}
\usepackage{lineno}
% Customize page headers
\usepackage{fancyhdr}
\setlength{\headheight}{15.2pt}
\pagestyle{fancyplain}
\fancyhf{}
\lhead{\fancyplain{}{{\bf Version 1.10} (\today)}}
\rhead{\fancyplain{}{\href{https://github.com/mazurov/chep2012/compare/v1.9...v1.10}{https://github.com/mazurov/chep2012/compare/v1.9...v1.10}}}
\linenumbers
%======================================================================================================================
\begin{document}
\title{Advanced Modular Software Performance Monitoring }
\author{A Mazurov\ad{1,2,3} and B Couturier\ad{1}}
%======================================================================================================================
% DELETE FROM RELEASE:
%======================================================================================================================
\thispagestyle{fancy}
%======================================================================================================================
\address{\ad{1} CERN, European Organization for Nuclear Research, Geneva, Switzerland}
\address{\ad{2} University of Ferrara, Ferrara, Italy}
\address{\ad{3} Institute for Nuclear Research, Troitsk, Russia}
\ead{alexander.mazurov@cern.ch}
\newcommand\iamp{{Intel\textsuperscript{\textregistered} VTune\texttrademark Amplifier XE} }
\newcommand\amp{{VTune\texttrademark Amplifier XE} }
\newcommand\intel{{Intel\textsuperscript{\textregistered}} }
\newcommand\google{{Google\textsuperscript{\textregistered} }}
\begin{abstract}
The LHCb software is based on the Gaudi framework, on top of which are built several large and complex software
applications. As the LHCb experiment is now in the active phase of collecting and analyzing data, performance problems
arise in various parts of the software, from the High Level Trigger (HLT) programs to data analysis frameworks.
It is not easy to find hotspots in the code: only specialized tools can help to understand where CPU or memory usage
is unreasonable. Many performance analysis tools exist, but the main problem is that they present reports in
terms of class and function names, and such information is usually not very useful: the majority of algorithm
developers use the Gaudi framework abstractions and usually do not know about the functions that lie at the lower level.
We show a new approach which adds to performance reports a higher abstraction level based on knowledge of
the framework architecture and run-time object properties. A set of profiling tools (based on \iamp) and visualization
interfaces has been developed and deployed.
\end{abstract}
\section{Introduction}
In LHCb, as in all High Energy Physics (HEP) experiments, complex software is used to process the data recorded
by the detectors. Performance is an essential characteristic of this software, especially when dealing with
the High Level Trigger (HLT): its role is to filter events coming from the hardware based trigger in order to identify
those with interesting physics, and to write them to disk in real-time. The number of events processed per
second (event rate) is therefore one of the crucial characteristics of the HLT, as it has to keep up with the data rate
delivered by the hardware triggers ($10^6$ events per second) in order to avoid data loss. To reach such high
throughput, the processing is performed on many nodes in parallel by highly optimized algorithms. In order to optimize
the algorithms, and to keep track of the evolution of the event rate when changes are applied to the HLT, it is
necessary to measure the overall performance of the code but also to understand which algorithms are costly in terms of
CPU and memory.
In this paper our focus is on the analysis of frequency and duration of function calls in algorithms (this type of
analysis is commonly named CPU profiling). Profiling helps to identify parts of the code that take a long time to
execute. In performance analysis, those places are often referred to as hotspots. Obviously, hotspots affect the event
rate of event processing software. So, one of the main goals of profiling HEP software is to point out to application
developers the places in the code that need to be tuned to increase the event rate.
The first study on CPU profiling at LHCb was carried out by Daniele Francesco Kruse and Karol Kruzelecki in their work
“Modular Software Performance Monitoring” \cite{modular}. They conclude that instead of profiling
the application as a whole it would be better to divide it into modules and profile those modules separately.
In general terms, a module can be defined as an application’s structural component that is used to group logically
related functions. Grouping performance results by module allows a better insight into where the performance
issues are coming from. Since each module is under the responsibility of a specific developer, the provided reports
can be delivered to the right person. For example, in the Gaudi~\cite{gaudi}
framework at LHCb each algorithm used for event processing is such a module which can be profiled independently.
More details on Gaudi will be given in Section 3.1.
This design principle was first implemented in a set of profiling tools based on the perfmon2~\cite{perfmon2} library.
These tools have several drawbacks. First, the analysis reports they produce are expressed in hardware event counter metrics.
Only developers with a good knowledge of the hardware can read and interpret those reports. Since most
developers in HEP are physicists, the number of users of those tools is very low. Second, since the current tools
do not use the counter multiplexing feature of the perfmon2 library, the target program has to be run several times
to collect all the required hardware counters. As a result, the profiling time increases significantly.
To fill some of the gaps of the previous tools we created the Gaudi Intel Profiling Auditor. This profiling tool
uses the same module principle that was described in \cite{modular}, but is based on \iamp~\cite{vtune}.
\amp is a newer performance profiling tool that provides better functionality than the perfmon2 library.
In the next section we briefly review \iamp. Then we discuss how the Gaudi Intel Profiling Auditor integrates \amp into
the Gaudi framework and show examples of using those tools to profile LHCb's HLT.
\section{\iamp}
In this section we give an overview of the modern performance profiling tool \iamp, describe its basic features and analysis
reports.
\subsection{Overview}
\iamp is a commercial application for software performance analysis that is available for both Linux and Windows
operating systems. \amp belongs to the runtime instrumentation class of profiling tools. This means that the code is
instrumented before execution and the program is fully supervised by the tool. A target application can be profiled without
any modification of the codebase.
\iamp offers several kinds of code analysis, including hotspot analysis, concurrency analysis, and locks and waits analysis.
In our tools we use the hotspot analysis based on the user-mode sampling feature of \amp. User-mode sampling
profiles a program by inspecting the call stack of the running program and produces one simple metric: the amount of time
spent in each function.
The amount of time spent in a function (CPU time) is measured by periodically interrupting the process and collecting samples of
the call stacks of all active threads. The CPU time of a function is estimated from the number of samples in which the function appears at
the top of a call stack. Stack sampling is therefore a statistical method and does not provide a 100\% accurate
measurement. However, for a large number of samples the sampling error does not have a serious impact on the accuracy of
the analysis. More details about sampling accuracy are given in Section 2.2.
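As a minimal illustration of this counting scheme (a sketch of the general idea, not the \amp implementation), the following Python fragment estimates the CPU time per function from a list of sampled call stacks. The function name \texttt{estimate\_cpu\_time}, the sample data and the fixed 10 ms interval are assumptions made only for this example.
\begin{verbatim}
from collections import Counter


def estimate_cpu_time(stacks, interval_s=0.010):
    """Estimate self CPU time per function from sampled call stacks.

    Each stack is a list of function names, outermost caller first,
    so the last element is the function at the top of the stack.
    """
    hits = Counter(stack[-1] for stack in stacks if stack)
    return {func: count * interval_s for func, count in hits.items()}


# Hypothetical samples, assumed to be taken every 10 ms.
samples = [
    ["main", "execute", "fit_track"],
    ["main", "execute", "fit_track"],
    ["main", "execute", "decode"],
    ["main", "execute"],
]
cpu_time = estimate_cpu_time(samples)
for func, seconds in sorted(cpu_time.items(), key=lambda kv: -kv[1]):
    print(f"{func:10s} {seconds:.3f} s")
\end{verbatim}
Only the function at the top of each stack is charged with the sample, so a caller such as \texttt{main} receives no self time even though it is present in every stack.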
\amp also supports hardware event-based sampling and provides advanced metrics based on the event counters inside
the processor. Reports that use those metrics require knowledge of the hardware architecture, unlike the user-mode
sampling reports, which can be understood by any application developer. Furthermore, while user-mode sampling can
be performed on any 32-bit or 64-bit x86 based machine, hardware event-based sampling targets only specific
\intel microarchitectures and requires a special driver to be installed in the operating system. The advantage of
hardware event-based sampling is that it can be used for fine tuning of algorithms in places where user-mode
sampling cannot point out the reasons for a hotspot.
The goal of our profiling tool is to provide analysis reports to a wider audience of software developers and, therefore,
for the implementation we chose user-mode sampling over hardware event-based sampling.
\subsection{Sampling interval}
The sampling interval is an important parameter of the user-mode sampling method. It affects both the accuracy of the results and
the total profiling time. \intel recommends using a 10 ms interval. With this value the average sampling overhead
is about 5\% in most applications. The minimum sampling interval depends on the operating system. For example, 10 ms
is the minimum for the old Linux 2.4 kernel, whereas 1 ms is the minimum for modern Linux kernels ($\ge$ 2.6).
To determine an appropriate sampling interval, consider the duration of the collection, the speed of your processors,
and the amount of software activity. For example, if the duration of sampling time is more than 10 minutes,
consider increasing the sampling interval to 50 milliseconds. This reduces the number of interrupts and
the number of samples collected and written to disk. The smaller the sampling interval, the larger the number of samples.
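To make this trade-off concrete, consider a collection of duration $T$ sampled with interval $\Delta t$. The expected number of samples per thread is approximately
\[
N \approx \frac{T}{\Delta t}, \qquad
N_{10\,\mathrm{ms}} = \frac{600\,\mathrm{s}}{0.01\,\mathrm{s}} = 6\times10^{4}, \qquad
N_{50\,\mathrm{ms}} = \frac{600\,\mathrm{s}}{0.05\,\mathrm{s}} = 1.2\times10^{4}
\]
for a 10 minute run. If we further assume that the samples are statistically independent (a simplification of ours, not a statement from the \amp documentation), a function that is found at the top of the stack in $n$ samples has a relative statistical uncertainty of roughly
\[
\frac{\sigma_n}{n} \approx \frac{1}{\sqrt{n}} .
\]
Under this assumption, a function that takes 1\% of the CPU time of the 10 minute run is seen in about 600 samples at 10 ms (about 4\% uncertainty) and in about 120 samples at 50 ms (about 9\% uncertainty).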
\subsection{Tools}
\amp has two major interfaces: a command-line tool {\it amplxe-cl} and a Graphical User Interface
tool {\it amplxe-gui}. {\it Amplxe-gui} mainly serves to present analysis results, but can also be used as
a wrapper around the command-line tool. {\it Amplxe-cl} is used to execute the profiling supervisor with appropriate
parameters. The second important function of {\it amplxe-cl} is to export CPU usage reports to the CSV text format.
This feature makes the collected data usable not only inside \amp, but also in external user applications.
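An external application could read such an exported report along the following lines. This is only a sketch: the column names \texttt{Function}, \texttt{CPU Time} and \texttt{Module}, the comma delimiter and the sample rows are assumptions, and the header of a real export should be checked before reusing the code.
\begin{verbatim}
import csv
import io

# Hypothetical excerpt of an exported report; real column names may differ.
EXPORTED = """Function,CPU Time,Module
fit_track,12.5,libTrackFit.so
decode,4.25,libDecoder.so
operator new,3.0,libstdc++.so.6
"""


def load_report(text):
    """Return a list of (function, cpu_seconds, module) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["Function"], float(row["CPU Time"]), row["Module"])
            for row in reader]


report = load_report(EXPORTED)
total = sum(seconds for _, seconds, _ in report)
for func, seconds, module in report:
    share = 100 * seconds / total
    print(f"{func:15s} {seconds:6.2f} s {share:5.1f}% {module}")
\end{verbatim}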
\subsection{Profiling reports}
In this section we review the essential profiling reports that are available in \amp. These reports can be obtained either
from {\it amplxe-gui} or from the {\it amplxe-cl} tool, but for brevity we present only GUI screenshots.
A report of functions ordered by CPU time is the basic report of almost all performance profilers (Figure~\ref{fig01}).
\begin{figure}[H]
\begin{minipage}{\textwidth}
\includegraphics[width=\textwidth]{figs/fig01.png}
\caption{\label{fig01}Function CPU Time report. The first column contains function names, the second column shows the CPU
time and the last column contains the names of the shared libraries where the functions are defined.}
\end{minipage}
\end{figure}
\amp provides many grouping options:
\begin{figure}[H]
\begin{minipage}{\textwidth}
\includegraphics[width=\textwidth]{figs/fig02.png}
\caption{\label{fig02}Various grouping options.}
\end{minipage}
\end{figure}
Example of grouping by shared library:
\begin{figure}[H]
\begin{minipage}{\textwidth}
\includegraphics[width=\textwidth]{figs/fig03.png}
\caption{\label{fig03}Shared libraries CPU time report. The first column contains the name of the shared library.}
\end{minipage}
\end{figure}
A distinctive feature of \amp is the ability to filter data based on a selection in the timeline. This feature
does not exist in other popular profilers:
\begin{figure}[H]
\begin{minipage}{\textwidth}
\includegraphics[]{figs/fig04.png}
\caption{\label{fig04}Filter data on a selection in timeline.}
\end{minipage}
\end{figure}
A report of CPU usage by source line can be created if the target application was compiled with debug symbols:
\begin{figure}[H]
\begin{minipage}{\textwidth}
\includegraphics[width=\textwidth]{figs/fig05.png}
\caption{\label{fig05}CPU time usage by code source line.}
\end{minipage}
\end{figure}
\subsection{Detecting code dependency}
Besides finding hotspots, another useful function of the profiling tool is to reveal code dependencies.
HEP applications usually have many lines of code and have been developed by many people over a long period of time.
Therefore, determining the relations between parts of the code is very difficult. Since \amp has a top-down tree report of
function calls (Figure~\ref{fig06}), we can determine the code dependencies in the application and see the CPU usage along the
call chain.
\begin{figure}[H]
\begin{minipage}{\textwidth}
\includegraphics[width=\textwidth]{figs/fig06.png}
\caption{\label{fig06}Top-down tree report.}
\end{minipage}
\end{figure}
\section{Gaudi Intel Profiling Auditor}
In the previous section we showed that \iamp is a powerful performance profiling tool. What are the disadvantages of the tool?
Why do we need to implement something~else? In this section we answer those questions.
First, we describe the Gaudi framework and then show how the Gaudi Intel Profiling Auditor can enhance \amp reports.
\subsection{Gaudi}
\subsubsection{Overview.}
Gaudi is a C++ software framework used to build event data processing applications using a set of standard components.
Gaudi is a core framework used by several HEP experiments, in particular LHCb and ATLAS at the LHC.
All of the event processing applications, including simulation, reconstruction, high-level trigger and analysis,
are based on this framework. By design, the framework decouples the objects describing the data from those
implementing the algorithms. Thanks to this design, developers can concentrate on physics related tasks
in algorithms and usually do not need to care about other parts of the framework.
\begin{figure}[H]
\begin{minipage}{\textwidth}
\includegraphics[width=\textwidth]{figs/fig07.png}
\caption{\label{fig07}Gaudi Architecture (as described in~\cite{gaudi}). Applications are made by composing sequences of Algorithms and adding
specific Services and Tools.}
\end{minipage}
\end{figure}
The Gaudi framework is a highly customizable framework. Any component of the system can be configured by user options.
\subsubsection{Gaudi Auditors.}
The Application Manager is one of the major components of Gaudi. It takes care of instantiating and
calling algorithms. A supplement to this component is the Auditor Service, which makes it possible to add auditors to a
Gaudi application. An auditor is a set of user functions that are called on certain workflow events in the Manager.
For example, we can add a custom action that is called when the Manager is about to execute an algorithm or when
an algorithm has finished. There are many different event types and we can add as many auditors as needed.
In other frameworks and programming languages, functions of this type are often referred to as callback functions.
In the following section we show how an auditor can be used to build a profiling tool.
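The callback pattern behind auditors can be illustrated with a small toy model. The classes \texttt{RecordingAuditor} and \texttt{ToyManager} and their method names are hypothetical and are not part of the Gaudi API; the sketch only shows how a manager can notify registered auditors before and after each algorithm runs.
\begin{verbatim}
class RecordingAuditor:
    """Toy auditor: records when each algorithm starts and finishes."""

    def __init__(self):
        self.log = []

    def before_execute(self, name):
        self.log.append(("start", name))

    def after_execute(self, name):
        self.log.append(("end", name))


class ToyManager:
    """Stand-in for an application manager that notifies its auditors."""

    def __init__(self, algorithms, auditors):
        self.algorithms = algorithms  # list of (name, callable) pairs
        self.auditors = auditors

    def run_event(self, event):
        for name, algorithm in self.algorithms:
            for auditor in self.auditors:
                auditor.before_execute(name)
            try:
                algorithm(event)
            finally:
                # Notify auditors even if the algorithm raises.
                for auditor in self.auditors:
                    auditor.after_execute(name)


auditor = RecordingAuditor()
manager = ToyManager(
    [("Decode", lambda event: None), ("Fit", lambda event: None)],
    [auditor],
)
manager.run_event({"id": 1})
print(auditor.log)
\end{verbatim}
The algorithms themselves are not modified: all bookkeeping lives in the auditor, which is the property that makes this mechanism attractive for profiling.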
\subsection{Profiling Auditor}
\subsubsection{Objectives.}
A Gaudi application can be profiled by \amp without any modification of the codebase. The tool can collect data
about CPU consumption by code line, function, class, shared library and thread, but it has one disadvantage:
\amp knows nothing about the algorithms of the Gaudi framework. However, algorithms are the central point of any framework
application, since all major event processing occurs there. In principle, the task of framework users is just to write
algorithms that solve a problem and usually nothing more. So, if the profiler could generate a report that groups
the CPU usage of functions by algorithm, application developers could look at the profiling results
from a new point of view. This point of view can help to reveal previously invisible hotspots.
In order to provide such a report the Gaudi Intel Profiling Auditor was developed.
\subsubsection{User API of \iamp.}
Each Gaudi algorithm has a name that is assigned to it at run-time. \amp, in turn, is supplied with
a C library that makes it possible to import those names into the target report. In order to use the library from user applications,
a public User API is provided in \amp. The API makes it possible to control the data collection process and to set marks during
the execution of the code. The ability to mark code regions at run-time is the key feature
of our new profiling tool, because the CPU usage in the region between an algorithm's start and finish points is what we
need for a report that groups functions by algorithm.
The Event API is the part of the User API that is in charge of marking.
\begin{description}
\item[\_\_itt\_event \_\_itt\_event\_create(const \_\_itt\_char *name, int namelen);] \hfill \\
Create a user event type with the specified name. This API returns a handle to the user event type that should be
passed into the following APIs as a parameter. The namelen parameter refers to the number of characters,
not the number of bytes.
\item[int \_\_itt\_event\_start( \_\_itt\_event event );] \hfill \\
Call this API with an already created user event handle to register an instance of that event. This event appears
in the Timeline panel display as a tick mark.
\item[int \_\_itt\_event\_end( \_\_itt\_event event );] \hfill \\
Call this API following a call to \_\_itt\_event\_start() to show the user event as a tick mark with a duration line
from start to end. If this API is not called, the user event appears in the Timeline pane as a single tick mark.
\end{description}
\subsubsection{Implementation.}
An auditor is a suitable component for implementing the required profiling tool. With it, we do not need to modify
the code of the algorithms and only need to write two callback functions: one called at algorithm start and one at algorithm finish. In order to generate
the target report those functions need to call the Event API functions of \amp.
Such an auditor was created and named the Gaudi Intel Profiling Auditor. It was deployed in the GaudiProfiling
package of the Gaudi framework as a shared library. Below we show how this profiling tool marks regions and what reports
can be generated.
Gaudi has a special type of algorithm, the Sequence. Each instance of Sequence can execute other algorithms or
sequences. So, an application's event loop can have not only a flat but also a tree structure. Moreover, the same
algorithm instance can occur in different sequences. Therefore, it was decided that the region between an algorithm's start
and finish should be marked by a branch identifier. In this way, we get more detailed information about the usage of
the algorithm in the application. A branch identifier is constructed from the algorithm name and the names of its parents
in the sequence tree. For example, let's profile an application that has the following sequence tree:
\begin{verbatim}
Hlt
HltDecisionSequence
Hlt1
Hlt1DiMuonHighMass
Hlt1DiMuonHighMassFilterSequence
Hlt1DiMuonHighMassStreamer
FastVeloHlt
MuonRec
Velo2CandidatesDiMuonHighMass
GECLooseUnit
createITLiteClusters
createVeloLiteClusters
Hlt1DiMuonHighMassL0DUFilterSequence
L0DUFromRaw
Hlt1DiMuonHighMassL0DUFilter
\end{verbatim}
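A branch identifier of this kind can be built by keeping a stack of the algorithms that are currently executing. The sketch below illustrates the idea in Python; it is not the C++ code of the auditor, and the class \texttt{BranchTracker} with its \texttt{enter} and \texttt{leave} methods is hypothetical.
\begin{verbatim}
class BranchTracker:
    """Keeps the chain of running algorithms and derives branch identifiers."""

    def __init__(self, separator=" "):
        self.separator = separator
        self._stack = []

    def enter(self, name):
        """Call when an algorithm or sequence starts; returns its branch id."""
        self._stack.append(name)
        return self.separator.join(self._stack)

    def leave(self, name):
        """Call when the algorithm or sequence finishes."""
        if not self._stack or self._stack[-1] != name:
            raise RuntimeError(f"unbalanced leave for {name!r}")
        self._stack.pop()


tracker = BranchTracker()
for parent in ("Hlt", "HltDecisionSequence", "Hlt1"):
    tracker.enter(parent)
branch = tracker.enter("Hlt1DiMuonHighMass")
print(branch)
tracker.leave("Hlt1DiMuonHighMass")
\end{verbatim}
Because the identifier contains the whole chain of parents, the same algorithm instance executed from two different sequences yields two distinct identifiers.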
In \amp the report that uses the information on marked regions can be obtained by choosing the
``Task Type / Function / Call Stack'' grouping option, as seen in Figure~\ref{fig08}.
\begin{figure}[H]
\begin{minipage}{\textwidth}
\includegraphics[width=\textwidth]{figs/fig08.png}
\caption{\label{fig08}Group and order CPU usage by branch identifier.}
\end{minipage}
\end{figure}
For example, the selected branch identifier {\it ``Hlt HltDecisionSequence Hlt1 Hlt1DiMuonHighMass Hlt1DiMuonHighMassFilterSequence Hlt1DiMuonHighMassStreamer''} in the report in Figure~\ref{fig08} is constructed from the names of the algorithms that were executing
when the \amp supervisor sampled the call stack. The algorithm names in the branch are separated by spaces.
For each branch we can see the CPU usage by function (Figure~\ref{fig09}):
\begin{figure}[H]
\begin{minipage}{\textwidth}
\includegraphics[width=\textwidth]{figs/fig09.png}
\caption{\label{fig09}CPU usage by function within a selected branch.}
\end{minipage}
\end{figure}
In Figure~\ref{fig09} we see the CPU usage of the functions of the algorithm {\it Hlt1DiMuonHighMassStreamer} in the branch
{\it ``Hlt HltDecisionSequence Hlt1 Hlt1DiMuonHighMass Hlt1DiMuonHighMassFilterSequence''}.
As can be observed, the main goal was achieved: we get a report that groups the CPU usage of functions by algorithm.
The next step is for application developers to interpret the profiling results and, if needed, to tune the algorithms.
In addition to the reports on algorithms, the Gaudi Intel Profiling Auditor provides options that make it possible to skip
unimportant regions of the code during profiling. Information about functions in those regions is not collected and,
as a result, we get clearer reports and a shorter total profiling time. For example, time critical
processing usually happens in the event loop, so the initialization and finalization phases are not interesting for developers.
For this reason, the auditor has options that trigger the start of profiling on the first event in the event loop
and stop it after the last event.
\subsubsection{Usage example.}
For user convenience the command-line tool {\it intelprofiler}, a wrapper around {\it amplxe-cl}, was created.
It takes care of initializing the environment and setting the common profiling parameters.
For example, we can start profiling by executing a simple shell command:
\begin{verbatim}
$> intelprofiler -o /path/to/collected/data/ gaudi_program.py
\end{verbatim}
where the option -o points to the directory in which the collected data are stored.
Generally a Gaudi program is configured from option files that are written in Python.
To enable the profiler we need to add at least the following three lines:
\begin{verbatim}
from Configurables import AuditorSvc, IntelProfilerAuditor
profiler = IntelProfilerAuditor()
AuditorSvc().Auditors += [profiler]
\end{verbatim}
The first line imports the auditor configurable, the second instantiates the profiler and the third adds the auditor
to the Auditor Service component.
To skip unnecessary events in the event loop, add the following lines:
\begin{verbatim}
profiler.StartFromEventN = 5000
profiler.StopAtEventN = 15000
\end{verbatim}
where we collect profiling data only for events in the range between 5000 and 15000.
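The effect of such an event window can be sketched as follows. The class \texttt{EventWindow} is hypothetical, and whether the real options treat the two bounds as inclusive or exclusive is an assumption that should be checked against the auditor itself.
\begin{verbatim}
class EventWindow:
    """Decides whether collection should be active for a given event."""

    def __init__(self, start_from, stop_at):
        self.start_from = start_from
        self.stop_at = stop_at

    def is_active(self, event_number):
        # Assumption: the start bound is inclusive, the stop bound exclusive.
        return self.start_from <= event_number < self.stop_at


window = EventWindow(start_from=5000, stop_at=15000)
profiled = sum(1 for n in range(1, 20001) if window.is_active(n))
print(profiled)
\end{verbatim}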
From all the above we can see that the Gaudi Intel Profiling Auditor gives application developers full control over the
profiling process.
\section{HLT Profiling Examples}
In the previous section we demonstrated how the Gaudi Intel Profiling Auditor can assist in profiling Gaudi applications.
The original motivation for creating this auditor was the profiling of the HLT applications of the LHCb experiment.
As stated in the introduction, trigger programs are the most sensitive to the event processing time.
Therefore, performance profiling is an essential tool in the hands of trigger application developers.
In this section we show three examples of using \amp and the Gaudi Intel Profiling Auditor to profile the Moore
application, a Gaudi based HLT framework at LHCb.
\subsection{Memory Allocation Functions}
In the first example we profile a Moore program twice. The first time the program was executed with the standard memory
allocation function {\it operator new} from the libstdc++ library:
\begin{figure}[H]
\begin{minipage}{\textwidth}
\includegraphics[width=\textwidth]{figs/fig10.png}
\caption{\label{fig10}Hotspot functions in the Moore application with the standard memory allocation function.}
\end{minipage}
\end{figure}
and the second time it was executed with the memory allocation function {\it tc\_new} from the tcmalloc library
developed by \google~\cite{perftools}:
~
\begin{figure}[H]
\begin{minipage}{\textwidth}
\includegraphics[width=\textwidth]{figs/fig11.png}
\caption{\label{fig11}Hotspot functions in the Moore application with the memory allocation functions from the tcmalloc
library.}
\end{minipage}
\end{figure}
The figures indicate that the {\it tc\_new} function is twice as fast as {\it operator new}. Moreover, a reduction of the total
application time by 5\% was observed when the standard allocation functions were replaced with the functions
from the tcmalloc library.
\subsection{Measuring Profiling Accuracy}
To check the accuracy of the CPU time measurement we compared the results obtained by the Gaudi Intel Profiling Auditor
and by the Gaudi Timer Auditor. The Timer Auditor is invoked at the same points as the Profiling Auditor:
it calculates the difference between the algorithm's finish time and its start time.
Unlike the Gaudi Intel Profiling Auditor, the Timer Auditor measures the exact time spent in the algorithm.
So, we can take the CPU time observed by the Timer Auditor as a reference value. The limitation of the Timer Auditor is that
it creates reports only for algorithm times and cannot provide results at a finer level of granularity
(for functions or code instructions). Therefore, only the CPU times of algorithms were compared.
Since \amp instruments the code before execution, the absolute CPU time measured by the Profiler can
differ from the time measured by the Timer Auditor. But the time distribution over the algorithms should stay the same
in both auditors. So, for the test we took a real HLT application and ran it twice. The first time we used
the Timer Auditor and the second time the Gaudi Intel Profiling Auditor. Then we selected five hotspot algorithms and
calculated their times relative to the top hotspot algorithm. The process was repeated three times with
different numbers of events: 10 (Table~\ref{tevents10}), 100 (Table~\ref{tevents100}) and
1000 events (Table~\ref{tevents1000}):
\begin{table}[H]
\caption{\label{tevents10}10 events}
\begin{center}
\begin{tabular}{rrrr}
\br
Algorithm name & Timer (\%) & Profiler (\%) & \bf{Difference} \\
\mr
L0Muon & 100 & 100 & -\\
Hlt1TrackAllL0Unit & 63.71 & 63.571 & \bf{0.139}\\
FastVeloHlt & 33.065 & 7.143 & \bf{25.922}\\
L0Calo & 8.065 & 0 & \bf{8.065}\\
HltPVsPV3D & 4.032 & 0 & \bf{4.032}\\
\br
\end{tabular}
\end{center}
\end{table}
\begin{table}[H]
\caption{\label{tevents100}100 events}
\begin{center}
\begin{tabular}{rrrr}
\br
Algorithm name & Timer (\%) & Profiler (\%) & \bf{Difference} \\
\mr
L0Muon & 100 & 100 & - \\
Hlt1TrackAllL0Unit & 36.985 & 42.353 & \bf{-5.368}\\
FastVeloHlt & 29.648 & 28.235 & \bf{1.413}\\
L0Calo & 7.94 & 15.294 & \bf{-7.354}\\
HltPVsPV3D & 2.613 & 4 & \bf{-1.387}\\
\br
\end{tabular}
\end{center}
\end{table}
\begin{table}[H]
\caption{\label{tevents1000}1000 events}
\begin{center}
\begin{tabular}{rrrr}
\br
Algorithm name & Timer (\%) & Profiler (\%) & \bf{Difference} \\
\mr
L0Muon & 100 & 100 & -\\
Hlt1TrackAllL0Unit & 35.872 & 35.147 & \bf{0.725}\\
FastVeloHlt & 29.648 & 28.235 & \bf{1.413}\\
L0Calo & 30.478 & 29.736 & \bf{0.742}\\
HltPVsPV3D & 2.491 & 2.25 & \bf{0.241}\\
\br
\end{tabular}
\end{center}
\end{table}
As expected, our test shows that the hotspot algorithms are the same in both auditors and that the accuracy of
the CPU time distribution measured by the Profiler improves as the number of events increases.
As a result, we can be confident that the Profiler identifies the hotspots with high precision.
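The comparison in the tables can be reproduced with a few lines of Python. The helper names \texttt{relative\_to\_top} and \texttt{differences} are ours, the raw times passed to \texttt{relative\_to\_top} are hypothetical, and the percentages are copied from the 1000-event table.
\begin{verbatim}
def relative_to_top(times):
    """Express each time as a percentage of the largest one."""
    top = max(times.values())
    return {name: 100.0 * value / top for name, value in times.items()}


def differences(timer_pct, profiler_pct):
    """Timer minus Profiler, in percentage points, rounded to 3 decimals."""
    return {name: round(timer_pct[name] - profiler_pct[name], 3)
            for name in timer_pct}


# Hypothetical raw times in seconds, only to show the normalization.
print(relative_to_top({"A": 4.0, "B": 1.0}))

# Percentages from the 1000-event table (L0Muon is the 100% reference).
timer = {"Hlt1TrackAllL0Unit": 35.872, "FastVeloHlt": 29.648,
         "L0Calo": 30.478, "HltPVsPV3D": 2.491}
profiler = {"Hlt1TrackAllL0Unit": 35.147, "FastVeloHlt": 28.235,
            "L0Calo": 29.736, "HltPVsPV3D": 2.25}

diff = differences(timer, profiler)
print(diff)
print(max(abs(d) for d in diff.values()))
\end{verbatim}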
\subsection{Custom reports}
The third example demonstrates how custom reports can be created. Basic profiling reports are available in
\amp, but if a custom report is required then a user tool needs to be created. Such a tool can get the CPU usage
data from \amp by using its export function. For example, if we export the CPU Time data shown in
Figure~\ref{fig08} then the following pie chart report can be created.
\begin{figure}[H]
\begin{minipage}{\textwidth}
\begin{center}
\includegraphics[width=100mm]{figs/fig12.png}
\caption{\label{fig12}CPU Time percentage of top-level algorithms in the Gaudi sequence tree.}
\end{center}
\end{minipage}
\end{figure}
The report in Figure~\ref{fig12} was produced by a user application that took the exported CSV data and compiled it into
JavaScript code that can be inserted into any web page \cite{reports}.
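A conversion of this kind might look as follows. This is a sketch under stated assumptions rather than the tool used for the figure: the column names \texttt{Task Type} and \texttt{CPU Time}, the sample rows, and the JavaScript variable name \texttt{pieData} are all hypothetical.
\begin{verbatim}
import csv
import io
import json

# Hypothetical export: one row per top-level algorithm.
EXPORTED = """Task Type,CPU Time
SeqA,30.0
SeqB,15.0
Other,5.0
"""


def to_javascript(text, variable="pieData"):
    """Turn exported CSV rows into a JavaScript array of [label, percent]."""
    rows = list(csv.DictReader(io.StringIO(text)))
    total = sum(float(row["CPU Time"]) for row in rows)
    data = [[row["Task Type"],
             round(100.0 * float(row["CPU Time"]) / total, 1)]
            for row in rows]
    return f"var {variable} = {json.dumps(data)};"


snippet = to_javascript(EXPORTED)
print(snippet)
\end{verbatim}
The resulting line can be pasted into a web page and handed to any charting library that accepts an array of label and value pairs.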
\section{Conclusions}
In this paper we presented the Gaudi Intel Profiling Auditor, a CPU profiling tool that is used
in the LHCb experiment at CERN. This tool integrates the functionality of the \iamp performance profiler into
the LHCb core framework Gaudi. The key advantage of the auditor is the ability to produce reports that use
the framework's modules to present performance analysis results. Those reports help developers to identify hotspots in
the code and improve the application performance. Besides the reports, the Gaudi Intel Profiling Auditor provides
options that make it possible to control the \iamp supervisor process from Gaudi applications.
The results have further strengthened our confidence in the sampling technique for profiling. This technique gives
a reasonable overhead on the total profiling time (about 5\% with \iamp) in comparison with tools that count function calls.
For example, the popular profiling tool Valgrind~\cite{valgrind} counts every code instruction, and programs running
under this tool usually run from five to twenty times slower than outside Valgrind. Though Valgrind provides
precise measurements, with the sampling technique we can get accurate results by tuning the sampling interval or
increasing the number of processed events.
We hope that this example of using \iamp in the Gaudi framework will help in implementing other advanced profiling tools.
\section*{References}
\begin{thebibliography}{9}
\bibitem{modular} Kruse D F and Kruzelecki K 2011 Modular software performance monitoring {\it J. Phys.: Conf. Ser.} {\bf 331} 042014
\bibitem{gaudi} Barrand G et al. 2001 {\it Comput. Phys. Commun.} {\bf 140} 45--55
\bibitem{perfmon2} Eranian S Perfmon2: a standard performance monitoring interface for Linux \\
http://perfmon2.sourceforge.net/perfmon2-20080124.pdf
\bibitem{vtune} \iamp \\ http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe/
\bibitem{perftools} \google Performance Tools \\ http://code.google.com/p/gperftools/
\bibitem{reports} Custom pie chart reports \\ http://amazurov.ru/cern/hltprofilingresults/
\bibitem{valgrind} Valgrind \\ http://valgrind.org/
\end{thebibliography}
\end{document}