UCC_CA_Profile_DIFF_No_DUP_Details.txt
Below is a capture of using a profiler to make small optimizations to the UCC Differencing code.
The source code is shown interleaved with the times captured by AMD CodeAnalyst.
These are just a few examples of what a decent profiler can help with.
This is a text capture of the details from AMD CodeAnalyst timing sampling (current time-based profile)
on a RELEASE (fully optimized) build of UCC (debug symbols and info were also generated to support CodeAnalyst).
UCC.exe was built with Visual C++ 2010 Express as a 32 bit Windows program and run on a 64 bit Windows 7.1 OS, using:
    O2
    W4
    optimize for speed
    Whole program optimization at Link time
    MT
The profile used a statistical time sampling approach.
Operations in the profile included the time to:
<2 extra worker Threads on 2 CPU AMD>
    Read,
    Analyze, Count keywords,
<Single CPU for the rest>
    do Complexity metrics
    and do Differencing with NO Duplicate checks
    and finally produce output files
UCC.exe -nodup -d -threads 2
-dir "C:\C++\boost_1_48_0\tools"
"C:\C++\boost_1_58_0\tools"
-outdir "C:\TEST\UCC\Files_OUT" -ascii
783 files processed in boost_1_48_0\tools (baseline A)
749 files processed in boost_1_58_0\tools (baseline B)
1532 files total
Partial capture of overall Times given as percent of total time used by UCC. Clipped to show highest 93.47% of UCC Time used.
=================================================================================================================================
CS:EIP Symbol + Offset Timer samples
0x439080 CmpMngr::SimilarLine >>>> 35.74 <<<<
0x4b8f80 memchr 19.15
0x418530 std::basic_string<char,std::char_traits<char>,std::allocator<char> >::operator[] 7.4
0x4058e0 std::basic_string<char,std::char_traits<char>,std::allocator<char> >::find 4.04
0x405760 std::basic_string<char,std::char_traits<char>,std::allocator<char> >::assign 2.34
0x4bc0b0 memcpy 2.28
0x4564e0 CUtil::CountTally >>>> 2.28
0x401040 std::char_traits<char>::compare 2.19
0x456060 CUtil::ToLower >>>> 1.5
0x405d50 std::basic_string<char,std::char_traits<char>,std::allocator<char> >::_Copy 1.44
0x418820 std::basic_streambuf<char,std::char_traits<char> >::snextc 0.95
0x457ab0 CUtil::ClearRedundantSpaces >>>> 0.91
0x406160 std::operator+<char,std::char_traits<char>,std::allocator<char> > 0.82
0x4056b0 std::basic_string<char,std::char_traits<char>,std::allocator<char> >::append 0.77
0x406540 std::basic_string<char,std::char_traits<char>,std::allocator<char> >::append 0.75
0x41bf80 std::getline<char,std::char_traits<char>,std::allocator<char> > 0.73
0x4bd68d malloc 0.72
0x457c50 CUtil::ReplaceSmartQuotes 0.72
0x405c50 std::basic_string<char,std::char_traits<char>,std::allocator<char> >::assign 0.72
0x4562c0 CUtil::FindKeyword 0.71
0x405b60 std::basic_string<char,std::char_traits<char>,std::allocator<char> >::append 0.64
0x4114e0 CCJavaCsCounter::LSLOC 0.6
0x47b2e0 DiffTool::CompareFilePaths 0.49
0x405640 std::basic_string<char,std::char_traits<char>,std::allocator<char> >::assign 0.46
0x458d60 std::basic_string<char,std::char_traits<char>,std::allocator<char> >::find_first_not_of 0.43
0x4c913b _read_nolock 0.33
0x4b9205 free 0.29
0x4b9089 operator new 0.29
0x4b902d operator delete 0.29
0x405990 std::basic_string<char,std::char_traits<char>,std::allocator<char> >::_Chassign 0.29
0x4bc23a NO SYMBOL 0.27
0x4ba4e6 __from_strstr_to_strchr 0.26
0x4b8c10 memmove 0.26
0x455900 CTagCounter::CountTagTally 0.26
0x40aa10 std::_Tree<std::_Tmap_traits<std::basic_string<char,std::char_traits<char>,std::allocator<char> > 0.26
0x4bc164 NO SYMBOL 0.22
0x438520 CmpMngr::FindModifiedLines 0.22
0x416fd0 CCodeCounter::CountComplexity 0.22
0x454ff0 CTagCounter::LSLOC 0.21
0x41b590 std::operator+<char,std::char_traits<char>,std::allocator<char> > 0.21
0x4bc25a NO SYMBOL 0.2
0x4bc910 memset 0.18
0x455e80 CUtil::TrimString 0.16
0x4bc246 NO SYMBOL 0.15
0x44b510 CPythonCounter::LSLOC 0.15
45 functions, 94 instructions, Total: 12689 samples, 93.47% of shown samples (don't care about % of other session samples)
The functions below (marked >>>> in the table above) are the most "approachable" for optimization changes.
CmpMngr::SimilarLine
CUtil::CountTally
CUtil::ToLower
CUtil::ClearRedundantSpaces
SimilarLine is the clear candidate for another look.
=================================================================================================================================
Address Line Source Timer samples 35.74 % of TOTAL UCC.exe run time
281 bool CmpMngr::SimilarLine( const string &baseLine, int *x1, const string &compareLine, int *x2 )
0x439080 282 {
283 // Profiling this shows it was called 135,592 times when 12,553 Files were paired for Differencing
284 // Due to being called from an inner LOOP of another inner LOOP (see FindModifiedLines)
285 // Cmd line: split for readability
286 // -threads 2 -nodup -d
287 // -dir "C:\Linux\Stable_3_11_6\linux-3.11.6\arch"
288 // "C:\Linux\linux_3_13_4\arch"
289 // -outdir "C:\TEST\UCC\Files_OUT" -ascii
290 //
291 // 2 changes: Use C style int arrays instead of more general (and slower) std vector container class
292 // Moved allocation/free of work buffers up to Caller level to prevent memory alloc/free thrashing here
293 //
294 bool retVal = false;
295 int m, n, i, j, k;
296 double LCSlen;
0x439086 297 m = (int)baseLine.size();
0x439089 298 n = (int)compareLine.size();
299
300 // Commented out and replaced with C style array passed from Caller
301 // vector<int> x1, x2;
302 // x1.resize(m + 1, 0);
303 // x2.resize(m + 1, 0);
0x439090 304 memset( x1, 0, (m + 1) * sizeof( int ) );
0x4390ae 305 memset( x2, 0, (m + 1) * sizeof( int ) );
306
307 // compute length of LCS
308 // - no need to use CBitMatrix
0x4390ba 309 for (j = n - 1; j >= 0; j--) 0.13
310 {
0x4390c6 311 for (k = 0; k <= m; k++) 0.08
312 {
0x4390d4 313 x2[k] = x1[k]; 2.31
0x4390ca 314 x1[k] = 0; 4.69
315 }
0x4390e5 316 for (i = m - 1; i >= 0; i--) 4.38
317 {
0x4390f7 318 if (baseLine[i] == compareLine[j]) 15.44
319 {
0x439114 320 x1[i] = 1 + x2[i+1]; 1.84
321 }
0x43911e 322 else 0.12
323 {
0x439120 324 if (x1[i+1] > x2[i]) 5.69
325 {
0x4390ec 326 x1[i] = x1[i+1]; 0.32
327 }
0x43912f 328 else 0.18
329 {
0x439131 330 x1[i] = x2[i]; 0.49
331 }
332 }
333 }
334 }
0x439144 335 LCSlen = x1[0];
336 if ((LCSlen / (double)m * 100 >= MATCH_THRESHOLD) &&
0x439146 337 (LCSlen / (double)n * 100 >= MATCH_THRESHOLD)) 0.07
0x439173 338 retVal = true;
339
340 return retVal;
0x439175 341 } 0.01
61 lines, 0 instructions, Summary: 4852 samples, 35.74% of shown samples, (don't care about 12.28% of total samples which includes all other running processes)
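For readers who want to try the routine outside UCC, here is a minimal standalone sketch of the same logic with caller-owned work buffers, as described in the comments of the listing. The threshold value of 60, the empty-line guard and the CountSimilar helper are assumptions made for illustration; they are not taken from the UCC sources.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Assumed value for illustration only; UCC defines its own MATCH_THRESHOLD.
static const double MATCH_THRESHOLD = 60.0;

// Same logic as the profiled listing. x1 and x2 are caller-owned work
// buffers that must each hold at least baseLine.size() + 1 ints.
bool SimilarLine(const std::string &baseLine, int *x1,
                 const std::string &compareLine, int *x2)
{
    assert(x1 != NULL && x2 != NULL);
    const int m = static_cast<int>(baseLine.size());
    const int n = static_cast<int>(compareLine.size());
    if (m == 0 || n == 0)
        return false;   // guard added in this sketch: avoids dividing by zero

    std::memset(x1, 0, (m + 1) * sizeof(int));
    std::memset(x2, 0, (m + 1) * sizeof(int));

    // Compute the length of the longest common subsequence (LCS).
    for (int j = n - 1; j >= 0; j--) {
        for (int k = 0; k <= m; k++) { x2[k] = x1[k]; x1[k] = 0; }
        for (int i = m - 1; i >= 0; i--) {
            if (baseLine[i] == compareLine[j])
                x1[i] = 1 + x2[i + 1];
            else
                x1[i] = (x1[i + 1] > x2[i]) ? x1[i + 1] : x2[i];
        }
    }
    const double LCSlen = x1[0];
    return (LCSlen / m * 100 >= MATCH_THRESHOLD) &&
           (LCSlen / n * 100 >= MATCH_THRESHOLD);
}

// Hypothetical caller: size the buffers once for the longest base line,
// then reuse them for every comparison instead of allocating per call.
int CountSimilar(const std::vector<std::string> &baseLines,
                 const std::string &compareLine)
{
    std::size_t longest = 0;
    for (std::size_t i = 0; i < baseLines.size(); i++)
        if (baseLines[i].size() > longest) longest = baseLines[i].size();

    std::vector<int> x1(longest + 1), x2(longest + 1);
    int hits = 0;
    for (std::size_t i = 0; i < baseLines.size(); i++)
        if (SimilarLine(baseLines[i], &x1[0], compareLine, &x2[0]))
            ++hits;
    return hits;
}
```

The point of CountSimilar is only to show the buffer ownership: one allocation sized for the longest line serves every call, which is what moving the allocation up to the caller achieves.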
=================================================================================================================================
Timing Interpretation: First inner loop
There are 3 LOOPs: 1 outer with 2 separate inner LOOPs (the outer loop has extremely low overhead, so it is not discussed below).
The first inner loop takes just over 7 % of TOTAL run time (0.08 + 2.31 + 4.69 = 7.08).
for (k = 0; k <= m; k++) 0.08
{
x2[k] = x1[k]; 2.31
x1[k] = 0; 4.69
}
Using CodeAnalyst to show the disassembly and Code Bytes (x86 ASM, as used by AMD or Intel, in this case):
Address Line Source Code Bytes Timer samples
307 // compute length of LCS
308 // - no need to use CBitMatrix
0x4390ba 309 for (j = n - 1; j >= 0; j--) 0.13
0x4390ba mov eax,[ebp-10h] 8B 45 F0 0.01
0x4390bd add esp,18h 83 C4 18
0x4390c0 dec eax 48
0x4390c1 mov [ebp-04h],eax 89 45 FC
0x4390c4 js $+80h (0x439144) 78 7E
----- break -----
0x43913f dec dword [ebp-04h] FF 4D FC
0x439142 jns $-7ch (0x1004390c6) 79 82 0.12
310 {
0x4390c6 311 for (k = 0; k <= m; k++) 0.08
0x4390c6 test esi,esi 85 F6 0.06
0x4390c8 js $+1dh (0x4390e5) 78 1B 0.02
312 {
0x4390d4 313 x2[k] = x1[k]; 2.31
0x4390d4 mov edi,[eax] 8B 38
0x4390d6 mov [ecx+eax],edi 89 3C 01 2.31
0x4390ca 314 x1[k] = 0; 4.69
0x4390ca mov ecx,[ebp+18h] 8B 4D 18
0x4390cd mov eax,ebx 8B C3 0.01
0x4390cf sub ecx,ebx 2B CB 0.06
0x4390d1 lea edx,[esi+01h] 8D 56 01
----- break -----
0x4390d9 mov [eax],00000000h C7 00 00 00 00 00 0.27
0x4390df add eax,04h 83 C0 04 1.84
0x4390e2 dec edx 4A 2.5
0x4390e3 jnz $-0fh (0x1004390d4) 75 EF
315 }
9 lines, 21 instructions, Summary: 978 samples, 7.20% of shown samples
Referring to Lines 311 to 315:
Those familiar with Intel ASM will see that the use of CPU registers could be better.
ECX is not used effectively as a counter.
The surprise to me is the time needed just to zero x1[k]: over 4 1/2 percent of TOTAL runtime!
Comment out the loop and replace it with
memcpy( x2, x1, ( m + 1 ) * sizeof( int ) );
memset( x1, 0, ( m + 1 ) * sizeof( int ) );
which does exactly the same work using optimized C library code.
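A quick way to gain confidence in this replacement is to run both forms on the same data and compare the buffers. The sketch below does that; the function names ShiftRowsLoop and ShiftRowsLibrary are made up for this example and do not exist in UCC.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Original form: copy and clear one element at a time.
void ShiftRowsLoop(int *x1, int *x2, int m)
{
    for (int k = 0; k <= m; k++) {
        x2[k] = x1[k];
        x1[k] = 0;
    }
}

// Replacement: the same effect through the C library.
// x1 and x2 must not overlap, which memcpy requires.
void ShiftRowsLibrary(int *x1, int *x2, int m)
{
    assert(m >= 0);
    const std::size_t bytes = (static_cast<std::size_t>(m) + 1) * sizeof(int);
    std::memcpy(x2, x1, bytes);
    std::memset(x1, 0, bytes);
}
```

The equivalence relies on x1 and x2 being separate buffers, which holds here because the caller allocates them separately.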
=================================================================================================================================
Timing Interpretation: Second inner loop
Second inner loop 28.46 % of TOTAL run time
for (i = m - 1; i >= 0; i--) 4.38
{
if (baseLine[i] == compareLine[j]) 15.44
{
x1[i] = 1 + x2[i+1]; 1.84
}
else 0.12
{
if (x1[i+1] > x2[i]) 5.69
{
x1[i] = x1[i+1]; 0.32
}
else 0.18
{
x1[i] = x2[i]; 0.49
}
}
}
Address Line Source Code Bytes Timer samples
0x4390e5 316 for (i = m - 1; i >= 0; i--) 4.38
0x4390e5 lea edi,[esi-01h] 8D 7E FF
0x4390e8 test edi,edi 85 FF 0.49
0x4390ea js $+55h (0x43913f) 78 53 0.05
----- break -----
0x439133 sub esi,04h 83 EE 04 2.2
0x439136 dec edi 4F 0.57
0x439137 jns $-40h (0x1004390f7) 79 BE 0.65
0x439139 mov ebx,[ebp+10h] 8B 5D 10 0.03
0x43913c mov esi,[ebp-0ch] 8B 75 F4 0.38
317 {
0x4390f7 318 if (baseLine[i] == compareLine[j]) 15.44
0x4390f7 mov ebx,[ebp+0ch] 8B 5D 0C 1.22
0x4390fa cmp [ebx+14h],10h 83 7B 14 10 1.16
0x4390fe jb $+04h (0x439102) 72 02 1.25
0x439100 mov ebx,[ebx] 8B 1B 1.08
0x439102 mov ecx,[ebp-04h] 8B 4D FC 1.07
0x439105 mov eax,[ebp+14h] 8B 45 14 0.69
0x439108 call $-00020bd8h (0x100418530) E8 23 F4 FD FF 0.59
0x43910d mov cl,[ebx+edi] 8A 0C 3B 3.03
0x439110 cmp cl,[eax] 3A 08 1.55
0x439112 jnz $+0eh (0x439120) 75 0C 3.79
319 {
0x439114 320 x1[i] = 1 + x2[i+1]; 1.84
0x439114 mov edx,[ebp-08h] 8B 55 F8 0.1
0x439117 mov eax,[esi+edx+04h] 8B 44 16 04 1.31
0x43911b inc eax 40 0.29
0x43911c mov [esi],eax 89 06 0.14
321 }
0x43911e 322 else 0.12
0x43911e jmp $+15h (0x439133) EB 13 0.12
323 {
0x439120 324 if (x1[i+1] > x2[i]) 5.69
0x439120 mov ecx,[ebp-08h] 8B 4D F8 2.24
0x439123 mov eax,[esi+04h] 8B 46 04 0.94
0x439126 mov ecx,[esi+ecx] 8B 0C 0E 0.86
0x439129 cmp eax,ecx 3B C1 0.68
0x43912b jle $+06h (0x439131) 7E 04 0.97
325 {
0x4390ec 326 x1[i] = x1[i+1]; 0.32
0x4390ec mov eax,[ebp+18h] 8B 45 18 0.12
0x4390ef sub eax,ebx 2B C3 0.06
0x4390f1 lea esi,[ebx+edi*4] 8D 34 BB
0x4390f4 mov [ebp-08h],eax 89 45 F8
----- break -----
0x43912d mov [esi],eax 89 06 0.14
327 }
0x43912f 328 else 0.18
0x43912f jmp $+04h (0x439133) EB 02 0.18
329 {
0x439131 330 x1[i] = x2[i]; 0.49
0x439131 mov [esi],ecx 89 0E 0.49
331 }
332 }
333 }
334 }
19 lines, 37 instructions, Summary: 3863 samples, 28.46% of shown samples
2 Approaches: one is easy; the other requires taking notes during a Debug session of the inner loop and some thought (sometime).
EASY: compareLine[j] does NOT change within the inner loop, so it can be read once before the loop. The new Second inner loop is:
cmp_j = compareLine[ j ];
for ( i = m - 1; i >= 0; i-- )
{
if ( baseLine[i] == cmp_j ) // Should be faster. Need to profile changes now... see below
{
x1[i] = 1 + x2[i+1];
}
else
{
if (x1[i+1] > x2[i])
{
x1[i] = x1[i+1];
}
else
{
x1[i] = x2[i];
}
}
}
New Profiler run with the above changes applied, compared with the run at the top of this file.
Partial capture of overall Times given as percent of total time used by UCC. Clipped to show highest 90.70% of UCC Time used.
=================================================================================================================================
CS:EIP Symbol + Offset Timer samples
0x439090 CmpMngr::SimilarLine 30.92 was 35.74
0x4b8f60 memchr 20.81 was 19.15
0x4bc090 memcpy 4.74 was 2.28
0x4058e0 std::basic_string<char,std::char_traits<char>,std::allocator<char> >::find 4.39 was 4.04
0x455e20 CUtil::CountTally 2.86 was 2.28
0x401040 std::char_traits<char>::compare 2.52
0x405760 std::basic_string<char,std::char_traits<char>,std::allocator<char> >::assign 2.45
0x4bf0e7 _VEC_memcpy 1.96
0x405d50 std::basic_string<char,std::char_traits<char>,std::allocator<char> >::_Copy 1.76
0x4c553d _VEC_memzero 1.71
0x4bc8f0 memset 1.39
0x4559a0 CUtil::ToLower 1.38
0x418e20 std::basic_streambuf<char,std::char_traits<char> >::snextc 1.06
0x4573f0 CUtil::ClearRedundantSpaces 1.02
0x4056b0 std::basic_string<char,std::char_traits<char>,std::allocator<char> >::append 0.86
0x41bea0 std::getline<char,std::char_traits<char>,std::allocator<char> > 0.86
0x405c50 std::basic_string<char,std::char_traits<char>,std::allocator<char> >::assign 0.84
0x4061c0 std::operator+<char,std::char_traits<char>,std::allocator<char> > 0.82
0x457590 CUtil::ReplaceSmartQuotes 0.81
0x4065a0 std::basic_string<char,std::char_traits<char>,std::allocator<char> >::append 0.76
0x411af0 CCJavaCsCounter::LSLOC 0.74
0x4bd66d malloc 0.74
0x405640 std::basic_string<char,std::char_traits<char>,std::allocator<char> >::assign 0.73
0x47b0d0 DiffTool::CompareFilePaths 0.7
0x405b60 std::basic_string<char,std::char_traits<char>,std::allocator<char> >::append 0.61
0x455c00 CUtil::FindKeyword 0.51
0x4c90eb _read_nolock 0.4
0x4b900d operator delete 0.38
0x405990 std::basic_string<char,std::char_traits<char>,std::allocator<char> >::_Chassign 0.37
0x4b8bf0 memmove 0.34
0x4b9069 operator new 0.34
0x4b91e5 free 0.31
0x455240 CTagCounter::CountTagTally 0.3
0x4ba4c6 __from_strstr_to_strchr 0.3
34 functions, 34 instructions, Total: 9627 samples, 90.70% of shown samples
Because the percentage used by SimilarLine decreased, the percentages for other procedures will increase,
but the overall runtime is still lower.
Along with making SimilarLine faster, the other benefit is that this entry from the first run is now missing from the list:
0x418530 std::basic_string<char,std::char_traits<char>,std::allocator<char> >::operator[] 7.4
Getting rid of so many uses of operator[] is a good thing!
Note that find() (4.39, was 4.04) and assign() (2.45, was 2.34) are still present in the new run; they only moved down the list because memcpy moved up.
The copy and zero work now runs inside the C library, so some of the time removed from SimilarLine most likely reappears under
memcpy (4.74, was 2.28), _VEC_memcpy (1.96), _VEC_memzero (1.71) and memset (1.39, was 0.18).
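Percentages from two runs are hard to compare directly because each one is relative to its own total. A rough cross-check is to work from the raw sample counts in the SimilarLine summaries in this file: 4852 samples (35.74 %) before and 3278 samples (30.88 %) after. The sketch below does that arithmetic. It assumes both runs used the same sampling interval, which the capture does not state, and the helper names are made up for this example.

```cpp
#include <cassert>

// Estimated total UCC samples in a run, derived from one function's
// sample count and its share of the total (in percent).
double EstimateTotalSamples(double functionSamples, double percentOfTotal)
{
    assert(percentOfTotal > 0.0);
    return functionSamples * 100.0 / percentOfTotal;
}

// Relative reduction from 'before' to 'after', in percent.
double PercentReduction(double before, double after)
{
    assert(before > 0.0);
    return (before - after) * 100.0 / before;
}
```

With the numbers above, PercentReduction( 4852, 3278 ) is roughly 32, so SimilarLine collected about a third fewer samples. The estimated totals are roughly 13576 and 10615 samples, a drop of roughly 22 %. That is more than the SimilarLine change alone accounts for, so run to run variation probably plays a part as well.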
=================================================================================================================================
Address Line Source Code Bytes Timer samples 30.92 % of TOTAL UCC.exe run time
0x4390cf 311 for (j = n - 1; j >= 0; j--) 0.5
312 {
313 // Left as example of previous code optimized for speed
314 // for ( k = 0; k <= m; k++ )
315 // {
316 // x2[ k ] = x1[ k ];
317 // x1[ k ] = 0;
318 // }
0x4390e0 319 memcpy( x2, x1, ( m + 1 ) * sizeof( int ) ); 0.21 <<< BIG change here. About 1/3 of a percent at these two call sites
0x4390ee 320 memset( x1, 0, ( m + 1 ) * sizeof( int ) ); 0.13 <<< instead of 7 % of Total time (time spent inside memcpy/memset is counted under those functions)
321
0x4390f7 322 cmp_j = compareLine[ j ]; 0.24 <<< ADDED 1/4 percent overhead to TOTAL runtime
323
0x439105 324 for ( i = m - 1; i >= 0; i-- ) 4.45
325 {
0x43911a 326 if ( baseLine[i] == cmp_j ) 14.24 <<< 1.2 % improvement. Was 15.44 %
327 { <<< Including overhead above slightly under 1 % of TOTAL time improvement
0x43912d 328 x1[i] = 1 + x2[i+1]; 1.84
329 }
0x439134 330 else
331 {
0x439136 332 if (x1[i+1] > x2[i]) 6.1
333 {
0x439112 334 x1[i] = x1[i+1]; 0.62
335 }
0x439142 336 else 0.26
337 {
0x439144 338 x1[i] = x2[i]; 2.2
339 }
340 }
341 }
342 }
0x439155 343 LCSlen = x1[0];
344 if ((LCSlen / (double)m * 100 >= MATCH_THRESHOLD) &&
0x439157 345 (LCSlen / (double)n * 100 >= MATCH_THRESHOLD)) 0.1
0x439184 346 retVal = true;
347
348 return retVal;
0x439186 349 } 0.01
39 lines, 0 instructions, Summary: 3278 samples, 30.88% of shown samples
=================================================================================================================================
Details of what works better now
memcpy( x2, x1, ( m + 1 ) * sizeof( int ) );
memset( x1, 0, ( m + 1 ) * sizeof( int ) );
Address Line Source Code Bytes Timer samples
0x4390e0 319 memcpy( x2, x1, ( m + 1 ) * sizeof( int ) ); 0.21
0x4390e0 mov esi,[ebp-08h] 8B 75 F8 0.07
0x4390e3 mov ecx,[ebp+14h] 8B 4D 14 0.08
0x4390e6 push esi 56
0x4390e7 push ebx 53
0x4390e8 push ecx 51 0.07
0x4390e9 call $+00082fa7h (0x4bc090) E8 A2 2F 08 00
0x4390ee 320 memset( x1, 0, ( m + 1 ) * sizeof( int ) ); 0.13
0x4390ee push esi 56 0.05
0x4390ef push byte 00h 6A 00 0.01
0x4390f1 push ebx 53 0.08
0x4390f2 call $+000837feh (0x4bc8f0) E8 F9 37 08 00
2 lines, 10 instructions, Summary: 36 samples, 0.34% of shown samples
if ( baseLine[i] == cmp_j ) NEW version
Address Line Source Code Bytes Timer samples
0x43911a 326 if ( baseLine[i] == cmp_j ) 14.24
0x43911a mov esi,[ebp+0ch] 8B 75 0C 1.47
0x43911d cmp [esi+14h],10h 83 7E 14 10 2.89
0x439121 jb $+04h (0x439125) 72 02 3.42
0x439123 mov esi,[esi] 8B 36 0.34
0x439125 mov dl,[ebp-01h] 8A 55 FF 2
0x439128 cmp [esi+ecx],dl 38 14 0E 1.31 << BYTE (char size) compare
0x43912b jnz $+0bh (0x439136) 75 09 2.81
if (baseLine[i] == compareLine[j]) OLD version
Address Line Source Code Bytes Timer samples
0x4390f7 318 if (baseLine[i] == compareLine[j]) 15.44
0x4390f7 mov ebx,[ebp+0ch] 8B 5D 0C 1.22
0x4390fa cmp [ebx+14h],10h 83 7B 14 10 1.16
0x4390fe jb $+04h (0x439102) 72 02 1.25
0x439100 mov ebx,[ebx] 8B 1B 1.08
0x439102 mov ecx,[ebp-04h] 8B 4D FC 1.07
0x439105 mov eax,[ebp+14h] 8B 45 14 0.69
0x439108 call $-00020bd8h (0x100418530) E8 23 F4 FD FF 0.59 << used to be Call to library code before compare
0x43910d mov cl,[ebx+edi] 8A 0C 3B 3.03
0x439110 cmp cl,[eax] 3A 08 1.55 << BYTE compare
0x439112 jnz $+0eh (0x439120) 75 0C 3.79
Details of Second inner loop. This is a good candidate for more optimizations...
Address Line Source Code Bytes Timer samples
0x439105 324 for ( i = m - 1; i >= 0; i-- ) 4.45
0x439105 mov ecx,[ebp-10h] 8B 4D F0 0.03
0x439108 mov dl,[eax+edi] 8A 14 38 0.01
0x43910b mov [ebp-01h],dl 88 55 FF 0.06
0x43910e test ecx,ecx 85 C9 0.05
0x439110 js $+3fh (0x43914f) 78 3D 0.08
----- break -----
0x439146 sub eax,04h 83 E8 04 2.59
0x439149 dec ecx 49 1.18
0x43914a jns $-30h (0x10043911a) 79 CE 0.44
0x43914c mov edi,[ebp-0ch] 8B 7D F4 0.02
325 {
0x43911a 326 if ( baseLine[i] == cmp_j ) 14.24
0x43911a mov esi,[ebp+0ch] 8B 75 0C 1.47
0x43911d cmp [esi+14h],10h 83 7E 14 10 2.89
0x439121 jb $+04h (0x439125) 72 02 3.42
0x439123 mov esi,[esi] 8B 36 0.34
0x439125 mov dl,[ebp-01h] 8A 55 FF 2
0x439128 cmp [esi+ecx],dl 38 14 0E 1.31
0x43912b jnz $+0bh (0x439136) 75 09 2.81
327 {
0x43912d 328 x1[i] = 1 + x2[i+1]; 1.84
0x43912d mov edx,[edi+eax+04h] 8B 54 07 04 0.13
0x439131 inc edx 42 1.52
0x439132 mov [eax],edx 89 10 0.19
329 }
0x439134 330 else
0x439134 jmp $+12h (0x439146) EB 10
331 {
0x439136 332 if (x1[i+1] > x2[i]) 6.1
0x439136 mov edx,[eax+04h] 8B 50 04 2.63
0x439139 mov esi,[edi+eax] 8B 34 07 2.83
0x43913c cmp edx,esi 3B D6 0.08
0x43913e jle $+06h (0x439144) 7E 04 0.56
333 {
0x439112 334 x1[i] = x1[i+1]; 0.62
0x439112 mov edi,[ebp+14h] 8B 7D 14
0x439115 lea eax,[ebx+ecx*4] 8D 04 8B 0.01
0x439118 sub edi,ebx 2B FB 0.03
----- break -----
0x439140 mov [eax],edx 89 10 0.58
335 }
0x439142 336 else 0.26
0x439142 jmp $+04h (0x439146) EB 02 0.26
337 {
0x439144 338 x1[i] = x2[i]; 2.2
0x439144 mov [eax],esi 89 30 2.2
339 }
340 }
341 }
342 }
19 lines, 32 instructions, Summary: 3152 samples, 29.70% of shown samples
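One possible further step, suggested by the listing above: the first instructions of line 326 (cmp [esi+14h],10h / jb / mov esi,[esi]) appear to be the std::string check for whether the characters sit in the string's inline buffer or on the heap, and they run on every iteration. Fetching the character pointer once with data() before the loops should let the compiler skip that check. This is a sketch that has NOT been profiled; the function name is made up, and it returns the LCS length instead of applying the threshold so that it is easy to test.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Returns the LCS length of baseLine and compareLine.
// x1 and x2 are caller-owned buffers of at least baseLine.size() + 1 ints.
int LcsLengthCachedPtr(const std::string &baseLine, int *x1,
                       const std::string &compareLine, int *x2)
{
    assert(x1 != NULL && x2 != NULL);
    const int m = static_cast<int>(baseLine.size());
    const int n = static_cast<int>(compareLine.size());
    const std::size_t bytes = (static_cast<std::size_t>(m) + 1) * sizeof(int);

    // Fetch the character pointers once, outside both loops.
    const char *base = baseLine.data();
    const char *cmp  = compareLine.data();

    std::memset(x1, 0, bytes);
    std::memset(x2, 0, bytes);
    for (int j = n - 1; j >= 0; j--) {
        std::memcpy(x2, x1, bytes);
        std::memset(x1, 0, bytes);
        const char cmp_j = cmp[j];      // invariant within the inner loop
        for (int i = m - 1; i >= 0; i--) {
            if (base[i] == cmp_j)
                x1[i] = 1 + x2[i + 1];
            else
                x1[i] = (x1[i + 1] > x2[i]) ? x1[i + 1] : x2[i];
        }
    }
    return x1[0];
}
```

Whether this actually helps would need another profiler run; the compiler may or may not already keep the pointer in a register on other builds.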
=================================================================================================================================
Next steps
Analysis/Debug
Is it possible to refactor the code in
the Second inner loop to completely avoid the use of arrays?
That would be one of the focus points of the Debug session.
I am guessing that about 6 to 9 or so int variables could be used instead...
IF the arrays are not needed,
then some overhead in the Calling code for the x1 and x2 arrays would be gone as well.
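As a starting point for that investigation: the recurrence reads a whole previous row, so it seems unlikely that a fixed handful of ints could replace the arrays for lines of arbitrary length. What does look possible is dropping x2 and the per-pass copy, by keeping a single row plus one scalar that remembers the value just overwritten. The sketch below is an untested idea as far as UCC is concerned (not profiled, hypothetical function name); it returns the LCS length so it can be checked against known answers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// LCS length using a single caller-owned row of baseLine.size() + 1 ints.
int LcsLengthOneRow(const std::string &baseLine,
                    const std::string &compareLine, int *row)
{
    assert(row != NULL);
    const int m = static_cast<int>(baseLine.size());
    const int n = static_cast<int>(compareLine.size());
    const char *base = baseLine.data();

    std::memset(row, 0, (static_cast<std::size_t>(m) + 1) * sizeof(int));

    for (int j = n - 1; j >= 0; j--) {
        const char cmp_j = compareLine[j];
        int diag = 0;                 // previous pass's row[i + 1]; row[m] is always 0
        for (int i = m - 1; i >= 0; i--) {
            const int prev = row[i];  // previous pass's value, the old x2[i]
            if (base[i] == cmp_j)
                row[i] = 1 + diag;    // was: 1 + x2[i+1]
            else
                row[i] = (row[i + 1] > prev) ? row[i + 1] : prev;
            diag = prev;              // becomes the old x2[i+1] for the next i
        }
    }
    return row[0];
}
```

If this holds up under profiling, the caller would need to provide only one work buffer, and the memcpy and memset inside the outer loop would no longer be needed.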
Hopefully this has helped show how simple use of Profiler results can benefit UCC.
Have Fun!
Randy Maxwell