This project consists of three parts:
- Comparing several hash functions in a dictionary-based application in terms of value distribution.
- Writing a terminal application that can print the definition of a single word, or convert a .txt file into an html page with the same text, where a definition pops up on mouse hover over each word.
- Analyzing the performance of this application and optimizing it with the use of assembly.
- Comparing hash functions
- Application
- Optimization
A shortened dictionary database has been used, containing 5608 words in total. For greater objectivity, 5 hash table sizes have been considered (521, 1031, 2053, 4099, 8209). The hash functions being compared are the following:
Simply H(string) = 0.
Obviously, this hash function is not applicable to anything.
H(string) = length(string)
The flaw of this hash function is that most of the elements are concentrated in the first few buckets of the hash table. This is due to the fact that the average English word is only 4.7 characters long, and the distribution remains the same as the number of buckets increases.
H(string) = string[0]
Almost the same as the previous one: the distribution doesn't change with table size, and no sign of uniformity can be seen.
H(string) = sum for i from 0 to length(string) - 1 of string[i]
And again the distribution stays almost the same across table sizes, though this time it's much better: words no longer pile up at the beginning, since their hashes exceed the hash table's size. The range where most elements concentrate can be estimated as [97 * 4.7, 122 * 4.7] = [455.9, 573.4] (97 is ASCII 'a', 122 is ASCII 'z', and 4.7 is the average length of an English word). The plot clearly shows a local peak in this interval, although there are others, the highest one being located at about [700, 800]. This can be explained by the fact that the dictionary database includes a plethora of long archaic words, which shift the distribution to the right.
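For reference, the three non-trivial hash functions above can be sketched as follows (the function names are mine, not necessarily the ones used in the project):

```cpp
#include <cstdint>
#include <cstring>

// H(string) = length(string)
uint32_t getLengthHash(const char* string)
{
    return (uint32_t)strlen(string);
}

// H(string) = string[0]
uint32_t getFirstCharHash(const char* string)
{
    return (uint32_t)(unsigned char)string[0];
}

// H(string) = sum of all characters of string
uint32_t getCharSumHash(const char* string)
{
    uint32_t hash = 0;
    for (; *string != '\0'; ++string)
        hash += (unsigned char)*string;
    return hash;
}
```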
H_0(string) = 0; H_i+1(string) = rightRotate(H_i(string)) xor string[i]
H_0(string) = 0; H_i+1(string) = leftRotate(H_i(string)) xor string[i]
As can be seen from the diagrams, left rotation does a better job than right rotation. This can probably be explained by the nature of the data we are testing the hash table on: each byte is an ASCII character and therefore less than 128, which means the most significant bit of each byte is 0. Let us consider an example: take the first character to be 'k', i.e. 107 in decimal or 01101011 in binary. For simplicity, suppose the hash is 16-bit, not 32-bit.
Not extended ASCII range
|=================|
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
----------------------- ----------------------
0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 1 hash = 'k'
^
1 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 rightRotate(hash, 1)
^
...
* * * * * * * 1 0 * * * * * * *
^
During the first 8 iterations of the hash function, the 7th bit is always 0, i.e. it doesn't influence the resulting hash at all, which leads to a more uneven distribution than that of the left rotate. If we now look at the left rotation:
Not extended ASCII range
|=================|
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
----------------------- ----------------------
0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 1 hash = 'k'
0 0 0 0 0 0 0 0 1 1 0 1 0 1 1 0 leftRotate(hash, 1)
it's clear that all 8 least significant bits influence the result.
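The two rotating hashes can be sketched like this (a minimal version; the project's actual code may differ in details):

```cpp
#include <cstdint>

static uint32_t leftRotate (uint32_t x) { return (x << 1) | (x >> 31); }
static uint32_t rightRotate(uint32_t x) { return (x >> 1) | (x << 31); }

// H_0 = 0; H_{i+1} = leftRotate(H_i) xor string[i]
uint32_t getXorLeftRotateHash(const char* string)
{
    uint32_t hash = 0;
    for (; *string != '\0'; ++string)
        hash = leftRotate(hash) ^ (unsigned char)*string;
    return hash;
}

// H_0 = 0; H_{i+1} = rightRotate(H_i) xor string[i]
uint32_t getXorRightRotateHash(const char* string)
{
    uint32_t hash = 0;
    for (; *string != '\0'; ++string)
        hash = rightRotate(hash) ^ (unsigned char)*string;
    return hash;
}
```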
The 32-bit third version of Austin Appleby's MurmurHash (MurmurHash3). See this article for more information on the subject.
So far, this algorithm looks pretty solid - it produces an even distribution and has the lowest standard deviation.
The classical cyclic redundancy check algorithm, which uses the exclusive-or operation and bitwise shifts, as well as a lookup table. For more information, read this.
Unsurprisingly, this hashing algorithm has the best bucket distribution, which is expected: it performs significantly more mixing per byte and is the product of extensive testing and research.
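A minimal table-driven CRC-32 sketch, using the MSB-first variant with the standard polynomial 0x04C11DB7 (the project's exact polynomial and initial value may differ):

```cpp
#include <cstdint>

static uint32_t CRC32_TABLE[256];

// Precompute the CRC of every possible top byte.
static void initCrc32Table()
{
    for (uint32_t i = 0; i < 256; ++i)
    {
        uint32_t value = i << 24;
        for (int bit = 0; bit < 8; ++bit)
            value = (value & 0x80000000u) ? (value << 1) ^ 0x04C11DB7u
                                          : (value << 1);
        CRC32_TABLE[i] = value;
    }
}

// One table lookup per input byte instead of eight bit-by-bit steps.
uint32_t getCrc32Hash(const char* string)
{
    uint32_t hash = 0;
    for (; *string != '\0'; ++string)
        hash = (hash << 8) ^ CRC32_TABLE[((hash >> 24) ^ (unsigned char)*string) & 0xFF];
    return hash;
}
```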
More diagrams for sizes 521, 2053, 8209 as well as the shortened dictionary used can be found in "bin/res/00_compare_hashes/".
Here are the standard deviations for all the tests:
Algorithm | 521 | 1031 | 2053 | 4099 | 8209 |
---|---|---|---|---|---|
Constant | 245.28 | 174.45 | 123.65 | 87.52 | 61.85 |
String length | 76.54 | 54.67 | 38.84 | 27.52 | 19.46 |
First character | 60.02 | 43.00 | 30.60 | 21.70 | 15.35 |
Characters sum | 7.41 | 4.98 | 4.36 | 3.36 | 2.47 |
Xor and right rotate | 8.20 | 4.18 | 2.28 | 1.74 | 1.25 |
Xor and left rotate | 3.57 | 2.53 | 1.88 | 1.35 | 0.95 |
Murmur3 | 3.51 | 2.60 | 1.85 | 1.26 | 0.91 |
CRC32 | 3.16 | 2.30 | 1.63 | 1.15 | 0.83 |
CRC32, Murmur3 and the algorithm using exclusive or alongside left bitwise rotation have shown the best results.
Each hash function has been run 10^9 times on the same strings of lengths 3, 5, 10, 15 and 255, and the average execution time has been calculated for each of the flags -O0 through -O3. The strings are:
- "The" (half average-length word)
- "Meow!" (average-length word)
- "Witchcraft" (double average-length word)
- "Conceptualizing" (triple average-length word)
- "Evil is evil. Lesser, greater, middling, it's all the same. Proportions are negotiated, boundaries blurred. I'm not a pious hermit. I haven't done only good in my life. But if I'm to choose between one evil and another, then I prefer not to choose at all." (sentence)
The entire log of the tests is located in speed_comparison.txt:
====Testing speed of hash functions -O0====
g++ -o bin/intermediates/00_compare_hashes/main_speed_tests.o -c src/00_compare_hashes/main_speed_tests.cpp -O0 -DNDEBUG -w
g++ -o bin/intermediates/hash_functions.o -c src/hash_functions.cpp -O0 -DNDEBUG -w
g++ -o bin/00_speed_tests.out -O0 bin/intermediates/00_compare_hashes/main_speed_tests.o bin/intermediates/hash_functions.o
Testing constant hash
Sample 0 [3]: 2.24635 ns (Half average-length word)
Sample 1 [5]: 1.77401 ns (Average-length word)
Sample 2 [10]: 1.72516 ns (Double average-length word)
Sample 3 [15]: 1.71477 ns (Triple average-length word)
Sample 4 [255]: 1.71628 ns (Sentence)
...
====Testing speed of hash functions -O3====
g++ -o bin/intermediates/00_compare_hashes/main_speed_tests.o -c src/00_compare_hashes/main_speed_tests.cpp -O3 -DNDEBUG -w
g++ -o bin/intermediates/hash_functions.o -c src/hash_functions.cpp -O3 -DNDEBUG -w
g++ -o bin/00_speed_tests.out -O3 bin/intermediates/00_compare_hashes/main_speed_tests.o bin/intermediates/hash_functions.o
Testing constant hash
Sample 0 [3]: 1.27853 ns (Half average-length word)
Sample 1 [5]: 1.2118 ns (Average-length word)
Sample 2 [10]: 1.20717 ns (Double average-length word)
Sample 3 [15]: 1.19389 ns (Triple average-length word)
Sample 4 [255]: 1.19498 ns (Sentence)
...
Testing xor left rotate hash
Sample 0 [3]: 7.02953 ns (Half average-length word)
Sample 1 [5]: 9.87868 ns (Average-length word)
Sample 2 [10]: 17.2935 ns (Double average-length word)
Sample 3 [15]: 24.6277 ns (Triple average-length word)
Sample 4 [255]: 382.953 ns (Sentence)
Testing murmur3 hash
Sample 0 [3]: 4.69705 ns (Half average-length word)
Sample 1 [5]: 5.07867 ns (Average-length word)
Sample 2 [10]: 5.81174 ns (Double average-length word)
Sample 3 [15]: 6.22499 ns (Triple average-length word)
Sample 4 [255]: 72.6188 ns (Sentence)
Testing crc32 hash
Sample 0 [3]: 3.30064 ns (Half average-length word)
Sample 1 [5]: 4.28765 ns (Average-length word)
Sample 2 [10]: 9.62293 ns (Double average-length word)
Sample 3 [15]: 15.9525 ns (Triple average-length word)
Sample 4 [255]: 506.83 ns (Sentence)
Testing optimized crc32 hash
Sample 0 [3]: 2.62107 ns (Half average-length word)
Sample 1 [5]: 3.73992 ns (Average-length word)
Sample 2 [10]: 5.97254 ns (Double average-length word)
Sample 3 [15]: 8.32176 ns (Triple average-length word)
Sample 4 [255]: 166.021 ns (Sentence)
For more information on how the tests were carried out, see main_speed_tests.cpp and 00_run.sh.
Average execution times in nanoseconds with -O0:

Algorithm | 3 | 5 | 10 | 15 | 255 |
---|---|---|---|---|---|
Constant | 2.25 | 1.77 | 1.72 | 1.71 | 1.72 |
String length | 3.81 | 3.96 | 3.75 | 4.04 | 7.25 |
First character | 3.36 | 3.33 | 3.34 | 3.33 | 3.33 |
Characters sum | 8.88 | 13.29 | 32.90 | 35.16 | 571.75 |
Xor and right rotate | 12.14 | 18.87 | 35.65 | 52.24 | 875.86 |
Xor and left rotate | 12.86 | 19.40 | 36.14 | 52.76 | 858.10 |
Murmur3 | 14.38 | 19.32 | 24.45 | 30.59 | 386.84 |
CRC32 | 8.69 | 12.51 | 25.07 | 38.84 | 835.02 |
CRC32 optimized | 4.04 | 4.48 | 6.41 | 8.59 | 166.31 |
Average execution times in nanoseconds with -O3:

Algorithm | 3 | 5 | 10 | 15 | 255 |
---|---|---|---|---|---|
Constant | 1.28 | 1.21 | 1.21 | 1.19 | 1.19 |
String length | 4.07 | 3.19 | 3.42 | 3.11 | 5.97 |
First character | 2.36 | 2.33 | 2.37 | 2.35 | 2.37 |
Characters sum | 6.60 | 9.57 | 16.82 | 24.10 | 379.96 |
Xor and right rotate | 7.07 | 10.51 | 19.04 | 27.58 | 442.25 |
Xor and left rotate | 7.03 | 9.88 | 17.29 | 24.63 | 382.95 |
Murmur3 | 4.70 | 5.08 | 5.81 | 6.22 | 72.62 |
CRC32 | 3.30 | 4.29 | 9.62 | 15.95 | 506.83 |
CRC32 optimized | 2.62 | 3.74 | 5.97 | 8.32 | 166.02 |
CRC32 optimized is discussed later.
Clearly, the winners are Murmur3 and CRC32, for they have the best distribution results and execution times. And even though Murmur3 performs best on longer strings (due to it processing four bytes at a time), on shorter ones it sometimes runs even slower than CRC32.
The full dictionary contains 121199 words. Below are the tests for load factors equal to 0.75 and 0.95.
size1 = 121,199 / 0.75 = 161,598.67 or 161,599 (the nearest prime number to it)
size2 = 121,199 / 0.95 = 127,577.89 or 127,579 (the nearest prime number to it)
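The table sizes above can be derived with a small helper like this (a sketch; `nextPrime` is my own name, not necessarily the project's):

```cpp
#include <cstddef>

// Trial division is fine for a one-off size computation.
static bool isPrime(size_t n)
{
    if (n < 2) return false;
    for (size_t d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Smallest prime >= n.
size_t nextPrime(size_t n)
{
    while (!isPrime(n)) ++n;
    return n;
}

// size1 = nextPrime(121199 / 0.75 rounded up) -> 161599
// size2 = nextPrime(121199 / 0.95 rounded up) -> 127579
```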
The left rotate algorithm performs rather poorly here. In addition, the difference between the load factors isn't drastic for Murmur3 and CRC32, so 0.95 has been chosen in order to save some memory at almost no performance cost.
Choosing between Murmur3 and CRC32 is a bit trickier. Even though the latter runs slower on long strings, Murmur3 has a slightly worse standard deviation and, at optimization levels 1 and above, worse performance on short strings. Since the hash table is used for a dictionary with words ranging from 4 to 10 letters on average, and CRC32 has a built-in hardware implementation on Intel CPUs (as will be shown in the optimization part), I have chosen CRC32.
The help message looks like the following:
[Help]
These are commands for define:
-h, --help [show this message]
-w, --word [define just one word]
-d, --doc [specify input document]
-o, --output [specify output file]
As can be seen, the program supports two different modes:
Example input
$ ./define.out -w Surprise
Example output
Definitions for 'surprise'
1. [n] The act of coming upon, or taking, unawares; the act of seizing unexpectedly; surprisal; as, the fort was taken by surprise.
2. [n] The state of being surprised, or taken unawares, by some act or event which could not reasonably be foreseen; emotion excited by what is sudden and strange; a suddenly excited feeling of wonder or astonishment.
3. [n] Anything that causes such a state or emotion.
4. [n] A dish covered with a crust of raised paste, but with no other contents.
The idea is that you can hover over words and see their definitions. For example, if we take the first chapter (taken from here) of J.R.R. Tolkien's The Hobbit, the result will be something like this (note that there might be visual differences between web browsers):
$ ./define.out --doc hobbit_chapter1.txt -o hobbit_chapter1.html
The word hovered on is 'door'.
First, it is important to choose how to measure the performance of the hash table.
Let's first try to analyze the performance of the program on a .txt file containing the books "Harry Potter and the Chamber of Secrets", "Harry Potter and the Goblet of Fire", "Harry Potter and the Half-Blood Prince" and "Harry Potter and the Deathly Hallows" (30,684 lines and 2,646,069 characters in total).
Let's look at the profiler:
It's clear that the majority of CPU time is spent in I/O functions, which makes the test unrepresentative.
Here I have removed writing to the output file.
Let's look at the profiler again:
There are still functions at the top which aren't related to the hash table.
To minimize the time spent on loading and parsing the input file, I have found a file containing just 466,472 words (with no definitions whatsoever). The test consists of 5 insertions of all the words into the hash table and 200 searches for all of them. Let's look at the profiler:
This is much better! Now we can start optimizing the Hash Table.
Clearly, the hash function takes most of the time, so it's worth the effort to reduce its execution time.
GCC
g++ -S -DNDEBUG -O1 -masm=intel hash_functions.cpp -o hash_functions.s
produces the following assembly listing for getCrc32Hash():
_Z12getCrc32HashPKc:
.LFB47:
.cfi_startproc
movzx eax, BYTE PTR [rdi]
test al, al
je .L38
mov edx, 0
lea rsi, _ZL11CRC32_TABLE[rip]
.L37:
mov ecx, edx
sal ecx, 8
shr edx, 24
xor eax, edx
movzx eax, al
xor ecx, DWORD PTR [rsi+rax*4]
mov edx, ecx
add rdi, 1
movzx eax, BYTE PTR [rdi]
test al, al
jne .L37
.L35:
mov eax, edx
ret
.L38:
mov edx, 0
jmp .L35
.cfi_endproc
As has already been mentioned, there is a built-in crc32 assembly instruction, introduced with SSE4.2 (read this paper by Intel on the subject for more information). Note that the hardware instruction implements CRC32C (the Castagnoli polynomial), so its values differ from those of the table-based version, but for a hash table only the distribution matters. To incorporate it into my C code I used inline assembly (see the GNU documentation):
uint32_t getOPCrc32Hash(const char* string)
{
uint32_t hash = 0;
__asm__
(
".intel_syntax noprefix \n\t"
"xor rax, rax \n\t"
"xor r12, r12 \n\t"
" \n\t"
".LOOP_CRC32: \n\t"
" mov r12b, BYTE PTR [rdi] \n\t"
" test r12b, r12b \n\t"
" jz .LOOP_CRC32_END \n\t"
" \n\t"
" crc32 rax, r12b \n\t"
" \n\t"
" inc rdi \n\t"
" jmp .LOOP_CRC32 \n\t"
".LOOP_CRC32_END: \n\t"
".att_syntax \n\t"
:"=a"(hash)
:
:"r12", "rdi"
);
return hash;
}
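For reference, the same instruction is also exposed through SSE4.2 intrinsics, which avoids hand-written inline assembly and lets the compiler allocate registers itself. A sketch (my own alternative, not the project's code; remember the hardware instruction computes CRC32C, so the hash values differ from the table-based version):

```cpp
#include <immintrin.h>
#include <cstdint>

// Same byte-at-a-time loop as the inline-assembly version, but each step is
// a single hardware crc32 instruction emitted via the _mm_crc32_u8 intrinsic.
__attribute__((target("sse4.2")))
uint32_t getIntrinsicCrc32Hash(const char* string)
{
    uint32_t hash = 0;
    for (; *string != '\0'; ++string)
        hash = _mm_crc32_u8(hash, (unsigned char)*string);
    return hash;
}
```

The `target("sse4.2")` attribute lets this compile without passing -msse4.2 globally; the caller should still check CPU support at runtime.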
This has improved overall performance quite significantly, even with -O3. The file optimization_tests.txt contains comparison tests for the two versions and optimization flags -O0 through -O3 (in addition, it contains tests of the subsequent optimizations). Also note that each version of the program has been run 3 times to get a more precise average time.
optimization_tests.txt:
...
====Tests hash table only -O3====
g++ -o bin/intermediates/02_optimize/optimized/main_optimized.o -c src/02_optimize/optimized/main_optimized.cpp -O3 -mavx2 -march=native -DNDEBUG -DHASH_TABLE_ONLY
g++ -o bin/intermediates/file_loader.o -c src/file_loader.cpp -O3 -mavx2 -march=native -DNDEBUG -DHASH_TABLE_ONLY
g++ -o bin/intermediates/hash_table.o -c src/hash_table.cpp -O3 -mavx2 -march=native -DNDEBUG -DHASH_TABLE_ONLY
g++ -o bin/intermediates/bucket.o -c src/bucket.cpp -O3 -mavx2 -march=native -DNDEBUG -DHASH_TABLE_ONLY
g++ -o bin/intermediates/hash_functions.o -c src/hash_functions.cpp -O3 -mavx2 -march=native -DNDEBUG -DHASH_TABLE_ONLY
g++ -o bin/02_optimized.out -O3 bin/intermediates/02_optimize/optimized/main_optimized.o bin/intermediates/file_loader.o bin/intermediates/hash_table.o bin/intermediates/bucket.o bin/intermediates/hash_functions.o
Time: 11223.7 ms
Time: 11095.8 ms
Time: 11043.5 ms
...
====Tests crc32 optimized -O3====
g++ -o bin/intermediates/02_optimize/optimized/main_optimized.o -c src/02_optimize/optimized/main_optimized.cpp -O3 -mavx2 -march=native -DNDEBUG -DCRC32_OPTIMIZED
g++ -o bin/intermediates/file_loader.o -c src/file_loader.cpp -O3 -mavx2 -march=native -DNDEBUG -DCRC32_OPTIMIZED
g++ -o bin/intermediates/hash_table.o -c src/hash_table.cpp -O3 -mavx2 -march=native -DNDEBUG -DCRC32_OPTIMIZED
g++ -o bin/intermediates/bucket.o -c src/bucket.cpp -O3 -mavx2 -march=native -DNDEBUG -DCRC32_OPTIMIZED
g++ -o bin/intermediates/hash_functions.o -c src/hash_functions.cpp -O3 -mavx2 -march=native -DNDEBUG -DCRC32_OPTIMIZED
g++ -o bin/02_optimized.out -O3 bin/intermediates/02_optimize/optimized/main_optimized.o bin/intermediates/file_loader.o bin/intermediates/hash_table.o bin/intermediates/bucket.o bin/intermediates/hash_functions.o
Time: 9228.9 ms
Time: 9103.98 ms
Time: 9090.28 ms
...
Version | -O0 | -O1 | -O2 | -O3 |
---|---|---|---|---|
Not optimized | 17.37 | 11.21 | 11.19 | 11.12 |
CRC32 optimized | 10.53 | 9.30 | 9.30 | 9.14 |
Average time is given in seconds.
So we have increased performance by
- -O0: 65%
- -O1: 21%
- -O2: 20%
- -O3: 22%
And now the program's profile looks like the following:
The hash function now accounts for a smaller share of the total execution time.
The second most expensive function is the hash table's find().
GCC
g++ -S -DNDEBUG -O1 -masm=intel hash_table.cpp -o hash_table.s
produces the following assembly listing for find():
_Z4findPK9HashTablePKc:
.LFB51:
.cfi_startproc
push r14
.cfi_def_cfa_offset 16
.cfi_offset 14, -16
push r13
.cfi_def_cfa_offset 24
.cfi_offset 13, -24
push r12
.cfi_def_cfa_offset 32
.cfi_offset 12, -32
push rbp
.cfi_def_cfa_offset 40
.cfi_offset 6, -40
push rbx
.cfi_def_cfa_offset 48
.cfi_offset 3, -48
mov r13, rdi
mov r14, rsi
mov rdi, rsi
call [QWORD PTR 24[r13]]
mov eax, eax
mov edx, 0
div QWORD PTR 0[r13]
lea rdx, [rdx+rdx*2]
mov rax, QWORD PTR 8[r13]
lea r12, [rax+rdx*8]
cmp QWORD PTR 8[r12], 0
je .L24
mov ebp, 0
mov eax, 0
.L23:
sal rax, 5
mov rbx, rax
mov rax, QWORD PTR [r12]
mov rsi, QWORD PTR [rax+rbx]
mov rdi, r14
call [QWORD PTR 16[r13]]
test eax, eax
je .L27
add ebp, 1
mov eax, ebp
cmp rax, QWORD PTR 8[r12]
jb .L23
mov eax, 0
jmp .L20
.L27:
mov rax, rbx
add rax, QWORD PTR [r12]
add rax, 8
.L20:
pop rbx
.cfi_remember_state
.cfi_def_cfa_offset 40
pop rbp
.cfi_def_cfa_offset 32
pop r12
.cfi_def_cfa_offset 24
pop r13
.cfi_def_cfa_offset 16
pop r14
.cfi_def_cfa_offset 8
ret
.L24:
.cfi_restore_state
mov eax, 0
jmp .L20
.cfi_endproc
The code seems to perform too many memory accesses (including stack usage). So, by rewriting the function entirely in assembly, I managed to improve performance.
.globl _Z4findPK9HashTablePKc
.type _Z4findPK9HashTablePKc, @function
.intel_syntax noprefix
# ------------------------------------------------------------------------------
# Finds an element associated with key in hashTable.
#
# Expects: RDI = constant pointer to a HashTable
# RSI = constant pointer to a string (key)
#
# Returns: RAX = constant pointer to value (DictEntry) or nullptr if there is
# no element with key in hashTable.
# ------------------------------------------------------------------------------
_Z4findPK9HashTablePKc:
push r12
push rbx
mov rbx, rdi # rbx = hashTable
mov rdi, rsi
call QWORD PTR [rbx + 24] # rax = getHash(key)
xor rdx, rdx
div QWORD PTR [rbx] # rdx = rdx:rax % hashTable->size = hash
mov rax, QWORD PTR [rbx + 8] # rax = buckets
# sizeof(Bucket) = 24
lea rdx, [rdx + 2 * rdx]
lea rdx, [rax + 8 * rdx] # &(buckets[hash])
mov r12, QWORD PTR [rbx + 16] # r12 = cmp
mov rbx, QWORD PTR [rdx] # rbx = buckets[hash].data
mov rcx, QWORD PTR [rdx + 8] # rcx = buckets[hash].size
# rcx = rbx + 32 * rcx (sizeof(Pair)) = last + 1 bucket
lea rcx, [4 * rcx]
lea rcx, [rbx + 8 * rcx]
.LOOP_FIND_PAIR:
cmp rbx, rcx
jae .NOT_FOUND_PAIR
mov rdi, QWORD PTR [rbx]
call r12
test eax, eax
jz .FOUND_PAIR
add rbx, 32
jmp .LOOP_FIND_PAIR
.FOUND_PAIR:
lea rax, [rbx + 8]
pop rbx
pop r12
ret
.NOT_FOUND_PAIR:
xor rax, rax
pop rbx
pop r12
ret
Surprisingly, I managed to outperform even -O3's code again.
Version | -O0 | -O1 | -O2 | -O3 |
---|---|---|---|---|
Not optimized | 17.37 | 11.21 | 11.19 | 11.12 |
CRC32 + find() optimized | 8.80 | 8.57 | 8.53 | 8.45 |
Average time is given in seconds.
So we have increased performance (compared to the original version) by
- -O0: 97%
- -O1: 31%
- -O2: 31%
- -O3: 32%
And now the program's profile looks like the following:
The last function that takes up the majority of computing time is strcmp(). Clearly, strcmp_avx2() (which you can see in the previous diagrams) is already optimized compared to a straightforward implementation of strcmp(). So it seems there's nothing one can do in this situation...
But, keeping in mind that we store English words in the hash table, we can benefit from the words being comparatively short. Having looked at the dictionary, I found that no word is 32 or more characters long. This is why we can simply use YMM registers (introduced with AVX) to store words! Comparing words becomes much faster this way.
First, I changed the stored key type of the hash table to __m256i and changed the insert() and find() functions. I should note that these functions still take const char* as a parameter.
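The key conversion can be sketched like this (my own reconstruction of a strToYMM-style helper and the comparison it enables, not the project's exact code):

```cpp
#include <immintrin.h>
#include <cstring>

// Pad the word to 32 zero-filled bytes and load it into a YMM register.
// Assumes, as noted above, that no dictionary word is 32+ characters long;
// longer input would be truncated.
__attribute__((target("avx2")))
static __m256i strToYMM(const char* string)
{
    alignas(32) char buffer[32] = {};
    strncpy(buffer, string, sizeof(buffer) - 1);
    return _mm256_load_si256(reinterpret_cast<const __m256i*>(buffer));
}

// Two keys are equal iff all 32 bytes compare equal: vpcmpeqb + vpmovmskb,
// exactly as in the assembly loop below.
__attribute__((target("avx2")))
bool ymmKeysEqual(const char* a, const char* b)
{
    return _mm256_movemask_epi8(_mm256_cmpeq_epi8(strToYMM(a), strToYMM(b))) == -1;
}
```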
Alas, performance only got worse compared to CRC32.
Version | -O0 | -O1 | -O2 | -O3 |
---|---|---|---|---|
Not optimized | 17.37 | 11.21 | 11.19 | 11.12 |
CRC32 optimized | 10.53 | 9.30 | 9.30 | 9.14 |
CRC32 + AVX2 optimized | 17.13 | 11.07 | 10.99 | 10.95 |
Fortunately, after adapting the already rewritten find(), the performance improved.
.globl _Z4findPK9HashTablePKc
.type _Z4findPK9HashTablePKc, @function
.intel_syntax noprefix
# ------------------------------------------------------------------------------
# Finds an element associated with key in hashTable.
#
# Expects: RDI = constant pointer to a HashTable
# RSI = constant pointer to a string (key)
#
# Returns: RAX = constant pointer to value (DictEntry) or nullptr if there is
# no element with key in hashTable.
# ------------------------------------------------------------------------------
_Z4findPK9HashTablePKc:
push rbx
mov rbx, rdi # rbx = hashTable
mov rdi, rsi
call QWORD PTR [rbx + 24] # rax = getHash(key)
xor rdx, rdx
div QWORD PTR [rbx] # rdx = rdx:rax % hashTable->size = hash
mov rax, QWORD PTR [rbx + 8] # rax = buckets
# sizeof(Bucket) = 24
lea rdx, [rdx + 2 * rdx]
lea rdx, [rax + 8 * rdx] # &(buckets[hash])
mov rbx, QWORD PTR [rdx] # rbx = buckets[hash].data
mov rcx, QWORD PTR [rdx + 8] # rcx = buckets[hash].size
# rcx = rbx + 64 * rcx (sizeof(Pair)) = last + 1 bucket
lea rcx, [8 * rcx]
lea rcx, [rbx + 8 * rcx]
mov rdi, rsi
push rcx
call _Z8strToYMMPKc # ymm0 = strToYMM(key)
pop rcx
.LOOP_FIND_PAIR:
cmp rbx, rcx
jae .NOT_FOUND_PAIR
vpcmpeqb ymm1, ymm0, YMMWORD PTR [rbx]
vpmovmskb eax, ymm1
cmp eax, -1
je .FOUND_PAIR
add rbx, 64
jmp .LOOP_FIND_PAIR
.FOUND_PAIR:
lea rax, [rbx + 32]
pop rbx
ret
.NOT_FOUND_PAIR:
xor rax, rax
pop rbx
ret
Version | -O0 | -O1 | -O2 | -O3 |
---|---|---|---|---|
Not optimized | 17.37 | 11.21 | 11.19 | 11.12 |
CRC32 + AVX2 optimized | 17.13 | 11.07 | 10.99 | 10.95 |
CRC32 + find + AVX2 optimized | 8.36 | 7.72 | 7.51 | 7.48 |
Here is the overall table of execution time:
Version | -O0 | -O1 | -O2 | -O3 |
---|---|---|---|---|
Not optimized | 17.37 | 11.21 | 11.19 | 11.12 |
CRC32 optimized | 10.53 | 9.30 | 9.30 | 9.14 |
CRC32 + find() optimized | 8.80 | 8.57 | 8.53 | 8.45 |
CRC32 + AVX2 optimized | 17.13 | 11.07 | 10.99 | 10.95 |
CRC32 + find + AVX2 optimized | 8.36 | 7.72 | 7.51 | 7.48 |
or, in percentages relative to the original version:
Version | -O0 | -O1 | -O2 | -O3 |
---|---|---|---|---|
CRC32 optimized | 65 | 21 | 20 | 22 |
CRC32 + find() optimized | 97 | 31 | 31 | 32 |
CRC32 + AVX2 optimized | 1 | 1 | 2 | 2 |
CRC32 + find + AVX2 optimized | 107 | 45 | 49 | 49 |
Just out of curiosity, I checked the performance of the first program with and without the optimizations.
Version | -O0 | -O1 | -O2 | -O3 |
---|---|---|---|---|
Not optimized | 2.85 | 2.79 | 2.75 | 2.74 |
CRC32 + find + AVX2 optimized | 2.43 (17%) | 2.40 (16%) | 2.32 (18%) | 2.27 (21%) |
So, in the end, I have managed to boost the performance of the hash table by 107% (with -O0) and 49% (with -O3)!
The ded32 coefficient 😸 is
boost_coefficient / #asm_lines * 1000 = 2.07 / 50 * 1000 = 41.4 (-O0)
boost_coefficient / #asm_lines * 1000 = 1.49 / 50 * 1000 = 29.8 (-O3)