-
Notifications
You must be signed in to change notification settings - Fork 1
/
reduce.html
244 lines (210 loc) · 7.4 KB
/
reduce.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
<!doctype html>
<html>
<meta charset='utf-8'>
<title>reduce results</title>
<script src='https://cdn.plot.ly/plotly-latest.min.js'></script>
<script type='text/javascript' src='scripts/bench.js'></script>
</html>
<body>
<h1> reduce</h1>
`std::reduce/std::transform_reduce` are two algorithms consistenlty vectorized
by clang<br />
(regardless of the shape of the call - it can be written by hand or via
std::accumulate). <br />
Because of this there is no scalar baseline.
<br /> <br />
There are two interesting variations on reduce:<br />
a) We are reducing to the same type as the element type. <br />
b) We are reducing to a bigger type to deal with a potential overflow.
<br />
<h2> Reducing to the same type </h2>
This is relatively straight forward.<br />
The best results is achived when unrolling 4 times. <br />
In this case we are consistenlty massively winning - at most 5.6 times for
1000 bytes of chars. <br />
I don't know why the big difference in behaviour of `std::reduce` - the
assembly looks <a href="https://godbolt.org/z/tskjMs">identical</a> <br />
and I'm controlling for code alignment. <br />
The only idea I have is that punishment for loop peeling for chars is this
big, <br />
I have seen mutliple times quite poor perfromance of scalar code for chars.
<br /><br />
With respect to code alignment, `unsq_eve` version cares very little - at most
15% for 1000 bytes of chars. <br />
`std::reduce` - not so lucky - at 10k swings up to 44% and this does not seem
to be the scalar code's code fault. <br />
<h4> reducing to the same type, data </h4>
<div id='reduce_same_type'></div>
<script>dynamicEntryPoint('reduce_same_type', {
name: 'sum',
size: 'x',
algorithm: 'selection',
type: 'selection',
sum_type: 'selection',
time: 'y',
padding: 'min',
group: 'intel_9700K_avx2_bmi',
percentage: 100,
}, ['eve::algo::reduce/char/char', 'eve::algo::reduce/short/short', 'eve::algo::reduce/int/int',
'std::reduce/char/char', 'std::reduce/short/short', 'std::reduce/int/int']);
</script>
<h4> reducing to the same type, code alignment, unsq_eve </h4>
<div id='reduce_same_type_code_alignment_unsq_eve'></div>
<script>dynamicEntryPoint('reduce_same_type_code_alignment_unsq_eve', {
name: 'sum',
size: 'x',
algorithm: 'selection',
type: 'selection',
sum_type: 'selection',
time: 'y',
padding: 'minmax',
group: 'intel_9700K_avx2_bmi',
percentage: 100,
}, ['eve::algo::reduce/char/char', 'eve::algo::reduce/short/short', 'eve::algo::reduce/int/int']);
</script>
<h4> reducing to the same type, code alignment, `std::reduce` </h4>
<div id='reduce_same_type_code_alignment_std_reduce'></div>
<script>dynamicEntryPoint('reduce_same_type_code_alignment_std_reduce', {
name: 'sum',
size: 10000,
algorithm: 'selection',
type: 'selection',
sum_type: 'selection',
time: 'y',
padding: 'minmax',
group: 'intel_9700K_avx2_bmi',
percentage: 100,
}, ['std::reduce/char/char', 'std::reduce/short/short', 'std::reduce/int/int']);
</script>
<h2> Reducing to a different type </h2>
When reducing to a different type, we need to somehow convert from the <br />
array type, to the type we want to do our operations in. <br />
`std::reduce` for this generates a really nice <a
href="https://godbolt.org/z/fXLh8T">assembly</a> <br />
(chars reducing to shorts).
<br /><br />
```<br />
vpmovsxbw ymm4, xmmword ptr [rdi + rdx] <br />
vpaddw ymm0, ymm0, ymm4<br />
vpmovsxbw ymm4, xmmword ptr [rdi + rdx + 16]<br />
vpaddw ymm1, ymm1, ymm4<br />
vpmovsxbw ymm4, xmmword ptr [rdi + rdx + 32]<br />
```
<br /><br />
This is essentially: `_mm_cvtepi8_epi16 ` called directly on the address -
<br /><br />
I do the same trick in eve by loading a smaller `eve::wide` and <br />
then calling `eve::convert` on it.
We end up winning primarally because of not peeling the loops. <br />
<h3> reducing 40 bytes </h3>
On 40 bytes not peeling loops gives the most effect.<br />
adding chars to `short` is 3.3 times faster and adding to `int` ~30%. <br />
adding shorts to `int` is 1.7 times faster. <br />
<h4> reducing chars 40 bytes data </h4>
<div id='reduce_chars_40_bytes'></div>
<script>dynamicEntryPoint('reduce_chars_40_bytes', {
name: 'sum',
size: 40,
algorithm: 'selection',
type: 'selection',
sum_type: 'x',
time: 'y',
padding: 'min',
group: 'intel_9700K_avx2_bmi',
percentage: 100,
}, ['std::reduce/char', 'eve::algo::reduce/char']);
</script>
<h4> reducing shorts 40 bytes data </h4>
<div id='reduce_shorts_40_bytes'></div>
<script>dynamicEntryPoint('reduce_shorts_40_bytes', {
name: 'sum',
size: 40,
algorithm: 'selection',
type: 'selection',
sum_type: 'x',
time: 'y',
padding: 'min',
group: 'intel_9700K_avx2_bmi',
percentage: 100,
}, ['std::reduce/short', 'eve::algo::reduce/short']);
</script>
<h3> reducing 1000 bytes </h3>
On 1000 bytes not peeling loops still has an effect, though not that big.
<br />
adding chars to `short` is 1.5 times faster and adding to `int` is roughly the
same.
<br />
adding shorts to `int` is 1.13 times faster.
<br /> <br />
<h4> reducing chars 1000 bytes data </h4>
<div id='reduce_chars_1000_bytes'></div>
<script>dynamicEntryPoint('reduce_chars_1000_bytes', {
name: 'sum',
size: 1000,
algorithm: 'selection',
type: 'selection',
sum_type: 'x',
time: 'y',
padding: 'min',
group: 'intel_9700K_avx2_bmi',
percentage: 100,
}, ['std::reduce/char', 'eve::algo::reduce/char']);
</script>
<h4> reducing shorts 1000 bytes data </h4>
<div id='reduce_shorts_1000_bytes'></div>
<script>dynamicEntryPoint('reduce_shorts_1000_bytes', {
name: 'sum',
size: 1000,
algorithm: 'selection',
type: 'selection',
sum_type: 'x',
time: 'y',
padding: 'min',
group: 'intel_9700K_avx2_bmi',
percentage: 100,
}, ['std::reduce/short', 'eve::algo::reduce/short']);
</script>
<h3> reducing 10000 bytes </h3>
On 10'000 bytes loop peeling stops being important.<br /> <br />
<h4> reducing chars 1000 bytes data </h4>
<div id='reduce_chars_10000_bytes'></div>
<script>dynamicEntryPoint('reduce_chars_10000_bytes', {
name: 'sum',
size: 10000,
algorithm: 'selection',
type: 'selection',
sum_type: 'x',
time: 'y',
padding: 'min',
group: 'intel_9700K_avx2_bmi',
percentage: 100,
}, ['std::reduce/char', 'eve::algo::reduce/char']);
</script>
<div id='reduce_shorts_10000_bytes'></div>
<script>dynamicEntryPoint('reduce_shorts_10000_bytes', {
name: 'sum',
size: 10000,
algorithm: 'selection',
type: 'selection',
sum_type: 'x',
time: 'y',
padding: 'min',
group: 'intel_9700K_avx2_bmi',
percentage: 100,
}, ['std::reduce/short', 'eve::algo::reduce/short']);
</script>
<h2> Total benchmark </h2>
<div id='total_benchmark'></div>
<script>dynamicEntryPoint('total_benchmark', {
name: 'sum',
size: 'x',
algorithm: 'selection',
type: 'selection',
sum_type: 'selection',
time: 'y',
padding: 'min',
group: 'intel_9700K_avx2_bmi',
percentage: 100,
}, ['reduce']);
</script>
</body>