<!DOCTYPE html>
<head>
<meta charset="utf-8">
<title>Store Halfword Byte-Reverse Indexed</title>
<meta name="author" content="OzLabs">
<link href="https://sthbrx.github.io/rss.xml" type="application/rss+xml" rel="alternate"
title="Store Halfword Byte-Reverse Indexed RSS Feed" />
<!-- http://t.co/dKP3o1e -->
<meta name="HandheldFriendly" content="True">
<meta name="MobileOptimized" content="320">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link href="https://sthbrx.github.io/favicon.png" rel="icon">
<link href="https://sthbrx.github.io/theme/css/main.css" media="screen, projection"
rel="stylesheet" type="text/css">
<link href="//fonts.googleapis.com/css?family=PT+Serif:regular,italic,bold,bolditalic"
rel="stylesheet" type="text/css">
<link href="//fonts.googleapis.com/css?family=PT+Sans:regular,italic,bold,bolditalic"
rel="stylesheet" type="text/css">
<script type="text/javascript">
document.addEventListener('DOMContentLoaded', function() {
var ts = document.createElement('span')
ts.className = 'toggle-sidebar'
ts = document.getElementById('content').appendChild(ts);
ts.addEventListener('click', function(e) {
e.preventDefault();
body = document.getElementsByTagName('body')[0];
bodyClasses = body.classList.toggle('collapse-sidebar');
});
var sections = document.querySelectorAll('aside.sidebar > section');
if (sections.length > 1) {
for (index = 0; index < sections.length; index++) {
section = sections[index];
if ((sections.length >= 3) && index % 3 === 0) {
section.classList.add("first");
}
var count = ((index +1) % 2) ? "odd" : "even";
section.classList.add(count);
}
}
if (sections.length >= 3) {
document.querySelector('aside.sidebar').classList.add('thirds');
}
});
</script>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-91189608-1', 'auto');
ga('send', 'pageview');
</script>
</head>
<body>
<header role="banner"><hgroup>
<h1><a href="https://sthbrx.github.io/">Store Halfword Byte-Reverse Indexed</a></h1>
<h2>A Power Technical Blog</h2>
</hgroup></header>
<nav role="navigation"><ul class="subscription" data-subscription="rss">
<li><a href="https://sthbrx.github.io/rss.xml" rel="subscribe-rss">RSS</a></li>
</ul>
<ul class="main-navigation">
<li >
<a href="https://sthbrx.github.io/category/cryptography.html">Cryptography</a>
</li>
<li >
<a href="https://sthbrx.github.io/category/development.html">Development</a>
</li>
<li >
<a href="https://sthbrx.github.io/category/education.html">Education</a>
</li>
<li >
<a href="https://sthbrx.github.io/category/openpower.html">OpenPOWER</a>
</li>
<li >
<a href="https://sthbrx.github.io/category/performance.html">Performance</a>
</li>
<li >
<a href="https://sthbrx.github.io/category/petitboot.html">Petitboot</a>
</li>
<li >
<a href="https://sthbrx.github.io/category/snowpatch.html">snowpatch</a>
</li>
<li >
<a href="https://sthbrx.github.io/category/virtualisation-and-emulation.html">Virtualisation and Emulation</a>
</li>
</ul></nav>
<div id="main">
<div id="content">
<div class="blog-index">
<article>
<header>
<h1 class="entry-title">
<a href="https://sthbrx.github.io/blog/2017/03/29/evaluating-cephfs-on-power/">Evaluating CephFS on Power</a>
</h1>
<p class="meta">
<time datetime="2017-03-29T00:00:00+11:00" pubdate>Wed 29 March 2017</time> </p>
</header>
<div class="byline_index">
<span class="byline author vcard">
Posted by <span class="fn">
<a href="https://sthbrx.github.io/author/alastair-dsilva.html">Alastair D'Silva</a>
</span>
</span>
<time datetime="2017-03-29T00:00:00+11:00" pubdate>Wed 29 March 2017</time></div>
<div class="entry-content"><h2>Methodology</h2>
<p>To evaluate CephFS, we will create a ppc64le virtual machine, with sufficient
space to compile the software, as well as 3 sparse 1TB disks to create the
object store.</p>
<p>We will then build and install the Ceph packages after adding the PowerPC
optimisations to the code. We do this because ceph-deploy would otherwise fetch
prebuilt packages that do not have the performance patches.</p>
<p>Finally, we will use ceph-deploy to deploy the instance. We will install
ceph-deploy via pip to avoid file conflicts with the packages that we built.</p>
<p>For more information on what each command does, visit the following tutorial,
upon which this is based:
<a href="http://palmerville.github.io/2016/04/30/single-node-ceph-install.html">http://palmerville.github.io/2016/04/30/single-node-ceph-install.html</a></p>
<h3>Virtual Machine Config</h3>
<p>Create a virtual machine with at least the following:
- 16GB of memory
- 16 CPUs
- 64GB disk for the root filesystem
- 3 x 1TB for the Ceph object store
- Ubuntu 16.04 default install (only use the 64GB disk, leave the others unpartitioned)</p>
<h3>Initial config</h3>
<ul>
<li>Enable ssh</li>
</ul>
<div class="highlight"><pre><span></span><code> sudo apt install openssh-server
sudo apt update
sudo apt upgrade
sudo reboot
</code></pre></div>
<ul>
<li>Install build tools</li>
</ul>
<div class="highlight"><pre><span></span><code> sudo apt install git debhelper
</code></pre></div>
<h3>Build Ceph</h3>
<ul>
<li>Clone the Ceph repo by following the instructions here: <a href="http://docs.ceph.com/docs/master/install/clone-source/">http://docs.ceph.com/docs/master/install/clone-source/</a></li>
</ul>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="nv">mkdir</span><span class="w"> </span>$<span class="nv">HOME</span><span class="o">/</span><span class="nv">src</span>
<span class="w"> </span><span class="nv">cd</span><span class="w"> </span>$<span class="nv">HOME</span><span class="o">/</span><span class="nv">src</span>
<span class="w"> </span><span class="nv">git</span><span class="w"> </span><span class="nv">clone</span><span class="w"> </span><span class="o">--</span><span class="nv">recursive</span><span class="w"> </span><span class="nv">https</span>:<span class="o">//</span><span class="nv">github</span>.<span class="nv">com</span><span class="o">/</span><span class="nv">ceph</span><span class="o">/</span><span class="nv">ceph</span>.<span class="nv">git</span><span class="w"> </span>#<span class="w"> </span><span class="nv">This</span><span class="w"> </span><span class="nv">may</span><span class="w"> </span><span class="nv">take</span><span class="w"> </span><span class="nv">a</span><span class="w"> </span><span class="k">while</span>
<span class="w"> </span><span class="nv">cd</span><span class="w"> </span><span class="nv">ceph</span>
<span class="w"> </span><span class="nv">git</span><span class="w"> </span><span class="nv">checkout</span><span class="w"> </span><span class="nv">master</span>
<span class="w"> </span><span class="nv">git</span><span class="w"> </span><span class="nv">submodule</span><span class="w"> </span><span class="nv">update</span><span class="w"> </span><span class="o">--</span><span class="nv">force</span><span class="w"> </span><span class="o">--</span><span class="nv">init</span><span class="w"> </span><span class="o">--</span><span class="nv">recursive</span>
</code></pre></div>
<ul>
<li>Cherry-pick the Power performance patches:</li>
</ul>
<div class="highlight"><pre><span></span><code> git remote add kestrels https://github.com/kestrels/ceph.git
git fetch --all
git cherry-pick 59bed55a676ebbe3ad97d8ec005c2088553e4e11
</code></pre></div>
<ul>
<li>Install prerequisites</li>
</ul>
<div class="highlight"><pre><span></span><code> ./install-deps.sh
sudo apt install python-requests python-flask resource-agents curl python-cherrypy python3-pip python-django python-dateutil python-djangorestframework
sudo pip3 install ceph-deploy
</code></pre></div>
<ul>
<li>Build the packages as per the instructions: <a href="http://docs.ceph.com/docs/master/install/build-ceph/">http://docs.ceph.com/docs/master/install/build-ceph/</a></li>
</ul>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="nx">cd</span><span class="w"> </span><span class="err">$</span><span class="nx">HOME</span><span class="o">/</span><span class="nx">src</span><span class="o">/</span><span class="nx">ceph</span>
<span class="w"> </span><span class="nx">sudo</span><span class="w"> </span><span class="nx">dpkg</span><span class="o">-</span><span class="nx">buildpackage</span><span class="w"> </span><span class="o">-</span><span class="nx">J</span><span class="err">$</span><span class="p">(</span><span class="nx">nproc</span><span class="p">)</span><span class="w"> </span><span class="err">#</span><span class="w"> </span><span class="nx">This</span><span class="w"> </span><span class="nx">will</span><span class="w"> </span><span class="nx">take</span><span class="w"> </span><span class="nx">a</span><span class="w"> </span><span class="nx">couple</span><span class="w"> </span><span class="nx">of</span><span class="w"> </span><span class="nx">hours</span><span class="w"> </span><span class="p">(</span><span class="mi">16</span><span class="w"> </span><span class="nx">cpus</span><span class="p">)</span>
</code></pre></div>
<ul>
<li>Install the packages (note that python3-ceph-argparse will fail, but is safe to ignore)</li>
</ul>
<div class="highlight"><pre><span></span><code> cd $HOME/src
sudo dpkg -i *.deb
</code></pre></div>
<h3>Create the ceph-deploy user</h3>
<div class="highlight"><pre><span></span><code> sudo adduser ceph-deploy
echo "ceph-deploy ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/ceph-deploy
sudo chmod 0440 /etc/sudoers.d/ceph-deploy
</code></pre></div>
<h3>Configure the ceph-deploy user environment</h3>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="n">su</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">ceph</span><span class="o">-</span><span class="n">deploy</span>
<span class="w"> </span><span class="n">ssh</span><span class="o">-</span><span class="n">keygen</span>
<span class="w"> </span><span class="n">node</span><span class="o">=</span><span class="n n-Quoted">`hostname`</span>
<span class="w"> </span><span class="n">ssh</span><span class="o">-</span><span class="n">copy</span><span class="o">-</span><span class="n">id</span><span class="w"> </span><span class="n">ceph</span><span class="o">-</span><span class="n">deploy</span><span class="nv">@$node</span>
<span class="w"> </span><span class="n">mkdir</span><span class="w"> </span><span class="n">$HOME</span><span class="o">/</span><span class="n">ceph</span><span class="o">-</span><span class="n">cluster</span>
<span class="w"> </span><span class="n">cd</span><span class="w"> </span><span class="n">$HOME</span><span class="o">/</span><span class="n">ceph</span><span class="o">-</span><span class="n">cluster</span>
<span class="w"> </span><span class="n">ceph</span><span class="o">-</span><span class="n">deploy</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">$node</span><span class="w"> </span><span class="c1"># If this fails, remove the bogus 127.0.1.1 entry from /etc/hosts</span>
<span class="w"> </span><span class="n">echo</span><span class="w"> </span><span class="s1">'osd pool default size = 2'</span><span class="w"> </span><span class="o">>></span><span class="w"> </span><span class="n">ceph</span><span class="p">.</span><span class="n">conf</span>
<span class="w"> </span><span class="n">echo</span><span class="w"> </span><span class="s1">'osd crush chooseleaf type = 0'</span><span class="w"> </span><span class="o">>></span><span class="w"> </span><span class="n">ceph</span><span class="p">.</span><span class="n">conf</span>
</code></pre></div>
<h3>Complete the Ceph deployment</h3>
<div class="highlight"><pre><span></span><code><span class="w"> </span>ceph-deploy<span class="w"> </span>install<span class="w"> </span><span class="nv">$node</span>
<span class="w"> </span>ceph-deploy<span class="w"> </span>mon<span class="w"> </span>create-initial
<span class="w"> </span>drives="vda<span class="w"> </span>vdb<span class="w"> </span>vdc"<span class="w"> </span>#<span class="w"> </span>the<span class="w"> </span>1TB<span class="w"> </span>drives<span class="w"> </span>-<span class="w"> </span>check<span class="w"> </span>that<span class="w"> </span>these<span class="w"> </span>are<span class="w"> </span>correct<span class="w"> </span>for<span class="w"> </span>your<span class="w"> </span>system
<span class="w"> </span>for<span class="w"> </span>drive<span class="w"> </span>in<span class="w"> </span><span class="nv">$drives</span>;<span class="w"> </span>do<span class="w"> </span>ceph-deploy<span class="w"> </span>disk<span class="w"> </span>zap<span class="w"> </span><span class="nv">$node</span>:<span class="nv">$drive</span>;<span class="w"> </span>ceph-deploy<span class="w"> </span>osd<span class="w"> </span>prepare<span class="w"> </span><span class="nv">$node</span>:<span class="nv">$drive</span>;<span class="w"> </span>done
<span class="w"> </span>for<span class="w"> </span>drive<span class="w"> </span>in<span class="w"> </span><span class="nv">$drives</span>;<span class="w"> </span>do<span class="w"> </span>ceph-deploy<span class="w"> </span>osd<span class="w"> </span>activate<span class="w"> </span><span class="nv">$node</span>:/dev/<span class="cp">${</span><span class="n">drive</span><span class="cp">}</span>1;<span class="w"> </span>done
<span class="w"> </span>ceph-deploy<span class="w"> </span>admin<span class="w"> </span><span class="nv">$node</span>
<span class="w"> </span>sudo<span class="w"> </span>chmod<span class="w"> </span>+r<span class="w"> </span>/etc/ceph/ceph.client.admin.keyring
<span class="w"> </span>ceph<span class="w"> </span>-s<span class="w"> </span>#<span class="w"> </span>Check<span class="w"> </span>the<span class="w"> </span>state<span class="w"> </span>of<span class="w"> </span>the<span class="w"> </span>cluster
</code></pre></div>
<h3>Configure CephFS</h3>
<div class="highlight"><pre><span></span><code> ceph-deploy mds create $node
ceph osd pool create cephfs_data 128
ceph osd pool create cephfs_metadata 128
ceph fs new cephfs cephfs_metadata cephfs_data
sudo systemctl status ceph\*.service ceph\*.target # Ensure the ceph-osd, ceph-mon & ceph-mds daemons are running
sudo mkdir /mnt/cephfs
key=`grep key ~/ceph-cluster/ceph.client.admin.keyring | cut -d ' ' -f 3`
sudo mount -t ceph $node:6789:/ /mnt/cephfs -o name=admin,secret=$key
</code></pre></div>
<h2>References</h2>
<ol>
<li><a href="http://docs.ceph.com/docs/master/install/clone-source/">http://docs.ceph.com/docs/master/install/clone-source/</a></li>
<li><a href="http://docs.ceph.com/docs/master/install/build-ceph/">http://docs.ceph.com/docs/master/install/build-ceph/</a></li>
<li><a href="http://palmerville.github.io/2016/04/30/single-node-ceph-install.html">http://palmerville.github.io/2016/04/30/single-node-ceph-install.html</a></li>
</ol></div>
</article>
<article>
<header>
<h1 class="entry-title">
<a href="https://sthbrx.github.io/blog/2017/03/24/erasure-coding-for-programmers-part-2/">Erasure Coding for Programmers, Part 2</a>
</h1>
<p class="meta">
<time datetime="2017-03-24T10:08:00+11:00" pubdate>Fri 24 March 2017</time> </p>
</header>
<div class="byline_index">
<span class="byline author vcard">
Posted by <span class="fn">
<a href="https://sthbrx.github.io/author/daniel-axtens.html">Daniel Axtens</a>
</span>
</span>
<time datetime="2017-03-24T10:08:00+11:00" pubdate>Fri 24 March 2017</time></div>
<div class="entry-content"><p>We left <a href="/blog/2017/03/20/erasure-coding-for-programmers-part-1/">part 1</a> having explored GF(2^8) and RAID 6, and asking the question "what does all this have to do with Erasure Codes?"</p>
<p>Basically, the thinking goes "RAID 6 is cool, but what if, instead of two parity disks, we had an arbitrary number of parity disks?"</p>
<p>How would we do that? Well, let's introduce our new best friend: Coding Theory!</p>
<p>Say we want to transmit some data across an error-prone medium. We don't know where the errors might occur, so we add some extra information to allow us to detect and possibly correct for errors. This is a code. Codes are a largish field of engineering, but rather than show off my knowledge about systematic linear block codes, let's press on.</p>
<p>Today, our error-prone medium is an array of inexpensive disks. Now we make this really nice assumption about disks, namely that they are either perfectly reliable or completely missing. In other words, we consider that a disk will either be present or 'erased'. We come up with 'erasure codes' that are able to reconstruct data when it is known to be missing. (This is a slightly different problem to being able to verify and correct data that might or might not be subtly corrupted. Disks also have to deal with this problem, but it is <em>not</em> something erasure codes address!)</p>
<p>The particular code we use is a Reed-Solomon code. The specific details are unimportant, but there's a really good graphical outline of the broad concepts in sections 1 and 3 of <a href="http://jerasure.org/jerasure-2.0/">the Jerasure paper/manual</a>. (Don't go on to section 4.)</p>
<p>That should give you some background on how this works at a pretty basic mathematical level. Implementation is a matter of mapping that maths (matrix multiplication) onto hardware primitives, and making it go fast.</p>
<h2>Scope</h2>
<p>I'm deliberately <em>not</em> covering some pretty vast areas of what would be required to write your own erasure coding library from scratch. I'm not going to talk about how to compose the matrices, how to invert them, or anything like that. I'm not sure how that would be a helpful exercise - ISA-L and jerasure already exist and do that for you.</p>
<p>What I want to cover is an efficient implementation of some of the algorithms, once you have the matrices nailed down.</p>
<p>I'm also going to assume your library already provides a generic multiplication function in GF(2^8). That's required to construct the matrices, so it's a pretty safe assumption.</p>
<h2>The beginnings of an API</h2>
<p>Let's make this a bit more concrete.</p>
<p>This will be heavily based on the <a href="https://01.org/intel%C2%AE-storage-acceleration-library-open-source-version/documentation/isa-l-open-source-api">ISA-L API</a> but you probably want to plug into ISA-L anyway, so that shouldn't be a problem.</p>
<p>What I want to do is build up from very basic algorithmic components into something useful.</p>
<p>The first thing we want to be able to do is Galois Field multiplication of an entire region of bytes by an arbitrary constant.</p>
<p>We basically want <code>gf_vect_mul(size_t len, <something representing the constant>, unsigned char * src, unsigned char * dest)</code></p>
<h3>Simple and slow approach</h3>
<p>The simplest way is to do something like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">void</span><span class="w"> </span><span class="n">gf_vect_mul_simple</span><span class="p">(</span><span class="n">size_t</span><span class="w"> </span><span class="nf">len</span><span class="p">,</span><span class="w"> </span><span class="n">unsigned</span><span class="w"> </span><span class="nc">char</span><span class="w"> </span><span class="n">c</span><span class="p">,</span><span class="w"> </span><span class="n">unsigned</span><span class="w"> </span><span class="nc">char</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">src</span><span class="p">,</span><span class="w"> </span><span class="n">unsigned</span><span class="w"> </span><span class="nc">char</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">dest</span><span class="p">)</span><span class="w"> </span><span class="err">{</span>
<span class="w"> </span><span class="n">size_t</span><span class="w"> </span><span class="n">i</span><span class="p">;</span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o"><</span><span class="nf">len</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="err">{</span>
<span class="w"> </span><span class="n">dest</span><span class="o">[</span><span class="n">i</span><span class="o">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gf_mul</span><span class="p">(</span><span class="n">c</span><span class="p">,</span><span class="w"> </span><span class="n">src</span><span class="o">[</span><span class="n">i</span><span class="o">]</span><span class="p">);</span>
<span class="w"> </span><span class="err">}</span>
<span class="err">}</span>
</code></pre></div>
<p>That does multiplication element by element using the library's supplied <code>gf_mul</code> function, which - as the name suggests - does GF(2^8) multiplication of a scalar by a scalar.</p>
<p>This works. The problem is that it is very, painfully, slow - in the order of a few hundred megabytes per second.</p>
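<p>(If you're curious what such a <code>gf_mul</code> might look like internally, here is a minimal, unoptimised sketch, assuming the 0x11d reduction polynomial used in the RAID 6 paper. Your library's version may well differ - this is purely illustrative.)</p>
<div class="highlight"><pre><span></span><code>/* Illustrative scalar GF(2^8) multiply: shift-and-add ("Russian peasant")
 * with reduction by x^8 + x^4 + x^3 + x^2 + 1 (0x11d).
 * Hypothetical sketch, not the library implementation. */
unsigned char gf_mul(unsigned char a, unsigned char b)
{
    unsigned char p = 0;
    while (b) {
        if (b & 1)
            p ^= a;              /* add a into the product (addition is XOR) */
        if (a & 0x80)
            a = (a << 1) ^ 0x1d; /* multiply a by x and reduce */
        else
            a <<= 1;
        b >>= 1;
    }
    return p;
}
</code></pre></div>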
<h3>Going faster</h3>
<p>How can we make this faster?</p>
<p>There are a few things we can try: if you want to explore a whole range of different ways to do this, check out the <a href="http://jerasure.org/gf-complete-1.02/">gf-complete</a> project. I'm going to assume we want to skip right to the end and use the fastest approach we've found.</p>
<p>Cast your mind back to the <a href="https://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf">RAID 6 paper</a> (PDF) I talked about in <a href="/blog/2017/03/20/erasure-coding-for-programmers-part-1/">part 1</a>. That had a way of doing an efficient multiplication in GF(2^8) using vector instructions.</p>
<p>To refresh your memory, we split the multiplication into two parts - low bits and high bits, looked them up separately in a lookup table, and joined them with XOR. We then discovered that on modern Power chips, we could do that in one instruction with <code>vpermxor</code>.</p>
<p>So, a very simple way to do this would be:</p>
<ul>
<li>generate the table for <code>a</code></li>
<li>for each 16-byte chunk of our input:<ul>
<li>load the input</li>
<li>do the <code>vpermxor</code> with the table</li>
<li>save it out</li>
</ul>
</li>
</ul>
<p>Generating the tables is reasonably straightforward, in theory. Recall that the tables are <code>a</code> * {{00},{01},...,{0f}} and <code>a</code> * {{00},{10},...,{f0}} - a couple of loops in C will generate them without difficulty. ISA-L has a function to do this, as does gf-complete in split-table mode, so I won't repeat them here.</p>
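<p>For illustration only, here is a minimal, hypothetical sketch of such a table generator, built on the scalar <code>gf_mul</code> from earlier. The real ISA-L and gf-complete routines differ in their details, and the exact ordering <code>vpermxor</code> expects is one of the quirks mentioned below.</p>
<div class="highlight"><pre><span></span><code>/* Hypothetical sketch: build the two 16-byte lookup tables for multiplication
 * by the constant a, concatenated into one 32-byte buffer. */
void gf_vect_mul_table_simple(unsigned char a, unsigned char *table /* 32 bytes */)
{
    int i;
    for (i = 0; i < 16; i++) {
        table[i]      = gf_mul(a, (unsigned char)i);          /* a * {00,01,...,0f} */
        table[i + 16] = gf_mul(a, (unsigned char)(i << 4));   /* a * {00,10,...,f0} */
    }
}
</code></pre></div>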
<p>So, let's recast our function to take the tables as an input rather than the constant <code>a</code>. Assume we're provided the two tables concatenated into one 32-byte chunk. That would give us:</p>
<div class="highlight"><pre><span></span><code>void gf_vect_mul_v2(size_t len, unsigned char * table, unsigned char * src, unsigned char * dest)
</code></pre></div>
<p>Here's how you would do it in C:</p>
<div class="highlight"><pre><span></span><code><span class="nv">void</span><span class="w"> </span><span class="nv">gf_vect_mul_v2</span><span class="ss">(</span><span class="nv">size_t</span><span class="w"> </span><span class="nv">len</span>,<span class="w"> </span><span class="nv">unsigned</span><span class="w"> </span><span class="nv">char</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nv">table</span>,<span class="w"> </span><span class="nv">unsigned</span><span class="w"> </span><span class="nv">char</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nv">src</span>,<span class="w"> </span><span class="nv">unsigned</span><span class="w"> </span><span class="nv">char</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nv">dest</span><span class="ss">)</span><span class="w"> </span>{
<span class="w"> </span><span class="nv">vector</span><span class="w"> </span><span class="nv">unsigned</span><span class="w"> </span><span class="nv">char</span><span class="w"> </span><span class="nv">tbl1</span>,<span class="w"> </span><span class="nv">tbl2</span>,<span class="w"> </span><span class="nv">in</span>,<span class="w"> </span><span class="nv">out</span><span class="c1">;</span>
<span class="w"> </span><span class="nv">size_t</span><span class="w"> </span><span class="nv">i</span><span class="c1">;</span>
<span class="w"> </span><span class="cm">/* Assume table, src, dest are aligned and len is a multiple of 16 */</span>
<span class="w"> </span><span class="nv">tbl1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nv">vec_ld</span><span class="ss">(</span><span class="mi">16</span>,<span class="w"> </span><span class="nv">table</span><span class="ss">)</span><span class="c1">;</span>
<span class="w"> </span><span class="nv">tbl2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nv">vec_ld</span><span class="ss">(</span><span class="mi">0</span>,<span class="w"> </span><span class="nv">table</span><span class="ss">)</span><span class="c1">;</span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="ss">(</span><span class="nv">i</span><span class="o">=</span><span class="mi">0</span><span class="c1">; i<len; i+=16) {</span>
<span class="w"> </span><span class="nv">in</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nv">vec_ld</span><span class="ss">(</span><span class="nv">i</span>,<span class="w"> </span><span class="ss">(</span><span class="nv">unsigned</span><span class="w"> </span><span class="nv">char</span><span class="w"> </span><span class="o">*</span><span class="ss">)</span><span class="nv">src</span><span class="ss">)</span><span class="c1">;</span>
<span class="w"> </span><span class="nv">__asm__</span><span class="ss">(</span><span class="s2">"vpermxor %0, %1, %2, %3"</span><span class="w"> </span>:<span class="w"> </span><span class="s2">"=v"</span><span class="ss">(</span><span class="nv">out</span><span class="ss">)</span><span class="w"> </span>:<span class="w"> </span><span class="s2">"v"</span><span class="ss">(</span><span class="nv">tbl1</span><span class="ss">)</span>,<span class="w"> </span><span class="s2">"v"</span><span class="ss">(</span><span class="nv">tbl2</span><span class="ss">)</span>,<span class="w"> </span><span class="s2">"v"</span><span class="ss">(</span><span class="nv">in</span><span class="ss">)</span>
<span class="w"> </span><span class="nv">vec_st</span><span class="ss">(</span><span class="nv">out</span>,<span class="w"> </span><span class="nv">i</span>,<span class="w"> </span><span class="ss">(</span><span class="nv">unsigned</span><span class="w"> </span><span class="nv">char</span><span class="w"> </span><span class="o">*</span><span class="ss">)</span><span class="nv">dest</span><span class="ss">)</span><span class="c1">;</span>
<span class="w"> </span>}
}
</code></pre></div>
<p>There are a few quirks to iron out - making sure the table is laid out in the vector register the way you expect, etc. - but this generally works and is quite fast: my Power 8 VM does about 17-18 GB/s with non-cache-contained data with this implementation.</p>
<p>We can go a bit faster by doing larger chunks at a time:</p>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="ss">(</span><span class="nv">i</span><span class="o">=</span><span class="mi">0</span><span class="c1">; i<vlen; i+=64) {</span>
<span class="w"> </span><span class="nv">in1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nv">vec_ld</span><span class="ss">(</span><span class="nv">i</span>,<span class="w"> </span><span class="ss">(</span><span class="nv">unsigned</span><span class="w"> </span><span class="nv">char</span><span class="w"> </span><span class="o">*</span><span class="ss">)</span><span class="nv">src</span><span class="ss">)</span><span class="c1">;</span>
<span class="w"> </span><span class="nv">in2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nv">vec_ld</span><span class="ss">(</span><span class="nv">i</span><span class="o">+</span><span class="mi">16</span>,<span class="w"> </span><span class="ss">(</span><span class="nv">unsigned</span><span class="w"> </span><span class="nv">char</span><span class="w"> </span><span class="o">*</span><span class="ss">)</span><span class="nv">src</span><span class="ss">)</span><span class="c1">;</span>
<span class="w"> </span><span class="nv">in3</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nv">vec_ld</span><span class="ss">(</span><span class="nv">i</span><span class="o">+</span><span class="mi">32</span>,<span class="w"> </span><span class="ss">(</span><span class="nv">unsigned</span><span class="w"> </span><span class="nv">char</span><span class="w"> </span><span class="o">*</span><span class="ss">)</span><span class="nv">src</span><span class="ss">)</span><span class="c1">;</span>
<span class="w"> </span><span class="nv">in4</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nv">vec_ld</span><span class="ss">(</span><span class="nv">i</span><span class="o">+</span><span class="mi">48</span>,<span class="w"> </span><span class="ss">(</span><span class="nv">unsigned</span><span class="w"> </span><span class="nv">char</span><span class="w"> </span><span class="o">*</span><span class="ss">)</span><span class="nv">src</span><span class="ss">)</span><span class="c1">;</span>
<span class="w"> </span><span class="nv">__asm__</span><span class="ss">(</span><span class="s2">"vpermxor %0, %1, %2, %3"</span><span class="w"> </span>:<span class="w"> </span><span class="s2">"=v"</span><span class="ss">(</span><span class="nv">out1</span><span class="ss">)</span><span class="w"> </span>:<span class="w"> </span><span class="s2">"v"</span><span class="ss">(</span><span class="nv">tbl1</span><span class="ss">)</span>,<span class="w"> </span><span class="s2">"v"</span><span class="ss">(</span><span class="nv">tbl2</span><span class="ss">)</span>,<span class="w"> </span><span class="s2">"v"</span><span class="ss">(</span><span class="nv">in1</span><span class="ss">))</span><span class="c1">;</span>
<span class="w"> </span><span class="nv">__asm__</span><span class="ss">(</span><span class="s2">"vpermxor %0, %1, %2, %3"</span><span class="w"> </span>:<span class="w"> </span><span class="s2">"=v"</span><span class="ss">(</span><span class="nv">out2</span><span class="ss">)</span><span class="w"> </span>:<span class="w"> </span><span class="s2">"v"</span><span class="ss">(</span><span class="nv">tbl1</span><span class="ss">)</span>,<span class="w"> </span><span class="s2">"v"</span><span class="ss">(</span><span class="nv">tbl2</span><span class="ss">)</span>,<span class="w"> </span><span class="s2">"v"</span><span class="ss">(</span><span class="nv">in2</span><span class="ss">))</span><span class="c1">;</span>
<span class="w"> </span><span class="nv">__asm__</span><span class="ss">(</span><span class="s2">"vpermxor %0, %1, %2, %3"</span><span class="w"> </span>:<span class="w"> </span><span class="s2">"=v"</span><span class="ss">(</span><span class="nv">out3</span><span class="ss">)</span><span class="w"> </span>:<span class="w"> </span><span class="s2">"v"</span><span class="ss">(</span><span class="nv">tbl1</span><span class="ss">)</span>,<span class="w"> </span><span class="s2">"v"</span><span class="ss">(</span><span class="nv">tbl2</span><span class="ss">)</span>,<span class="w"> </span><span class="s2">"v"</span><span class="ss">(</span><span class="nv">in3</span><span class="ss">))</span><span class="c1">;</span>
<span class="w"> </span><span class="nv">__asm__</span><span class="ss">(</span><span class="s2">"vpermxor %0, %1, %2, %3"</span><span class="w"> </span>:<span class="w"> </span><span class="s2">"=v"</span><span class="ss">(</span><span class="nv">out4</span><span class="ss">)</span><span class="w"> </span>:<span class="w"> </span><span class="s2">"v"</span><span class="ss">(</span><span class="nv">tbl1</span><span class="ss">)</span>,<span class="w"> </span><span class="s2">"v"</span><span class="ss">(</span><span class="nv">tbl2</span><span class="ss">)</span>,<span class="w"> </span><span class="s2">"v"</span><span class="ss">(</span><span class="nv">in4</span><span class="ss">))</span><span class="c1">;</span>
<span class="w"> </span><span class="nv">vec_st</span><span class="ss">(</span><span class="nv">out1</span>,<span class="w"> </span><span class="nv">i</span>,<span class="w"> </span><span class="ss">(</span><span class="nv">unsigned</span><span class="w"> </span><span class="nv">char</span><span class="w"> </span><span class="o">*</span><span class="ss">)</span><span class="nv">dest</span><span class="ss">)</span><span class="c1">;</span>
<span class="w"> </span><span class="nv">vec_st</span><span class="ss">(</span><span class="nv">out2</span>,<span class="w"> </span><span class="nv">i</span><span class="o">+</span><span class="mi">16</span>,<span class="w"> </span><span class="ss">(</span><span class="nv">unsigned</span><span class="w"> </span><span class="nv">char</span><span class="w"> </span><span class="o">*</span><span class="ss">)</span><span class="nv">dest</span><span class="ss">)</span><span class="c1">;</span>
<span class="w"> </span><span class="nv">vec_st</span><span class="ss">(</span><span class="nv">out3</span>,<span class="w"> </span><span class="nv">i</span><span class="o">+</span><span class="mi">32</span>,<span class="w"> </span><span class="ss">(</span><span class="nv">unsigned</span><span class="w"> </span><span class="nv">char</span><span class="w"> </span><span class="o">*</span><span class="ss">)</span><span class="nv">dest</span><span class="ss">)</span><span class="c1">;</span>
<span class="w"> </span><span class="nv">vec_st</span><span class="ss">(</span><span class="nv">out4</span>,<span class="w"> </span><span class="nv">i</span><span class="o">+</span><span class="mi">48</span>,<span class="w"> </span><span class="ss">(</span><span class="nv">unsigned</span><span class="w"> </span><span class="nv">char</span><span class="w"> </span><span class="o">*</span><span class="ss">)</span><span class="nv">dest</span><span class="ss">)</span><span class="c1">;</span>
<span class="w"> </span>}
</code></pre></div>
<p>This goes at about 23.5 GB/s.</p>
<p>We can go one step further and do the core loop in assembler - that means we control the instruction layout and so on. I tried this: it turns out that for the basic vector multiply loop, if we turn off ASLR and pin to a particular CPU, we can see an improvement of a few percent (and a decrease in variability) over C code.</p>
<h2>Building from vector multiplication</h2>
<p>Once you're comfortable with the core vector multiplication, you can start to build more interesting routines.</p>
<p>A particularly useful one on Power turned out to be the multiply and add routine: like gf_vect_mul, except that rather than overwriting the output, it loads the output and xors the product in. This is a simple extension of the gf_vect_mul function so is left as an exercise to the reader.</p>
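<p>(If you want something to check your answer against, here is the scalar form of that idea, as a sketch with a made-up name. The vector version follows the same pattern as <code>gf_vect_mul_v2</code>, with an extra load of <code>dest</code> and an XOR before the store.)</p>
<div class="highlight"><pre><span></span><code>/* Hypothetical scalar multiply-and-add sketch: dest[i] ^= c * src[i] in GF(2^8). */
void gf_vect_mad_simple(size_t len, unsigned char c,
                        unsigned char *src, unsigned char *dest)
{
    size_t i;
    for (i = 0; i < len; i++)
        dest[i] ^= gf_mul(c, src[i]);
}
</code></pre></div>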
<p>The next step would be to start building erasure coding proper. Recall that to get an element of our output, we take a dot product: we take the corresponding input element of each disk, multiply it with the corresponding GF(2^8) coding matrix element and sum all those products. So all we need now is a dot product algorithm.</p>
<p>One approach is the conventional dot product:</p>
<ul>
<li>for each element<ul>
<li>zero accumulator</li>
<li>for each source<ul>
<li>load <code>input[source][element]</code></li>
<li>do GF(2^8) multiplication</li>
<li>xor into accumulator</li>
</ul>
</li>
<li>save accumulator to <code>output[element]</code></li>
</ul>
</li>
</ul>
<p>The other approach is multiply and add:</p>
<ul>
<li>for each source<ul>
<li>for each element<ul>
<li>load <code>input[source][element]</code></li>
<li>do GF(2^8) multiplication</li>
<li>load <code>output[element]</code></li>
<li>xor in product</li>
<li>save <code>output[element]</code></li>
</ul>
</li>
</ul>
</li>
</ul>
<p>The dot product approach has the advantage of fewer writes. The multiply and add approach has the advantage of better cache/prefetch performance. The approach you ultimately go with will probably depend on the characteristics of your machine and the length of data you are dealing with.</p>
<p>For what it's worth, ISA-L ships with only the first approach in x86 assembler, and Jerasure leans heavily towards the second approach.</p>
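<p>To make the two loop orders concrete, here is a scalar sketch of each (hypothetical names, using the illustrative <code>gf_mul</code> and <code>gf_vect_mad_simple</code> from above rather than the real vector routines):</p>
<div class="highlight"><pre><span></span><code>/* Conventional dot product: compute one output element at a time. */
void gf_dot_prod_simple(size_t len, int k, unsigned char *coeffs,
                        unsigned char **input, unsigned char *output)
{
    size_t i;
    int src;
    for (i = 0; i < len; i++) {
        unsigned char acc = 0;
        for (src = 0; src < k; src++)
            acc ^= gf_mul(coeffs[src], input[src][i]);
        output[i] = acc;
    }
}

/* Multiply-and-add: sweep each source in turn, accumulating into the output. */
void gf_dot_prod_mad(size_t len, int k, unsigned char *coeffs,
                     unsigned char **input, unsigned char *output)
{
    size_t i;
    int src;
    for (i = 0; i < len; i++)
        output[i] = 0;
    for (src = 0; src < k; src++)
        gf_vect_mad_simple(len, coeffs[src], input[src], output);
}
</code></pre></div>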
<p>Once you have a vector dot product sorted, you can build a full erasure coding setup: build your tables with your library, then do a dot product to generate each of your outputs!</p>
<p>In ISA-L, this is implemented something like this:</p>
<div class="highlight"><pre><span></span><code><span class="cm">/*</span>
<span class="cm"> * ec_encode_data_simple(length of each data input, number of inputs,</span>
<span class="cm"> * number of outputs, pre-generated GF(2^8) tables,</span>
<span class="cm"> * input data pointers, output code pointers)</span>
<span class="cm"> */</span>
<span class="nx">void</span><span class="w"> </span><span class="nx">ec_encode_data_simple</span><span class="p">(</span><span class="nx">int</span><span class="w"> </span><span class="nx">len</span><span class="p">,</span><span class="w"> </span><span class="nx">int</span><span class="w"> </span><span class="nx">k</span><span class="p">,</span><span class="w"> </span><span class="nx">int</span><span class="w"> </span><span class="nx">rows</span><span class="p">,</span><span class="w"> </span><span class="nx">unsigned</span><span class="w"> </span><span class="nx">char</span><span class="w"> </span><span class="o">*</span><span class="nx">g_tbls</span><span class="p">,</span>
<span class="w"> </span><span class="nx">unsigned</span><span class="w"> </span><span class="nx">char</span><span class="w"> </span><span class="o">**</span><span class="nx">data</span><span class="p">,</span><span class="w"> </span><span class="nx">unsigned</span><span class="w"> </span><span class="nx">char</span><span class="w"> </span><span class="o">**</span><span class="nx">coding</span><span class="p">)</span>
<span class="p">{</span>
<span class="w"> </span><span class="k">while</span><span class="w"> </span><span class="p">(</span><span class="nx">rows</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nx">gf_vect_dot_prod</span><span class="p">(</span><span class="nx">len</span><span class="p">,</span><span class="w"> </span><span class="nx">k</span><span class="p">,</span><span class="w"> </span><span class="nx">g_tbls</span><span class="p">,</span><span class="w"> </span><span class="nx">data</span><span class="p">,</span><span class="w"> </span><span class="o">*</span><span class="nx">coding</span><span class="p">);</span>
<span class="w"> </span><span class="nx">g_tbls</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="nx">k</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">32</span><span class="p">;</span>
<span class="w"> </span><span class="nx">coding</span><span class="o">++</span><span class="p">;</span>
<span class="w"> </span><span class="nx">rows</span><span class="o">--</span><span class="p">;</span>
<span class="w"> </span><span class="p">}</span>
<span class="p">}</span>
</code></pre></div>
<h2>Going faster still</h2>
<p>Eagle-eyed readers will notice that however we generate an output, we have to read all the input elements. This means that if we're doing a code with 10 data disks and 4 coding disks, we have to read each of the 10 inputs 4 times.</p>
<p>We could do better if we could calculate multiple outputs for each pass through the inputs. This is a little fiddly to implement, but does lead to a speed improvement.</p>
<p>ISA-L is an excellent example here. Intel goes up to 6 outputs at once: the number of outputs you can do is only limited by how many vector registers you have to put the various operands and results in.</p>
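<p>As a rough scalar sketch of the idea (hypothetical name and coefficient layout - here the coefficients for the second output simply follow those for the first):</p>
<div class="highlight"><pre><span></span><code>/* Hypothetical sketch: two outputs per pass, so each input element is read once
 * and used twice. */
void gf_2vect_dot_prod_simple(size_t len, int k, unsigned char *coeffs,
                              unsigned char **input, unsigned char **output)
{
    size_t i;
    int src;
    for (i = 0; i < len; i++) {
        unsigned char acc0 = 0, acc1 = 0;
        for (src = 0; src < k; src++) {
            unsigned char in = input[src][i];      /* loaded once */
            acc0 ^= gf_mul(coeffs[src], in);       /* coefficients for output 0 */
            acc1 ^= gf_mul(coeffs[k + src], in);   /* coefficients for output 1 */
        }
        output[0][i] = acc0;
        output[1][i] = acc1;
    }
}
</code></pre></div>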
<h2>Tips and tricks</h2>
<ul>
<li>
<p>Benchmarking is tricky. I do the following on a bare-metal, idle machine, with ASLR off and pinned to an arbitrary hardware thread. (Code is for the <a href="https://fishshell.com/">fish shell</a>)</p>
<div class="highlight"><pre><span></span><code><span class="n">for</span> <span class="n">x</span> <span class="n">in</span> <span class="p">(</span><span class="n">seq</span> <span class="mi">1</span> <span class="mi">50</span><span class="p">)</span>
<span class="n">setarch</span> <span class="n">ppc64le</span> <span class="o">-</span><span class="n">R</span> <span class="n">taskset</span> <span class="o">-</span><span class="n">c</span> <span class="mi">24</span> <span class="n">erasure_code</span><span class="o">/</span><span class="n">gf_vect_mul_perf</span>
<span class="n">end</span> <span class="p">|</span> <span class="n">awk</span> <span class="s">'/MB/ {sum+=$13} END {print sum/50, "MB/s"}'</span>
</code></pre></div>
</li>
<li>
<p>Debugging is tricky; the more you can do in C and the less you do in assembly, the easier your life will be.</p>
</li>
<li>
<p>Vector code is notoriously alignment-sensitive - if you can't figure out why something is wrong, check alignment. (Pro-tip: ISA-L does <em>not</em> guarantee the alignment of the <code>gftbls</code> parameter, and many of the tests supply an unaligned table from the stack. For testing <code>__attribute__((aligned(16)))</code> is your friend!)</p>
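<p>For example, something like this in a test harness (a hypothetical snippet, not ISA-L test code):</p>
<div class="highlight"><pre><span></span><code>/* Force 16-byte alignment of a table while testing. */
unsigned char gftbls[32] __attribute__((aligned(16)));
</code></pre></div>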
</li>
<li>
<p>Related: GCC is moving towards assignment over vector intrinsics, at least on Power:</p>
<div class="highlight"><pre><span></span><code>vector unsigned char a;
unsigned char * data;
// good, also handles word-aligned data with VSX
a = *(vector unsigned char *)data;
// bad, requires special handling of non-16-byte aligned data
a = vec_ld(0, (unsigned char *) data);
</code></pre></div>
</li>
</ul>
<h2>Conclusion</h2>
<p>Hopefully by this point you're equipped to figure out how your erasure coding library of choice works, and write your own optimised implementation (or maintain an implementation written by someone else).</p>
<p>I've referred to a number of resources throughout this series:</p>
<ul>
<li>ISA-L <a href="https://github.com/01org/isa-l">code</a>, <a href="https://01.org/intel%C2%AE-storage-acceleration-library-open-source-version/documentation/isa-l-open-source-api">API description</a></li>
<li>Jerasure <a href="http://jerasure.org/">code</a>, <a href="http://jerasure.org/jerasure-2.0/">docs</a></li>
<li>gf-complete <a href="http://jerasure.org/">code</a>, <a href="http://jerasure.org/gf-complete-1.02/">docs</a> </li>
<li><a href="https://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf">The mathematics of RAID-6</a> (PDF), H. Peter Anvin</li>
</ul>
<p>If you want to go deeper, I also read the following and found them quite helpful in understanding Galois Fields and Reed-Solomon coding:</p>
<ul>
<li><a href="https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19900019023.pdf">Tutorial on Reed-Solomon Error Correction Coding</a> (PDF), William A. Geisel, NASA</li>
<li><a href="https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19900019023.pdf">Reed-Solomon error correction</a> (PDF), BBC R&D White Paper WHP 031, C. K. P. Clarke.</li>
</ul>
<p>For a more rigorous mathematical approach to rings and fields, a university mathematics course may be of interest. For more on coding theory, a university course in electronics engineering may be helpful.</p></div>
</article>
<article>
<header>
<h1 class="entry-title">
<a href="https://sthbrx.github.io/blog/2017/03/20/erasure-coding-for-programmers-part-1/">Erasure Coding for Programmers, Part 1</a>
</h1>
<p class="meta">
<time datetime="2017-03-20T10:43:00+11:00" pubdate>Mon 20 March 2017</time> </p>
</header>
<div class="byline_index">
<span class="byline author vcard">
Posted by <span class="fn">
<a href="https://sthbrx.github.io/author/daniel-axtens.html">Daniel Axtens</a>
</span>
</span>
<time datetime="2017-03-20T10:43:00+11:00" pubdate>Mon 20 March 2017</time></div>
<div class="entry-content"><p>Erasure coding is an increasingly popular storage technology - allowing the same level of fault tolerance as replication with a significantly reduced storage footprint.</p>
<p>Increasingly, erasure coding is available 'out of the box' on storage solutions such as Ceph and OpenStack Swift. Normally, you'd just pull in a library like <a href="https://github.com/01org/isa-l">ISA-L</a> or <a href="http://jerasure.org">jerasure</a>, and set some config options, and you'd be done.</p>
<p>This post is not about that. This post is about how I went from knowing nothing about erasure coding to writing POWER optimised routines to make it go fast. (These are in the process of being polished for upstream at the moment.) If you want to understand how erasure coding works under the hood - and in particular if you're interested in writing optimised routines to make it run quickly in your platform - this is for you.</p>
<h2>What are erasure codes anyway?</h2>
<p>I think the easiest way to begin thinking about erasure codes is "RAID 6 on steroids". RAID 6 allows you to have up to 255 data disks and 2 parity disks (called P and Q), thus allowing you to tolerate the failure of up to 2 arbitrary disks without data loss.</p>
<p>Erasure codes allow you to have k data disks and m 'parity' or coding disks. You then have a total of m + k disks, and you can tolerate the failure of up to m without losing data.</p>
<p>The downside of erasure coding is that computing what to put on those parity disks is CPU intensive. Let's look at what we put on them.</p>
<h2>RAID 6</h2>
<p>RAID 6 is the easiest way to get started on understanding erasure codes for a number of reasons. H. Peter Anvin's paper on RAID 6 in the Linux kernel is an excellent start, but it dives into the underlying mathematics a bit quickly. So before reading that, read on!</p>
<h2>Rings and Fields</h2>
<p>As programmers we're pretty comfortable with modular arithmetic - the idea that if you have:</p>
<div class="highlight"><pre><span></span><code>unsigned char a = 255;
a++;
</code></pre></div>
<p>the new value of <code>a</code> will be 0, not 256.</p>
<p>This is an example of an algebraic structure called a <em>ring</em>.</p>
<p>Rings obey certain laws. For our purposes, we'll consider the following incomplete and somewhat simplified list:</p>
<ul>
<li>There is an addition operation.</li>
<li>There is an additive identity (normally called 0), such that 'a + 0 = a'.</li>
<li>Every element has an additive inverse, that is, for every element 'a', there is an element -a such that 'a + (-a) = 0'</li>
<li>There is a multiplication operation.</li>
<li>There is a multiplicative identity (normally called 1), such that 'a * 1 = a'.</li>
</ul>
<p>These operations aren't necessarily addition or multiplication as we might expect from the integers or real numbers. For example, in our modular arithmetic example, we have 'wrap around'. (There are also certain rules that the addition and multiplication operations must satisfy - we are glossing over them here.)</p>
<p>One thing a ring doesn't necessarily have is a 'multiplicative inverse'. The multiplicative inverse of some non-zero element of the ring (call it a) is the value b such that a * b = 1. (Often instead of b we write 'a^-1', but that looks bad in plain text, so we shall stick to b for now.)</p>
<p>We do have some inverses in 'mod 256': the inverse of 3 is 171 as 3 * 171 = 513, and 513 = 1 mod 256, but there is no b such that 2 * b = 1 mod 256.</p>
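<p>(You can convince yourself of this with a quick brute-force search - a throwaway snippet, not part of any library:)</p>
<div class="highlight"><pre><span></span><code>/* Return the multiplicative inverse of a modulo 256, or 0 if none exists.
 * inverse_mod_256(3) == 171; inverse_mod_256(2) == 0 (no inverse). */
unsigned char inverse_mod_256(unsigned char a)
{
    unsigned int b;
    for (b = 1; b < 256; b++)
        if ((unsigned char)(a * b) == 1)   /* unsigned char arithmetic wraps mod 256 */
            return (unsigned char)b;
    return 0;
}
</code></pre></div>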
<p>If every non-zero element of our ring had a multiplicative inverse, we would have what is called a <em>field</em>.</p>
<p>Now, let's look at the integers modulo 2, that is, 0 and 1.</p>
<p>We have this for addition:</p>
<table>
<thead>
<tr>
<th>+</th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>0</strong></td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td><strong>1</strong></td>
<td>1</td>
<td>0</td>
</tr>
</tbody>
</table>
<p>Eagle-eyed readers will notice that this is the same as XOR.</p>
<p>For multiplication: </p>
<table>
<thead>
<tr>
<th>*</th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>0</strong></td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td><strong>1</strong></td>
<td>0</td>
<td>1</td>
</tr>
</tbody>
</table>
<p>As we said, a field is a ring where every non-zero element has a multiplicative inverse. As we can see, the integers modulo 2 shown above is a field: it's a ring, and 1 is its own multiplicative inverse.</p>
<p>So this is all well and good, but you can't really do very much in a field with 2 elements. This is sad, so we make bigger fields. For this application, we consider the Galois Field with 256 elements - GF(2^8). This field has some surprising and useful properties.</p>
<p>Remember how we said that integers modulo 256 weren't a field because they didn't have multiplicative inverses? I also just said that GF(2^8) has 256 elements, but is a field - i.e., it does have inverses! How does that work?</p>
<p>Consider an element in GF(2^8). There are 2 ways to look at an element in GF(2^8). The first is to consider it as an 8-bit number. So, for example, let's take 100. We can express that as an 8-bit binary number: 0b01100100.</p>
<p>We can write that more explicitly as a sum of powers of 2:</p>
<div class="highlight"><pre><span></span><code><span class="mf">0</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mf">2</span><span class="o">^</span><span class="mf">7</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mf">1</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mf">2</span><span class="o">^</span><span class="mf">6</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mf">1</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mf">2</span><span class="o">^</span><span class="mf">5</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mf">0</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mf">2</span><span class="o">^</span><span class="mf">4</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mf">0</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mf">2</span><span class="o">^</span><span class="mf">3</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mf">1</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mf">2</span><span class="o">^</span><span class="mf">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mf">0</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mf">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mf">0</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mf">1</span>
<span class="o">=</span><span class="w"> </span><span class="mf">2</span><span class="o">^</span><span class="mf">6</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mf">2</span><span class="o">^</span><span class="mf">5</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mf">2</span><span class="o">^</span><span class="mf">2</span>
</code></pre></div>
<p>Now the other way we can look at elements in GF(2^8) is to replace the '2's with 'x's, and consider them as polynomials. Each of our bits then represents the coefficient of a term of a polynomial, that is:</p>
<div class="highlight"><pre><span></span><code><span class="mf">0</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">7</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mf">1</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">6</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mf">1</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">5</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mf">0</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">4</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mf">0</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">3</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mf">1</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mf">0</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mf">0</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mf">1</span>
</code></pre></div>
<p>or more simply</p>
<div class="highlight"><pre><span></span><code>x^6 + x^5 + x^2
</code></pre></div>
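<p>If you'd like to see that mapping in code, here is a minimal C sketch (my own illustration - the function name is made up, not from any library) that prints the polynomial form of a byte:</p>
<div class="highlight"><pre><span></span><code>#include <stdint.h>
#include <stdio.h>

/* Print the polynomial corresponding to a byte: each set bit i
 * contributes an x^i term. */
static void print_poly(uint8_t b)
{
    int first = 1;
    for (int i = 7; i >= 0; i--) {
        if (!(b & (1 << i)))
            continue;
        printf("%s", first ? "" : " + ");
        if (i == 0)
            printf("1");
        else if (i == 1)
            printf("x");
        else
            printf("x^%d", i);
        first = 0;
    }
    printf("%s\n", first ? "0" : "");
}

/* print_poly(100) prints: x^6 + x^5 + x^2 */
</code></pre></div>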
<p>Now, and this is <strong>important</strong>: each of the coefficients is an element of the integers modulo 2: x + x = 2x = 0, as 2 mod 2 = 0. There is no concept of 'carrying' in this addition.</p>
<p>Let's try: what's 100 + 79 in GF(2^8)?</p>
<div class="highlight"><pre><span></span><code><span class="mf">100</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">0</span><span class="n">b01100100</span><span class="w"> </span><span class="o">=></span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">6</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">5</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">2</span>
<span class="w"> </span><span class="mf">79</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">0</span><span class="n">b01001111</span><span class="w"> </span><span class="o">=></span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">6</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">3</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mf">1</span>
<span class="mf">100</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mf">79</span><span class="w"> </span><span class="o">=></span><span class="w"> </span><span class="mf">0</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">5</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">3</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mf">0</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mf">1</span>
<span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">0</span><span class="n">b00101011</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">43</span>
</code></pre></div>
<p>So, 100 + 79 = 43 in GF(2^8)</p>
<p>You may notice we could have done that much more efficiently: we can add numbers in GF(2^8) by just XORing their binary representations together. Subtraction, amusingly, is the same as addition: 0 + x = x = 0 - x, as -1 is congruent to 1 modulo 2.</p>
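<p>In code, that observation is the whole implementation. A one-line sketch (the function name is my own, and it is not the gf_add tool mentioned below):</p>
<div class="highlight"><pre><span></span><code>#include <stdint.h>

/* Addition - and subtraction - in GF(2^8) is a bitwise XOR:
 * each coefficient is added modulo 2, with no carries. */
static uint8_t gf256_add(uint8_t a, uint8_t b)
{
    return a ^ b;
}

/* gf256_add(100, 79) == 43, matching the worked example above. */
</code></pre></div>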
<p>So at this point you might want to explore a few additions yourself. Fortunately there's a lovely tool that will let you do that:</p>
<div class="highlight"><pre><span></span><code><span class="n">sudo</span><span class="w"> </span><span class="n">apt</span><span class="w"> </span><span class="n">install</span><span class="w"> </span><span class="n">gf</span><span class="o">-</span><span class="n">complete</span><span class="o">-</span><span class="n">tools</span>
<span class="n">gf_add</span><span class="w"> </span><span class="o">$</span><span class="n">A</span><span class="w"> </span><span class="o">$</span><span class="n">B</span><span class="w"> </span><span class="mi">8</span>
</code></pre></div>
<p>This will give you A + B in GF(2^8).</p>
<div class="highlight"><pre><span></span><code>> gf_add 100 79 8
43
</code></pre></div>
<p>Excellent!</p>
<p>So, hold on to your hats, as this is where things get really weird. In the modular arithmetic example, we considered the elements of our ring to be numbers, and we performed our addition and multiplication modulo 256. In GF(2^8), we consider our elements as polynomials, and we perform our addition and multiplication modulo a polynomial. There is one polynomial conventionally used in these applications:</p>
<div class="highlight"><pre><span></span><code><span class="mf">0</span><span class="n">x11d</span><span class="w"> </span><span class="o">=></span><span class="w"> </span><span class="mf">0</span><span class="n">b1</span><span class="w"> </span><span class="mf">0001</span><span class="w"> </span><span class="mf">1101</span><span class="w"> </span><span class="o">=></span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">8</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">4</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">3</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mf">1</span>
</code></pre></div>
<p>It is possible to use other polynomials if they satisfy particular requirements, but for our applications we don't need to worry as we will always use 0x11d. I am not going to attempt to explain anything about this polynomial - take it as an article of faith.</p>
<p>So when we multiply two numbers, we multiply their polynomial representations. Then, to find out what that is modulo 0x11d, we do polynomial long division by 0x11d, and take the remainder.</p>
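<p>If it helps to see that reduction step as code, here is a small sketch (my own helper, not from any library). It mirrors the long division we're about to do by hand: while the leading term has degree 8 or more, cancel it by XORing in a suitably shifted copy of 0x11d.</p>
<div class="highlight"><pre><span></span><code>#include <stdint.h>

/* Reduce a polynomial of degree up to 15 (held in a 16-bit value)
 * modulo 0x11d. Each loop iteration cancels one leading term, which
 * is exactly one step of the polynomial long division below. */
static uint8_t gf256_reduce(uint16_t poly)
{
    for (int deg = 15; deg >= 8; deg--) {
        if (poly & (1u << deg))
            poly ^= (uint16_t)(0x11d << (deg - 8));
    }
    return (uint8_t)poly;
}

/* gf256_reduce(0x1f4) == 233: 0x1f4 is the product of 100 and 5
 * that we work through below. */
</code></pre></div>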
<p>Some examples will help.</p>
<p>Let's multiply 100 by 3.</p>
<div class="highlight"><pre><span></span><code><span class="mf">100</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">0</span><span class="n">b01100100</span><span class="w"> </span><span class="o">=></span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">6</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">5</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">2</span>
<span class="w"> </span><span class="mf">3</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">0</span><span class="n">b00000011</span><span class="w"> </span><span class="o">=></span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mf">1</span>
<span class="p">(</span><span class="n">x</span><span class="o">^</span><span class="mf">6</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">5</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">2</span><span class="p">)(</span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mf">1</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">7</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">6</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">3</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">6</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">5</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">2</span>
<span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">7</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">5</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">3</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">2</span>
</code></pre></div>
<p>Notice that some of the terms have disappeared: x^6 + x^6 = 0.</p>
<p>The degree of the result (the largest power that appears) is 7, which is less than the degree of 0x11d (which is 8), so we don't need to do anything: the remainder modulo 0x11d is simply x^7 + x^5 + x^3 + x^2.</p>
<p>In binary form, that is 0b10101100 = 172, so 100 * 3 = 172 in GF(2^8).</p>
<p>Fortunately <code>gf-complete-tools</code> also allows us to check multiplications:</p>
<div class="highlight"><pre><span></span><code>> gf_mult 100 3 8
172
</code></pre></div>
<p>Excellent!</p>
<p>Now let's see what happens if we multiply by a larger number. Let's multiply 100 by 5.</p>
<div class="highlight"><pre><span></span><code><span class="mf">100</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">0</span><span class="n">b01100100</span><span class="w"> </span><span class="o">=></span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">6</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">5</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">2</span>
<span class="w"> </span><span class="mf">5</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">0</span><span class="n">b00000101</span><span class="w"> </span><span class="o">=></span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mf">1</span>
<span class="p">(</span><span class="n">x</span><span class="o">^</span><span class="mf">6</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">5</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">2</span><span class="p">)(</span><span class="n">x</span><span class="o">^</span><span class="mf">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mf">1</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">8</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">7</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">4</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">6</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">5</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">2</span>
<span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">8</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">7</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">6</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">5</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">4</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">2</span>
</code></pre></div>
<p>Here we have an x^8 term, so the degree is 8. This means we will get a different result when we reduce modulo our polynomial. We do this with polynomial long division, which you will hopefully remember if you did some solid algebra in high school.</p>
<div class="highlight"><pre><span></span><code><span class="c"> 1</span>
<span class="c"> </span><span class="nb">---------------------------------------------</span>
<span class="c">x^8 </span><span class="nb">+</span><span class="c"> x^4 </span><span class="nb">+</span><span class="c"> x^3 </span><span class="nb">+</span><span class="c"> x^2 </span><span class="nb">+</span><span class="c"> 1 | x^8 </span><span class="nb">+</span><span class="c"> x^7 </span><span class="nb">+</span><span class="c"> x^6 </span><span class="nb">+</span><span class="c"> x^5 </span><span class="nb">+</span><span class="c"> x^4 </span><span class="nb">+</span><span class="c"> x^2</span>
<span class="c"> </span><span class="nb">-</span><span class="c"> x^8 </span><span class="nb">+</span><span class="c"> x^4 </span><span class="nb">+</span><span class="c"> x^3 </span><span class="nb">+</span><span class="c"> x^2 </span><span class="nb">+</span><span class="c"> 1</span>
<span class="c"> </span><span class="nb">-------------------------------------------</span>
<span class="c"> = x^7 </span><span class="nb">+</span><span class="c"> x^6 </span><span class="nb">+</span><span class="c"> x^5 </span><span class="nb">+</span><span class="c"> x^3 </span><span class="nb">+</span><span class="c"> 1</span>
</code></pre></div>
<p>So we have that our original polynomial (x^8 + x^7 + x^6 + x^5 + x^4 + x^2) is congruent to (x^7 + x^6 + x^5 + x^3 + 1) modulo the polynomial 0x11d.
Looking at the binary representation of that new polynomial, we have 0b11101001 = 233.</p>
<p>Sure enough:</p>
<div class="highlight"><pre><span></span><code>> gf_mult 100 5 8
233
</code></pre></div>
<p>Just to solidify the polynomial long division a bit, let's try a slightly larger example, 100 * 9:</p>
<div class="highlight"><pre><span></span><code><span class="mf">100</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">0</span><span class="n">b01100100</span><span class="w"> </span><span class="o">=></span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">6</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">5</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">2</span>
<span class="w"> </span><span class="mf">9</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">0</span><span class="n">b00001001</span><span class="w"> </span><span class="o">=></span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">3</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mf">1</span>
<span class="p">(</span><span class="n">x</span><span class="o">^</span><span class="mf">6</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">5</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">2</span><span class="p">)(</span><span class="n">x</span><span class="o">^</span><span class="mf">3</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mf">1</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">9</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">8</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">5</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">6</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">5</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">2</span>
<span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">9</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">8</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">6</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="mf">2</span>
</code></pre></div>
<p>Doing long division to reduce our result:</p>
<div class="highlight"><pre><span></span><code><span class="c"> x</span>
<span class="c"> </span><span class="nb">-----------------------------------</span>
<span class="c">x^8 </span><span class="nb">+</span><span class="c"> x^4 </span><span class="nb">+</span><span class="c"> x^3 </span><span class="nb">+</span><span class="c"> x^2 </span><span class="nb">+</span><span class="c"> 1 | x^9 </span><span class="nb">+</span><span class="c"> x^8 </span><span class="nb">+</span><span class="c"> x^6 </span><span class="nb">+</span><span class="c"> x^2</span>
<span class="c"> </span><span class="nb">-</span><span class="c"> x^9 </span><span class="nb">+</span><span class="c"> x^5 </span><span class="nb">+</span><span class="c"> x^4 </span><span class="nb">+</span><span class="c"> x^3 </span><span class="nb">+</span><span class="c"> x</span>
<span class="c"> </span><span class="nb">-------------------------------------------------</span>
<span class="c"> = x^8 </span><span class="nb">+</span><span class="c"> x^6 </span><span class="nb">+</span><span class="c"> x^5 </span><span class="nb">+</span><span class="c"> x^4 </span><span class="nb">+</span><span class="c"> x^3 </span><span class="nb">+</span><span class="c"> x^2 </span><span class="nb">+</span><span class="c"> x</span>
</code></pre></div>
<p>We still have a polynomial of degree 8, so we need to do another step:</p>
<div class="highlight"><pre><span></span><code> x + 1
-----------------------------------
x^8 + x^4 + x^3 + x^2 + 1 | x^9 + x^8 + x^6 + x^2
- x^9 + x^5 + x^4 + x^3 + x
-------------------------------------------------
= x^8 + x^6 + x^5 + x^4 + x^3 + x^2 + x
- x^8 + x^4 + x^3 + x^2 + 1
-----------------------------------------------
= x^6 + x^5 + x + 1
</code></pre></div>
<p>We now have a polynomial of degree less than 8 that is congruent to our original polynomial modulo 0x11d, and the binary form is 0b01100011 = 99.</p>
<div class="highlight"><pre><span></span><code>> gf_mult 100 9 8
99
</code></pre></div>
<p>This process can be done much more efficiently in code, of course - but understanding what is going on under the hood will make you <em>much</em> more comfortable with the faster methods!</p>
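<p>For the curious, here is one way the whole multiply-and-reduce could look in C. This is a sketch of the classic shift-and-XOR ('peasant multiplication') technique, not the kernel's actual implementation, and the function name is my own:</p>
<div class="highlight"><pre><span></span><code>#include <stdint.h>

/* Multiply two elements of GF(2^8) modulo 0x11d. Each iteration
 * XORs in one partial product, and 'a' is multiplied by x each time
 * round, reducing whenever it overflows past x^7. */
static uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint8_t product = 0;

    for (int i = 0; i < 8; i++) {
        if (b & 1)
            product ^= a;                  /* add a * (current term of b) */
        b >>= 1;
        a = (a << 1) ^ ((a & 0x80) ? 0x1d : 0);   /* a *= x, reduced */
    }
    return product;
}

/* gf256_mul(100, 3) == 172, gf256_mul(100, 5) == 233 and
 * gf256_mul(100, 9) == 99, matching gf_mult above. */
</code></pre></div>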
<p>I will not try to convince you that every non-zero element has a multiplicative inverse in this magic shadow land of GF(2^8), but it's important for the rest of the algorithms to work that such inverses do exist. Trust me on this.</p>
<h2>Back to RAID 6</h2>
<p>Equipped with this knowledge, you are ready to take on <a href="https://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf">RAID6 in the kernel</a> (PDF) sections 1 - 2.</p>
<p>Pause when you get to section 3 - this snippet is a bit magic and benefits from some explanation:</p>
<blockquote>
<p>Multiplication by {02} for a single byte can be implemented using the C code:</p>
</blockquote>
<div class="highlight"><pre><span></span><code>uint8_t c, cc;
cc = (c << 1) ^ ((c & 0x80) ? 0x1d : 0);
</code></pre></div>
<p>How does this work? Well:</p>
<p>Say you have a binary number 0bNMMM MMMM. Multiplication by 2 gives you 0bNMMMMMMM0, which is 9 bits. Now, there are two cases to consider.</p>
<p>If your leading bit (N) is 0, your product doesn't have an x^8 term, so we don't need to reduce it modulo the irreducible polynomial.</p>
<p>If your leading bit is 1 however, your product is x^8 + something, which does need to be reduced. Fortunately, because we took an 8 bit number and multiplied it by 2, the largest term is x^8, so we only need to reduce it once. So we xor our number with our polynomial to subtract it.</p>
<p>We implement this by letting the top bit overflow out and then XORing the lower 8 bits with the low 8 bits of the polynomial (0x1d).</p>
<p>So, back to the original statement:</p>
<div class="highlight"><pre><span></span><code><span class="ss">(</span><span class="nv">c</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="mi">1</span><span class="ss">)</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="ss">((</span><span class="nv">c</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="mi">0</span><span class="nv">x80</span><span class="ss">)</span><span class="w"> </span>?<span class="w"> </span><span class="mi">0</span><span class="nv">x1d</span><span class="w"> </span>:<span class="w"> </span><span class="mi">0</span><span class="ss">)</span>
<span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="o">|</span>
<span class="w"> </span><span class="o">></span><span class="w"> </span><span class="nv">multiply</span><span class="w"> </span><span class="nv">by</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="o">|</span>
<span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="o">|</span>
<span class="w"> </span><span class="o">></span><span class="w"> </span><span class="nv">is</span><span class="w"> </span><span class="nv">the</span><span class="w"> </span><span class="nv">high</span><span class="w"> </span><span class="nv">bit</span><span class="w"> </span><span class="nv">set</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">will</span><span class="w"> </span><span class="nv">the</span><span class="w"> </span><span class="nv">product</span><span class="w"> </span><span class="nv">have</span><span class="w"> </span><span class="nv">an</span><span class="w"> </span><span class="nv">x</span><span class="o">^</span><span class="mi">8</span><span class="w"> </span><span class="nv">term</span>?
<span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="o">|</span>
<span class="w"> </span><span class="o">></span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="nv">so</span>,<span class="w"> </span><span class="nv">reduce</span><span class="w"> </span><span class="nv">by</span><span class="w"> </span><span class="nv">the</span><span class="w"> </span><span class="nv">polynomial</span>
<span class="w"> </span><span class="o">|</span>
<span class="w"> </span><span class="o">></span><span class="w"> </span><span class="nv">otherwise</span>,<span class="w"> </span><span class="nv">leave</span><span class="w"> </span><span class="nv">alone</span>
</code></pre></div>
<p>Hopefully that makes sense.</p>
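<p>A quick numeric check, with a value I picked myself (it is not from the paper):</p>
<div class="highlight"><pre><span></span><code>#include <stdint.h>
#include <assert.h>

/* Check the multiply-by-{02} snippet on one concrete value. */
static void check_mul2(void)
{
    uint8_t c = 0xc4;   /* x^7 + x^6 + x^2 - the high bit is set */
    uint8_t cc = (c << 1) ^ ((c & 0x80) ? 0x1d : 0);

    /* c * x = x^8 + x^7 + x^3, which reduces modulo 0x11d
     * to x^7 + x^4 + x^2 + 1 = 0x95. */
    assert(cc == 0x95);   /* gf_mult 196 2 should agree: 149 == 0x95 */
}
</code></pre></div>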
<h3>Key points</h3>
<p>It's critical you understand the section on Altivec (the vperm stuff), so let's cover it in a bit more detail.</p>
<p>Say you want to do A * V, where A is a constant and V is an 8-bit variable. We can express V as V_a + V_b, where V_a is the bottom 4 bits of V and V_b is the top 4 bits (kept in their place value, so V_b is one of 00, 10, ... f0). Then A * V = A * V_a + A * V_b.</p>
<p>We can then make lookup tables for multiplication by A.</p>
<p>If we did this in the most obvious way, we would need a 256 entry lookup table. But by splitting things into the top and bottom halves, we can reduce that to two 16 entry tables. For example, say A = 02.</p>
<table>
<thead>
<tr>
<th>V_a</th>
<th>A * V_a</th>
</tr>
</thead>
<tbody>
<tr>
<td>00</td>
<td>00</td>
</tr>
<tr>
<td>01</td>
<td>02</td>
</tr>
<tr>
<td>02</td>
<td>04</td>
</tr>
<tr>
<td>...</td>
<td>...</td>
</tr>
<tr>
<td>0f</td>
<td>1e</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th>V_b</th>
<th>A * V_b</th>
</tr>
</thead>
<tbody>
<tr>
<td>00</td>
<td>00</td>
</tr>
<tr>
<td>10</td>
<td>20</td>
</tr>
<tr>
<td>20</td>
<td>40</td>
</tr>
<tr>
<td>...</td>
<td>...</td>
</tr>
<tr>
<td>f0</td>
<td>fd</td>
</tr>
</tbody>
</table>
<p>We then use vperm to look up entries in these tables and vxor to combine our results.</p>
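<p>If the vector instructions are hard to picture, here is a scalar C sketch of the same idea (my own illustration, reusing the hypothetical gf256_mul helper from the earlier sketch - this is not the kernel's code). It builds the two 16-entry tables for a given A, then multiplies any V with two lookups and an XOR; vperm simply does 16 of those lookups at once.</p>
<div class="highlight"><pre><span></span><code>#include <stdint.h>

uint8_t gf256_mul(uint8_t a, uint8_t b);   /* the shift-and-XOR sketch from earlier */

static uint8_t lo_table[16];   /* A * 00, A * 01, ... A * 0f */
static uint8_t hi_table[16];   /* A * 00, A * 10, ... A * f0 */

static void build_tables(uint8_t A)
{
    for (int i = 0; i < 16; i++) {
        lo_table[i] = gf256_mul(A, (uint8_t)i);          /* bottom nibble */
        hi_table[i] = gf256_mul(A, (uint8_t)(i << 4));   /* top nibble */
    }
}

/* A * V = A * (top nibble of V) + A * (bottom nibble of V),
 * and '+' in GF(2^8) is XOR. */
static uint8_t mul_by_A(uint8_t v)
{
    return hi_table[v >> 4] ^ lo_table[v & 0x0f];
}
</code></pre></div>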
<p>So - and this is a key point - for each A value we wish to multiply by, we need to generate a new pair of lookup tables.</p>
<p>So if we wanted A = 03:</p>
<table>
<thead>
<tr>
<th>V_a</th>
<th>A * V_a</th>
</tr>
</thead>
<tbody>
<tr>
<td>00</td>
<td>00</td>
</tr>
<tr>
<td>01</td>
<td>03</td>
</tr>
<tr>
<td>02</td>
<td>06</td>
</tr>
<tr>
<td>...</td>
<td>...</td>
</tr>
<tr>
<td>0f</td>
<td>11</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th>V_b</th>
<th>A * V_b</th>
</tr>
</thead>
<tbody>
<tr>
<td>00</td>
<td>00</td>
</tr>
<tr>
<td>10</td>
<td>30</td>
</tr>
<tr>
<td>20</td>
<td>60</td>
</tr>
<tr>
<td>...</td>
<td>...</td>
</tr>
<tr>
<td>f0</td>
<td>0d</td>
</tr>
</tbody>
</table>
<p>One final thing is that Power8 adds a vpermxor instruction, so we can reduce the entire 4-instruction sequence in the paper:</p>
<div class="highlight"><pre><span></span><code>vsrb v1, v0, v14
vperm v2, v12, v12, v0
vperm v1, v13, v13, v1
vxor v1, v2, v1
</code></pre></div>
<p>to a single vpermxor:</p>
<div class="highlight"><pre><span></span><code>vpermxor v1, v12, v13, v0
</code></pre></div>
<p>Isn't POWER grand?</p>
<h2>OK, but how does this relate to erasure codes?</h2>
<p>I'm glad you asked.</p>
<p>Galois Field arithmetic, and its application in RAID 6, is the basis for erasure coding. (It's also the basis for CRCs - two for the price of one!)</p>
<p>But, that's all to come in part 2, which will definitely be published before 7 April!</p>
<p>Many thanks to Sarah Axtens who reviewed the mathematical content of this post and suggested significant improvements. All errors and gross oversimplifications remain my own. Thanks also to the OzLabs crew for their feedback and comments.</p></div>
</article>
<article>
<header>
<h1 class="entry-title">
<a href="https://sthbrx.github.io/blog/2017/02/13/high-power-lustre/">High Power Lustre</a>
</h1>
<p class="meta">
<time datetime="2017-02-13T16:29:00+11:00" pubdate>Mon 13 February 2017</time> </p>
</header>
<div class="byline_index">
<span class="byline author vcard">
Posted by <span class="fn">
<a href="https://sthbrx.github.io/author/daniel-axtens.html">Daniel Axtens</a>,
<a href="https://sthbrx.github.io/author/rashmica-gupta.html">Rashmica Gupta</a>
</span>
</span>
<time datetime="2017-02-13T16:29:00+11:00" pubdate>Mon 13 February 2017</time></div>
<div class="entry-content"><p>(Most of the hard work here was done by fellow blogger Rashmica - I just verified her instructions and wrote up this post.)</p>
<p><a href="http://lustre.org/">Lustre</a> is a high-performance clustered file system. Traditionally the Lustre client and server have run on x86, but both the server and client will also work on Power. Here's how to get them running.</p>
<h1>Server</h1>
<p>Lustre normally requires a patched 'enterprise' kernel - normally an old RHEL, CentOS or SUSE kernel. We tested with a CentOS 7.3 kernel. We tried to follow <a href="https://wiki.hpdd.intel.com/pages/viewpage.action?pageId=52104622">the Intel instructions</a> for building the kernel as much as possible - any deviations we had to make are listed below.</p>
<h2>Setup quirks</h2>
<p>We are told to edit <code>~/kernel/rpmbuild/SPEC/kernel.spec</code>. This doesn't exist because the directory is <code>SPECS</code> not <code>SPEC</code>: you need to edit <code>~/kernel/rpmbuild/SPECS/kernel.spec</code>.</p>
<p>I also found there was an extra quote mark in the supplied patch script after <code>-lustre.patch</code>. I removed that and ran this instead:</p>
<div class="highlight"><pre><span></span><code>for<span class="w"> </span>patch<span class="w"> </span>in<span class="w"> </span>$(<span class="err"><</span>"3.10-rhel7.series");<span class="w"> </span>do<span class="w"> </span>\
<span class="w"> </span>patch_file="<span class="nv">$HOME</span>/lustre-release/lustre/kernel_patches/patches/<span class="cp">${</span><span class="n">patch</span><span class="cp">}</span>"<span class="w"> </span>\
<span class="w"> </span>cat<span class="w"> </span>"<span class="cp">${</span><span class="n">patch_file</span><span class="cp">}</span>"<span class="w"> </span>>><span class="w"> </span><span class="nv">$HOME</span>/lustre-kernel-x86_64-lustre.patch<span class="w"> </span>\
done
</code></pre></div>
<p>The fact that there is 'x86_64' in the patch name doesn't matter as you're about to copy it under a different name to a place where it will be included by the spec file.</p>
<h2>Building for ppc64le</h2>
<p>Building for ppc64le was reasonably straight-forward. I had one small issue:</p>
<div class="highlight"><pre><span></span><code><span class="p">[</span><span class="n">build</span><span class="err">@</span><span class="n">dja</span><span class="o">-</span><span class="n">centos</span><span class="o">-</span><span class="n">guest</span><span class="w"> </span><span class="n">rpmbuild</span><span class="p">]</span><span class="o">$</span><span class="w"> </span><span class="n">rpmbuild</span><span class="w"> </span><span class="o">-</span><span class="n">bp</span><span class="w"> </span><span class="o">--</span><span class="n">target</span><span class="o">=</span><span class="err">`</span><span class="n">uname</span><span class="w"> </span><span class="o">-</span><span class="n">m</span><span class="err">`</span><span class="w"> </span><span class="o">./</span><span class="n">SPECS</span><span class="o">/</span><span class="n">kernel</span><span class="o">.</span><span class="n">spec</span>
<span class="n">Building</span><span class="w"> </span><span class="n">target</span><span class="w"> </span><span class="n">platforms</span><span class="p">:</span><span class="w"> </span><span class="n">ppc64le</span>
<span class="n">Building</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">target</span><span class="w"> </span><span class="n">ppc64le</span>
<span class="n">error</span><span class="p">:</span><span class="w"> </span><span class="n">Failed</span><span class="w"> </span><span class="n">build</span><span class="w"> </span><span class="n">dependencies</span><span class="p">:</span>
<span class="w"> </span><span class="n">net</span><span class="o">-</span><span class="n">tools</span><span class="w"> </span><span class="k">is</span><span class="w"> </span><span class="n">needed</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="n">kernel</span><span class="o">-</span><span class="mf">3.10</span><span class="o">.</span><span class="mi">0</span><span class="o">-</span><span class="mf">327.36</span><span class="o">.</span><span class="mf">3.</span><span class="n">el7</span><span class="o">.</span><span class="n">ppc64le</span>
</code></pre></div>
<p>Fixing this was as simple as a <code>yum install net-tools</code>.</p>
<p>This was sufficient to build the kernel RPMs. I installed them and booted to my patched kernel - so far so good!</p>
<h1>Building the client packages: CentOS</h1>
<p>I then tried to build and install the RPMs from <a href="https://git.hpdd.intel.com/?p=fs/lustre-release.git;a=summary"><code>lustre-release</code></a>. This repository provides the sources required to build the client and utility binaries.</p>
<p><code>./configure</code> and <code>make</code> succeeded, but when I went to install the packages with <code>rpm</code>, I found I was missing some dependencies:</p>
<div class="highlight"><pre><span></span><code><span class="n">error</span><span class="o">:</span><span class="w"> </span><span class="n">Failed</span><span class="w"> </span><span class="n">dependencies</span><span class="o">:</span>
<span class="w"> </span><span class="n">ldiskfsprogs</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="mf">1.42</span><span class="o">.</span><span class="mi">7</span><span class="o">.</span><span class="na">wc1</span><span class="w"> </span><span class="k">is</span><span class="w"> </span><span class="n">needed</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="n">kmod</span><span class="o">-</span><span class="n">lustre</span><span class="o">-</span><span class="n">osd</span><span class="o">-</span><span class="n">ldiskfs</span><span class="o">-</span><span class="mf">2.9</span><span class="o">.</span><span class="mi">52</span><span class="n">_60_g1d2fbad_dirty</span><span class="o">-</span><span class="mi">1</span><span class="o">.</span><span class="na">el7</span><span class="o">.</span><span class="na">centos</span><span class="o">.</span><span class="na">ppc64le</span>
<span class="w"> </span><span class="n">sg3_utils</span><span class="w"> </span><span class="k">is</span><span class="w"> </span><span class="n">needed</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="n">lustre</span><span class="o">-</span><span class="n">iokit</span><span class="o">-</span><span class="mf">2.9</span><span class="o">.</span><span class="mi">52</span><span class="n">_60_g1d2fbad_dirty</span><span class="o">-</span><span class="mi">1</span><span class="o">.</span><span class="na">el7</span><span class="o">.</span><span class="na">centos</span><span class="o">.</span><span class="na">ppc64le</span>
<span class="w"> </span><span class="n">attr</span><span class="w"> </span><span class="k">is</span><span class="w"> </span><span class="n">needed</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="n">lustre</span><span class="o">-</span><span class="n">tests</span><span class="o">-</span><span class="mf">2.9</span><span class="o">.</span><span class="mi">52</span><span class="n">_60_g1d2fbad_dirty</span><span class="o">-</span><span class="mi">1</span><span class="o">.</span><span class="na">el7</span><span class="o">.</span><span class="na">centos</span><span class="o">.</span><span class="na">ppc64le</span>
<span class="w"> </span><span class="n">lsof</span><span class="w"> </span><span class="k">is</span><span class="w"> </span><span class="n">needed</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="n">lustre</span><span class="o">-</span><span class="n">tests</span><span class="o">-</span><span class="mf">2.9</span><span class="o">.</span><span class="mi">52</span><span class="n">_60_g1d2fbad_dirty</span><span class="o">-</span><span class="mi">1</span><span class="o">.</span><span class="na">el7</span><span class="o">.</span><span class="na">centos</span><span class="o">.</span><span class="na">ppc64le</span>
</code></pre></div>
<p>I was able to install <code>sg3_utils</code>, <code>attr</code> and <code>lsof</code>, but I was still missing <code>ldiskfsprogs</code>.</p>
<p>It seems we need the lustre-patched version of <code>e2fsprogs</code> - I found a <a href="https://groups.google.com/forum/#!topic/lustre-discuss-list/U93Ja6Xkxfk">mailing list post</a> to that effect.</p>
<p>So, following the instructions on the walkthrough, I grabbed <a href="https://downloads.hpdd.intel.com/public/e2fsprogs/latest/el7/SRPMS/">the SRPM</a> and installed the dependencies: <code>yum install -y texinfo libblkid-devel libuuid-devel</code></p>
<p>I then tried <code>rpmbuild -ba SPECS/e2fsprogs-RHEL-7.spec</code>. This built but failed tests. Some failed because I ran out of disk space - they were using tens of gigabytes. I found some comments in the spec file about this, with suggested tests to disable, so I did that. Even with that fix, I was still failing two tests:</p>
<ul>
<li><code>f_pgsize_gt_blksize</code>: Intel added this to their fork, and no equivalent exists in the master e2fsprogs branches. This relates to Intel specific assumptions about page sizes which don't hold on Power.</li>
<li><code>f_eofblocks</code>: This may need fixing for large page sizes, see <a href="https://jira.hpdd.intel.com/browse/LU-4677?focusedCommentId=78814&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-78814">this bug</a>.</li>
</ul>
<p>I disabled the tests by adding the following two lines to the spec file, just before <code>make %{?_smp_mflags} check</code>.</p>
<div class="highlight"><pre><span></span><code>rm -rf tests/f_pgsize_gt_blksize
rm -rf tests/f_eofblocks
</code></pre></div>
<p>With those tests disabled I was able to build the packages successfully. I installed them with <code>yum localinstall *1.42.13.wc5*</code> (I needed that rather weird pattern to pick up important RPMs that didn't fit the <code>e2fs*</code> pattern - things like <code>libcom_err</code> and <code>libss</code>).</p>
<p>Following that I went back to the <code>lustre-release</code> build products and was able to successfully run <code>yum localinstall *ppc64le.rpm</code>!</p>
<h1>Testing the server</h1>
<p>After disabling SELinux and rebooting, I ran the test script:</p>
<div class="highlight"><pre><span></span><code>sudo /usr/lib64/lustre/tests/llmount.sh
</code></pre></div>
<p>This spat out one scary warning:</p>
<div class="highlight"><pre><span></span><code><span class="n">mount</span><span class="o">.</span><span class="n">lustre</span><span class="w"> </span><span class="n">FATAL</span><span class="p">:</span><span class="w"> </span><span class="n">unhandled</span><span class="o">/</span><span class="n">unloaded</span><span class="w"> </span><span class="n">fs</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="s1">'ext3'</span>
</code></pre></div>
<p>The test did seem to succeed overall, and that appears to be a <a href="https://jira.hpdd.intel.com/browse/LU-9059">known problem</a>, so I pressed on undeterred.</p>
<p>I then attached a couple of virtual hard drives for the metadata and object store volumes, and having set them up, proceeded to try to mount my freshly minted lustre volume from some clients.</p>
<h1>Testing with a ppc64le client</h1>
<p>My first step was to test whether another ppc64le machine would work as a client.</p>
<p>I tried with an existing Ubuntu 16.04 VM that I use for much of my day to day development.</p>
<p>A quick google suggested that I could grab the <code>lustre-release</code> repository and run <code>make debs</code> to get Debian packages for my system.</p>
<p>I needed the following dependencies:</p>
<div class="highlight"><pre><span></span><code>sudo apt install module-assistant debhelper dpatch libsnmp-dev quilt
</code></pre></div>
<p>With those the packages built successfully, and could be easily installed:</p>
<div class="highlight"><pre><span></span><code>dpkg -i lustre-client-modules-4.4.0-57-generic_2.9.52-60-g1d2fbad-dirty-1_ppc64el.deblustre-utils_2.9.52-60-g1d2fbad-dirty-1_ppc64el.deb
</code></pre></div>
<p>I tried to connect to the server:</p>
<div class="highlight"><pre><span></span><code><span class="n">sudo</span><span class="w"> </span><span class="n">mount</span><span class="w"> </span><span class="o">-</span><span class="n">t</span><span class="w"> </span><span class="n">lustre</span><span class="w"> </span><span class="err">$</span><span class="n">SERVER_IP</span><span class="nv">@tcp</span><span class="err">:</span><span class="o">/</span><span class="n">lustre</span><span class="w"> </span><span class="o">/</span><span class="n">lustre</span><span class="o">/</span>
</code></pre></div>
<p>Initially I wasn't able to connect to the server at all. I remembered that (unlike Ubuntu), CentOS comes with quite an aggressive firewall by default. I ran the following on the server:</p>
<div class="highlight"><pre><span></span><code>systemctl stop firewalld
</code></pre></div>
<p>And voila! I was able to connect, mount the lustre volume, and successfully read and write to it. This is very much an over-the-top hack - I should have poked holes in the firewall to allow just the ports lustre needed. This is left as an exercise for the reader.</p>
<h1>Testing with an x86_64 client</h1>
<p>I then tried to run <code>make debs</code> on my Ubuntu 16.10 x86_64 laptop.</p>
<p>This did not go well - I got the following error:</p>
<div class="highlight"><pre><span></span><code>liblustreapi.c: In function ‘llapi_get_poollist’:
liblustreapi.c:1201:3: error: ‘readdir_r’ is deprecated [-Werror=deprecated-declarations]
</code></pre></div>
<p>This looks like one of the new errors introduced in recent GCC versions, and is <a href="https://jira.hpdd.intel.com/browse/LU-8724?focusedCommentId=175244&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-175244">a known bug</a>. To work around it, I found the following stanza in <code>lustre/autoconf/lustre-core.m4</code>, and removed the <code>-Werror</code>:</p>
<div class="highlight"><pre><span></span><code>AS_IF([test $target_cpu == "i686" -o $target_cpu == "x86_64"],
[CFLAGS="$CFLAGS -Wall -Werror"])
</code></pre></div>
<p>Even this wasn't enough: I got the following errors:</p>
<div class="highlight"><pre><span></span><code><span class="o">/</span><span class="nx">home</span><span class="o">/</span><span class="nx">dja</span><span class="o">/</span><span class="nx">dev</span><span class="o">/</span><span class="nx">lustre</span><span class="o">-</span><span class="nx">release</span><span class="o">/</span><span class="nx">debian</span><span class="o">/</span><span class="nx">tmp</span><span class="o">/</span><span class="nx">modules</span><span class="o">-</span><span class="nx">deb</span><span class="o">/</span><span class="nx">usr_src</span><span class="o">/</span><span class="nx">modules</span><span class="o">/</span><span class="nx">lustre</span><span class="o">/</span><span class="nx">lustre</span><span class="o">/</span><span class="nx">llite</span><span class="o">/</span><span class="nx">dcache</span><span class="p">.</span><span class="nx">c</span><span class="p">:</span><span class="mi">387</span><span class="p">:</span><span class="mi">22</span><span class="p">:</span><span class="w"> </span><span class="nx">error</span><span class="p">:</span><span class="w"> </span><span class="nx">initialization</span><span class="w"> </span><span class="nx">from</span><span class="w"> </span><span class="nx">incompatible</span><span class="w"> </span><span class="nx">pointer</span><span class="w"> </span><span class="k">type</span><span class="w"> </span><span class="p">[</span><span class="o">-</span><span class="nx">Werror</span><span class="p">=</span><span class="nx">incompatible</span><span class="o">-</span><span class="nx">pointer</span><span class="o">-</span><span class="nx">types</span><span class="p">]</span>
<span class="w"> </span><span class="p">.</span><span class="nx">d_compare</span><span class="w"> </span><span class="p">=</span><span class="w"> </span><span class="nx">ll_dcompare</span><span class="p">,</span>
<span class="w"> </span><span class="o">^~~~~~~~~~~</span>
<span class="o">/</span><span class="nx">home</span><span class="o">/</span><span class="nx">dja</span><span class="o">/</span><span class="nx">dev</span><span class="o">/</span><span class="nx">lustre</span><span class="o">-</span><span class="nx">release</span><span class="o">/</span><span class="nx">debian</span><span class="o">/</span><span class="nx">tmp</span><span class="o">/</span><span class="nx">modules</span><span class="o">-</span><span class="nx">deb</span><span class="o">/</span><span class="nx">usr_src</span><span class="o">/</span><span class="nx">modules</span><span class="o">/</span><span class="nx">lustre</span><span class="o">/</span><span class="nx">lustre</span><span class="o">/</span><span class="nx">llite</span><span class="o">/</span><span class="nx">dcache</span><span class="p">.</span><span class="nx">c</span><span class="p">:</span><span class="mi">387</span><span class="p">:</span><span class="mi">22</span><span class="p">:</span><span class="w"> </span><span class="nx">note</span><span class="p">:</span><span class="w"> </span><span class="p">(</span><span class="nx">near</span><span class="w"> </span><span class="nx">initialization</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="err">‘</span><span class="nx">ll_d_ops</span><span class="p">.</span><span class="nx">d_compare</span><span class="err">’</span><span class="p">)</span>
</code></pre></div>
<p>I figured this was probably because Ubuntu 16.10 has a 4.8 kernel, and Ubuntu 16.04 has a 4.4 kernel. Work on supporting 4.8 <a href="https://jira.hpdd.intel.com/browse/LU-9003">is ongoing</a>.</p>
<p>Sure enough, when I fired up a 16.04 x86_64 VM with a 4.4 kernel, I was able to build and install fine.</p>
<p>Connecting didn't work first time - the guest failed to mount, but I did get the following helpful error on the server:</p>
<div class="highlight"><pre><span></span><code><span class="n">LNetError</span><span class="o">:</span><span class="w"> </span><span class="mi">2595</span><span class="o">:</span><span class="mi">0</span><span class="o">:(</span><span class="n">acceptor</span><span class="o">.</span><span class="na">c</span><span class="o">:</span><span class="mi">406</span><span class="o">:</span><span class="n">lnet_acceptor</span><span class="o">())</span><span class="w"> </span><span class="n">Refusing</span><span class="w"> </span><span class="n">connection</span><span class="w"> </span><span class="n">from</span><span class="w"> </span><span class="mf">10.61</span><span class="o">.</span><span class="mf">2.227</span><span class="o">:</span><span class="w"> </span><span class="n">insecure</span><span class="w"> </span><span class="n">port</span><span class="w"> </span><span class="mi">1024</span>
</code></pre></div>
<p>Refusing insecure port 1024 made me think that perhaps the NATing that qemu was performing for me was interfering - perhaps the server expected to get a connection from a privileged source port, and qemu wouldn't be able to do that with NAT.</p>
<p>Sure enough, switching NAT to bridging was enough to get the x86 VM to talk to the ppc64le server. I verified that <code>ls</code>, reading and writing all succeeded.</p>
<h1>Next steps</h1>
<p>The obvious next steps are following up the disabled tests in e2fsprogs, and doing a lot of internal performance and functionality testing.</p>
<p>Happily, it looks like Lustre might be in the mainline kernel before too long - parts have already started to go into staging. This will make our lives a lot easier: for example, the breakage between 4.4 and 4.8 would probably have already been picked up and fixed if it were in the main kernel tree rather than in an out-of-tree patch set.</p>
<p>In the long run, we'd like to make Lustre on Power just as easy as Lustre on x86. (And, of course, more performant!) We'll keep you up to date!</p>
<p>(Thanks to fellow bloggers Daniel Black and Andrew Donnellan for useful feedback on this post.)</p></div>
</article>
<article>
<header>
<h1 class="entry-title">
<a href="https://sthbrx.github.io/blog/2017/02/01/namd-on-nvlink/">NAMD on NVLink</a>
</h1>
<p class="meta">
<time datetime="2017-02-01T08:32:00+11:00" pubdate>Wed 01 February 2017</time> </p>
</header>
<div class="byline_index">
<span class="byline author vcard">
Posted by <span class="fn">
<a href="https://sthbrx.github.io/author/daniel-axtens.html">Daniel Axtens</a>,
<a href="https://sthbrx.github.io/author/rashmica-gupta.html">Rashmica Gupta</a>,
<a href="https://sthbrx.github.io/author/daniel-black.html">Daniel Black</a>
</span>
</span>
<time datetime="2017-02-01T08:32:00+11:00" pubdate>Wed 01 February 2017</time></div>
<div class="entry-content"><p>NAMD is a molecular dynamics program that can use GPU acceleration to speed up its calculations. Recent OpenPOWER machines like the IBM Power Systems S822LC for High Performance Computing (Minsky) come with a new interconnect for GPUs called NVLink, which offers extremely high bandwidth to a number of very powerful Nvidia Pascal P100 GPUs. So they're ideal machines for this sort of workload.</p>
<p>Here's how to set up NAMD 2.12 on your Minsky, and how to debug some common issues. We've targeted this script for CentOS, but we've successfully compiled NAMD on Ubuntu as well.</p>
<h2>Prerequisites</h2>
<h3>GPU Drivers and CUDA</h3>
<p>Firstly, you'll need CUDA and the NVidia drivers.</p>
<p>You can install CUDA by following the instructions on NVidia's <a href="https://developer.nvidia.com/cuda-downloads">CUDA Downloads</a> page.</p>
<div class="highlight"><pre><span></span><code><span class="n">yum</span><span class="w"> </span><span class="n">install</span><span class="w"> </span><span class="n">epel</span><span class="o">-</span><span class="n">release</span>
<span class="n">yum</span><span class="w"> </span><span class="n">install</span><span class="w"> </span><span class="n">dkms</span>