BLIS evaluation

Practical

weekly update meetings on Thursdays at 13:15 UTC (via Zoom: https://tiny.cc/eb_conf_call)
collect info & scripts in https://github.com/easybuilders/blis-eval
- just push to main branch, no PRs needed

In scope

BLIS + libFLAME (LAPACK)
gobff vs foss
iibff vs intel
also FFTW?

Notes meeting 20210318

Åke
- BLAS testing
  - IEEE signalling issue reported to BLIS: https://github.com/flame/blis/issues/486
    - Only happens on Broadwell (BLIS compiles for haswell)
    - Skylake and AMD EPYC (zen2) are ok
- LAPACK testing (using BLIS only)
  - More results from Skylake and AMD EPYC (zen2)
    - Skylake has more errors than Broadwell
    - AMD (zen2) has the same number of errors as Broadwell
- LAPACK testing with libFlame (and refblas)
  - libFLAME doesn't contain all the functions needed, so we have to link with the reflapack lib too
  - first test xlintsts < stest.in causes "Segmentation fault - invalid memory reference."
    - https://github.com/flame/libflame/issues/46
Sam
- CP2K with goblf (BLIS, LAPACK, no libFLAME):
  - fixes all extra failed tests (summary now looks exactly the same as with foss)
  - performance tests underway - need to use BLIS_NUM_THREADS?
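For the performance runs, one way to make the BLIS thread count explicit is to set it in the environment of the benchmark process. A minimal sketch, assuming the run is launched from Python; `blis_env` is a hypothetical helper, and only `BLIS_NUM_THREADS` (the variable named above) is set:

```python
import os


def blis_env(num_threads, base_env=None):
    """Return a copy of the environment with the BLIS thread count set.

    blis_env is a hypothetical helper, not part of BLIS or EasyBuild.
    """
    if num_threads < 1:
        raise ValueError("num_threads must be >= 1")
    # Copy so the caller's environment is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["BLIS_NUM_THREADS"] = str(num_threads)
    return env
```

The result can be passed as `env=` to `subprocess.run`, so each run records exactly which thread count it used.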

Notes meeting 20210311

Kenneth
- failing numpy tests
  - building BLIS differently doesn't help, same tests fail
    - run make test (rather than make check, which is only a minimal test suite)
    - toolchainopts = {'optarch': False, 'vectorize': False, 'lowopt': True, 'strict': True}
    - buildopts = 'ENABLE_VERBOSE=yes' (verbose build output)
  - SciPy-bundle changes
    - toolchainopts = {'vectorize': False}
  - should also try with:
    - 'noopt': True => -O0
    - 'debug': True
  - cause of these numerical problems is unclear...
    - could use GCC's address sanitizer feature (ASan) to find uninitialized variables
  - FlexiBLAS: took a brief look at this, have something working on top of OpenBLAS+BLIS
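The build variants discussed above could be written down as easyconfig-style parameters (easyconfigs are plain Python assignments). A sketch only: the `variant` helper is hypothetical, dropping `lowopt` when `noopt` is set is a choice made here to request a single optimization level, and `runtest = 'test'` assumes `runtest` is the parameter that selects the make target for the test step:

```python
# Toolchain options already tried for BLIS (taken from the notes above).
tried = {'optarch': False, 'vectorize': False, 'lowopt': True, 'strict': True}


def variant(base, **overrides):
    """Hypothetical helper: copy base options and apply overrides."""
    opts = dict(base)
    opts.update(overrides)
    # Choice made for this sketch: request only one optimization level.
    if opts.get('noopt'):
        opts.pop('lowopt', None)
    return opts


# Variants still to try.
to_try = [
    variant(tried, noopt=True),              # 'noopt': True => -O0
    variant(tried, noopt=True, debug=True),  # -O0 plus 'debug': True
]

# Easyconfig-style parameters for one variant.
toolchainopts = to_try[1]
buildopts = 'ENABLE_VERBOSE=yes'  # verbose build output
runtest = 'test'                  # assumption: runs `make test` instead of `make check`
```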
Sam
- CP2K
  - see https://github.com/easybuilders/blis-eval/blob/main/apps/cp2k/debug.md for GDB session to deep-dive into segfaulting CP2K run
  - looks like it may be uninitialized value in libFLAME?
  - should try to:
    - also build libFLAME with stricter compilation options (incl. noopt)
    - try using LAPACK rather than libFLAME (goblf/2020b)
      - Åke has some patches for LAPACK that fix correctness issues
      - see also https://github.com/akesandgren/lapack.git
      - also run LAPACK test suite on top of libFLAME: https://github.com/easybuilders/blis-eval/tree/main/ake/blas-correctness-test
    - same for numpy?
- BLAS-Tester
  - we need to control sizes for the matrices (-N) and target FLOPS (-F)
  - see https://github.com/easybuilders/blis-eval/blob/main/jpecar/run-blas-tester.sh
  - Jure is not checking for failing tests yet, will do
  - current runs vary a lot, too small?
Sebastian
- BLAS-3 results on Zen2 with gobff/2020b vs foss/2020b + gomkl/2020b
  - single-threaded
    - BLIS and OpenBLAS are very close
    - large gap with MKL (+20% slower), but a lot better than with older MKL versions
    - see https://github.com/easybuilders/blis-eval/blob/main/low-level/blas3/plots/jurecadc_zen2/l3_perf_zen2_nt1.pdf
    - clearly better than https://github.com/flame/blis/blob/master/docs/graphs/large/l3_perf_zen2_nt1.pdf
  - multi-threaded (single socket)
    - BLIS is a lot better than OpenBLAS, and to a lesser extent better than MKL
    - https://github.com/easybuilders/blis-eval/blob/main/low-level/blas3/plots/jurecadc_zen2/l3_perf_zen2_jc4ic4jr4_nt64.pdf
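The multi-threaded plot name carries the tag `jc4ic4jr4_nt64`. Assuming this follows the naming of the BLIS graphs, where `jc`/`ic`/`jr` are per-loop thread counts (set through `BLIS_JC_NT`, `BLIS_IC_NT`, `BLIS_JR_NT`) and `nt` is the total thread count, a small sketch to turn such a tag into environment settings and sanity-check it could look like this; `parse_thread_tag` and `ways_product` are hypothetical helpers:

```python
import re


def parse_thread_tag(tag):
    """Split a tag like 'jc4ic4jr4_nt64' into BLIS loop-level settings.

    Returns (env, total): env maps BLIS_*_NT variable names to string
    values, total is the number after 'nt' (or None if absent).
    """
    ways = re.findall(r"(jc|pc|ic|jr|ir)(\d+)", tag)
    env = {"BLIS_%s_NT" % loop.upper(): count for loop, count in ways}
    match = re.search(r"nt(\d+)$", tag)
    total = int(match.group(1)) if match else None
    return env, total


def ways_product(env):
    """Product of all per-loop thread counts."""
    product = 1
    for value in env.values():
        product *= int(value)
    return product
```

For `jc4ic4jr4_nt64` the per-loop counts multiply to 4 x 4 x 4 = 64, which matches the `nt64` part of the tag.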
what is our end goal for this?
- EB Tech Talk?
- paper?
- systems with diff. archs: x86_64 (Intel, AMD), Arm64 (Graviton2, A64FX), POWER9
- low-level benchmarks + apps like CP2K, numpy (once we figure out correctness testing)

Notes meeting 20210304

Kenneth (pre-meeting notes)
- BLIS test step:
  - we run make check, which only runs a lightweight test (checkblis-fast checkblas, <1min)
  - we should run make test, which runs a slightly longer test (~5min on Haswell)
  - see also https://github.com/flame/blis/blob/master/docs/BuildSystem.md#step-3b-testing-optional + https://github.com/flame/blis/blob/master/docs/Testsuite.md
- numpy (see https://github.com/easybuilders/blis-eval/tree/main/apps/python)
  - handful of tests in numpy test suite fail with gobff/2020b and iibff/2020b
    - not with foss/2020b or intel/2020b, so due to BLIS+libFLAME?
    - same problem with numpy 1.19.4 (SciPy-bundle 2020.11) and latest numpy 1.20.1
    - same problem with gobff/2020.11 (BLIS version of foss/2020a)
    - same problem on Intel Skylake and AMD Rome
    - Åke: will test without -fno-math-errno to see if that makes a difference
  - relevant links:
    - numpy support for BLIS: https://github.com/numpy/numpy/issues/7372
    - (closed) issues reporting similar test failures (TestRandomDist.test_multivariate_normal[eigh]): https://github.com/numpy/numpy/issues/15546, https://github.com/numpy/numpy/issues/16567
  - TODO:
    - Are others seeing the same test failures?
      - see https://github.com/easybuilders/blis-eval/blob/main/apps/python/README.md
    - Installing with --skip-test-step, reproduce problem outside of numpy tests?
    - Open issue in numpy and/or BLIS repo(s)?
larger matrix sizes for dgemm tests on JUWELS Skylake by Sebastian
- see https://github.com/easybuilders/blis-eval/blob/main/low-level/dgemm/eval/juwels/dgemm-juwels.ipynb
- added results for AMD Rome 7742
  - Single core performance good for all 3 implementations ~50 Gflops
  - BLIS is fastest at socket/node level
BLAS 3 BLIS tests by Sebastian
- https://github.com/flame/blis/tree/master/test/3
- running on JUWELS (Skylake), still in progress, need to make plots
Sam: testing CP2K
- regression tests: more failures with gobff/2020b than with foss/2020b (?); 80 tests fail with segmentation faults, see https://github.com/easybuilders/blis-eval/tree/main/apps/cp2k
Åke: more correctness testing -- no progress this week
BLAS-Tester tool: see notes from Åke below
- Sam ran them, correctness tests all passed, performance is a mixed bag
- Jure will run them too
stuff to look into:
- HPL on a couple of systems => Bart: still working on this, need to figure out optimal MPI/OpenMP configuration for BLIS on Intel.
- script to collect system info => Jure: https://github.com/easybuilders/blis-eval/blob/main/jpecar/checkenv.sh everyone should run it and see if it works. Can also use lstopo output

Notes meeting 20210225

numpy tests by Kenneth, see https://github.com/easybuilders/blis-eval/tree/main/apps/python
- failing numpy tests
  - 5 tests fail with gobff/2020b
  - 8 tests fail with iibff/2020b
- need to double check how numpy was compiled on top of BLIS...
- also check with numpy built without optimizations (-O0)
larger matrix sizes for dgemm tests on JUWELS Skylake by Sebastian
- see https://github.com/easybuilders/blis-eval/blob/main/low-level/dgemm/eval/juwels/dgemm-juwels.ipynb
BLAS 3 BLIS tests by Sebastian
- https://github.com/flame/blis/tree/master/test/3
- running on JUWELS (Skylake), still running (~1h)
Sam: testing CP2K
- regression tests: same failures with gobff/2020b as with foss/2020b (?)
Åke: more correctness testing
- see https://github.com/easybuilders/blis-eval/tree/main/ake/blas-correctness-test
- BLIS results look pretty good, better than OpenBLAS
- need to take a detailed look at how bad the failures are
BLAS-Tester tool: see notes from Åke below
does it make sense to compile BLIS with -march=native enabled?
take a closer look at FlexiBLAS
- https://github.com/mpimd-csc/flexiblas - https://www.mpi-magdeburg.mpg.de/projects/flexiblas
- seems like a good choice for BLAS/LAPACK lib in toolchains?
- has some cool features, like profiling
stuff to look into:
- look into Åke's BLAS/LAPACK correctness stuff on AMD Rome => Kenneth
- check failing numpy tests => Kenneth
- BLAS 3 tests on JUWELS (Skylake, AMD Rome) => Sebastian
- study failing LAPACK tests a bit better + open issue to BLIS on this => Åke
- CP2K => Sam?
- HPL on a couple of systems => Bart
  - stick to single socket to remove effects of interconnect, etc.
  - HPL parameters need to be tweaked for different BLAS libraries...
- script to collect system info => Jure

Notes meeting 20210218

new BLIS-based toolchains
- BLIS moved to GCCcore because it doesn't like being built with Intel compilers (see https://github.com/flame/blis/pull/372)
- gobff/2020b, iibff/2020b (+ gomkl/2020b), to be included with EasyBuild v4.3.3
BLAS test suite (Åke)
- tested with gobff/2020b
- BLAS tests suggest that BLIS isn't fully IEEE754 compliant?
- unclear whether this also happens with OpenBLAS or MKL
- Åke needs to refresh things a bit, perhaps reach out to BLIS developers
- see https://github.com/easybuilders/blis-eval/tree/main/ake/blas-correctness-test

Sam tested https://github.com/xianyi/BLAS-Tester

ran into linking errors when using BLIS

gcc -I./include -DAdd_ -DStringSunStyle -DATL_OS_Linux -DTHREADNUM=4 -DF77_INTEGER=int -fopenmp -m64 -O3 -o ./bin/xsl1blastst sl1blastst.o ATL_sf77rotg.o ATL_sf77rot.o ATL_sf77rotmg.o ATL_sf77rotm.o ATL_sf77swap.o ATL_sf77scal.o ATL_sf77copy.o ATL_sf77axpy.o ATL_sf77dot.o ATL_sdsf77dot.o ATL_dsf77dot.o ATL_sf77nrm2.o ATL_sf77asum.o ATL_sf77amax.o ATL_sf77rotgf.o ATL_sf77rotf.o ATL_sf77rotmgf.o ATL_sf77rotmf.o ATL_sf77swapf.o ATL_sf77scalf.o ATL_sf77copyf.o ATL_sf77axpyf.o ATL_sf77dotf.o ATL_sdsf77dotf.o ATL_dsf77dotf.o ATL_sf77nrm2f.o ATL_sf77asumf.o ATL_sf77amaxf.o ATL_sf77aminf.o ATL_flushcache.o ATL_sinfnrm.o ATL_rand.o ATL_svdiff.o ATL_sf77amin.o ./refblas/librefblas.a /apps/brussel/CO7/skylake/software/BLIS/0.8.0-GCCcore-10.2.0/lib/libblis.so -lm -lgfortran -lpthread
ATL_sf77amin.o:ATL_f77amin.c:function OPENBLAS_sf77amin: error: undefined reference to 'isamin_'
collect2: error: ld returned 1 exit status
make: *** [xsl1blastst] Error 1

Åke may be able to help with that...
- Use NO_EXTENSION=1
- And one can set TEST_BLAS=-lblis to make it simpler

Sebastian starting with low-level benchmarks on JUWELS (Skylake partition)
- see https://github.com/easybuilders/blis-eval/tree/main/low-level/dgemm
- strange fluctuations with OpenBLAS on full node?
Sam is looking into building CP2K with gobff
- already includes a regression test
- default: popt, should also look into psmp

Notes meeting 20210210

Tasks

correctness checking
- run netlib BLAS/LAPACK tests (Åke)
- netlib BLAS tests with BLIS
- netlib LAPACK tests with BLIS+LAPACK
- netlib LAPACK tests with BLIS+libFLAME
- ~~also https://github.com/xianyi/BLAS-Tester (Sam)~~ does not work with BLIS
low-level performance testing (Sebastian)
- benchmark specific BLAS functions like dgemm
- https://github.com/flame/blis/tree/master/test/3
- compare to OpenBLAS/MKL
- Sebastian has a tool for interactive evaluation of BLAS/LAPACK functions
  - requires Python 2
  - see https://github.com/HPAC/ELAPS
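When comparing dgemm timings across libraries, it helps to convert every measurement to GFLOPS the same way. A minimal sketch using the standard operation count of 2·m·n·k for a matrix-matrix multiply; the function names and the best-of-N timing strategy are assumptions made for this example, not what the existing benchmark scripts do:

```python
import time


def dgemm_flops(m, n, k):
    """Floating-point operations for C = A*B with A (m x k) and B (k x n)."""
    return 2.0 * m * n * k


def gflops(flop_count, seconds):
    """Convert an operation count and a wall time to GFLOPS."""
    return flop_count / seconds / 1e9


def time_best_of(func, repeats=3):
    """Return the shortest wall time of `repeats` calls to func()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best
```

As a worked example, a 1000 x 1000 x 1000 dgemm that takes 0.04 s corresponds to 2e9 / 0.04 / 1e9 = 50 GFLOPS. With a numpy linked against the BLAS under test, `time_best_of(lambda: a @ b)` would be one way to obtain the timing.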
gearshift FFTW benchmark (ask Miguel?)
- Kenneth: see also PR for Christian with FFTW app

Toolchains

Sebastian, Kenneth
gobff/2020a + 2020b (PR is ready)
- foss with OpenBLAS replaced by BLIS+libFLAME+FFTW
- compare with foss + gomkl
  - custom gobff-amd (patched BLIS+libFLAME+FFTW)
iibff
- intel with MKL replaced by BLIS+libFLAME+FFTW
FFTW 3.3.9 is out

Test systems

TODO: collect exact hardware info per site in blis-eval
- CPU model numbers, see lscpu output
- memory channels (hwloc?, sudo dmidecode -t memory)
- STREAM benchmark results
  - see Åke's custom version (more exact timings)
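To get the CPU details into blis-eval in a uniform shape, the `lscpu` output can be parsed into key/value pairs. A rough sketch, independent of any existing script in the repository; `parse_lscpu` and `collect_cpu_info` are hypothetical names, and the parsing assumes the usual `Key: value` layout of `lscpu`:

```python
import subprocess


def parse_lscpu(text):
    """Parse 'Key: value' lines (as printed by lscpu) into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def collect_cpu_info():
    """Run lscpu on the current machine and return the parsed fields."""
    result = subprocess.run(
        ["lscpu"], check=True, capture_output=True, text=True
    )
    return parse_lscpu(result.stdout)
```

Fields such as `Model name` can then be written to a per-site file next to the benchmark results.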
AMD Rome
- HPC-UGent (doduo): Rome
- EMBL (Jure): Rome + Naples
- Compute Canada (Bart): Rome (single-node)
- JSC: Rome
- Azure (Davide): various Rome SKUs (124-core, 120 usable)
Intel
- HPC-UGent (Kenneth): Haswell, Skylake, Cascade Lake
- VUB (Sam): Ivy Bridge, Haswell, Broadwell, Skylake
- EMBL (Jure): Skylake
- SURF: Cascade Lake
- Compute Canada: same, KNL
- Umeå (Åke): Broadwell, Skylake, (KNL)
- JSC: Skylake
- Azure (Davide): various (incl. special)
other
- Arm (Kenneth @ AWS)
- POWER9 (Kenneth?, via UBirm.)
Bart: 6248 vs 6248R makes a big difference...

Applications

HPL (Bart)
CP2K (Sam, Robert)
- Sam has some experience with this
- h2o_128 benchmark included in CP2K
VASP
- too dependent on their shitty code
- fair amount in BLAS, most in FFTW
- Åke: may not be a good fit for this effort...
- Åke has a test suite (correctness) + benchmarks (with some scientific validation)
  - specific to VASP 5.4.4
  - based on https://www.nsc.liu.se/~pla/vasptest/ by Peter Larsson (Åke has changes on top)
numpy/scipy test suites (Kenneth)
QuantumESPRESSO (Robert, Sebastian)
- standard benchmarks

Notes

previous experiments by Bart

Some HPL results (could be improved upon)
T/V              N    NB     P     Q     seconds      GFLOPS  (CPU, BLAS lib)
--------------------------------------------------------------------------------
WR11C2R4    128000   384     8     8      678.88   2.059e+03  (7452, MKL 2020.1)
WR12R2R4    177000   192     8     8     1528.47   2.419e+03  (7452, MKL 2020.0, MKL_DEBUG_CPU_TYPE=5)
WR12R2R4    168960   232     4     4     1370.64  2.3461e+03  (7452, AMD BLIS)
WR12R2R4    177000   232     4     4     1629.23  2.2691e+03  (7452, OpenBLAS)

newer MKL versions have custom kernels for AMD Rome
$MKL_DEBUG_CPU_TYPE no longer works with MKL 2020.1 (and is generally unsafe on AMD Rome)
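The GFLOPS column in the HPL results can be cross-checked from N and the run time, since HPL counts 2/3·N³ + 3/2·N² floating-point operations for the solve. A small sketch (the function name `hpl_gflops` is chosen for this example; the rows are copied from the table):

```python
def hpl_gflops(n, seconds):
    """GFLOPS for an HPL run of problem size n that took `seconds`."""
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1e9


# (N, seconds, reported GFLOPS) from the HPL results table.
rows = [
    (128000, 678.88, 2.059e+03),
    (177000, 1528.47, 2.419e+03),
    (168960, 1370.64, 2.3461e+03),
    (177000, 1629.23, 2.2691e+03),
]

for n, seconds, reported in rows:
    computed = hpl_gflops(n, seconds)
    print("N=%d: computed %.1f GFLOPS, reported %.1f GFLOPS" % (n, computed, reported))
```

The same function is handy for estimating the run time of a planned N before submitting a job.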
