BLIS evaluation

Sam Moors edited this page Mar 18, 2021 · 23 revisions

Practical

In scope

  • BLIS + libFLAME (LAPACK)
  • gobff vs foss
  • iibff vs intel
  • also FFTW?

Notes meeting 20210318

  • Åke

    • BLAS testing
    • LAPACK testing (using BLIS only)
      • More results from Skylake and AMD EPYC (zen2)
        • Skylake has more errors than Broadwell
        • AMD (zen2) has the same number of errors as Broadwell
    • LAPACK testing with libFLAME (and refblas)
      • libFLAME doesn't contain all the functions needed, so we also have to link with the reflapack lib.
      • the first test, xlintsts < stest.in, fails with "Segmentation fault - invalid memory reference."
  • Sam

    • CP2K with goblf (BLIS, LAPACK, no libFLAME):
      • fixes all extra failed tests (summary now looks exactly the same as with foss)
      • performance tests underway - need to use BLIS_NUM_THREADS?
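
To try out the BLIS_NUM_THREADS question above, a small launcher can set the variable per run. This is a minimal sketch, assuming a multithreaded BLIS build that reads `BLIS_NUM_THREADS` at run time; the helper names `blis_env` and `run_with_blis_threads` are hypothetical, and the command in the demo is only a stand-in for the real benchmark command line.

```python
import os
import subprocess
import sys


def blis_env(num_threads, base_env=None):
    """Return a copy of the environment with BLIS_NUM_THREADS set."""
    if num_threads < 1:
        raise ValueError("num_threads must be >= 1")
    env = dict(os.environ if base_env is None else base_env)
    env["BLIS_NUM_THREADS"] = str(num_threads)
    return env


def run_with_blis_threads(cmd, num_threads):
    """Run cmd (a list of arguments) with BLIS_NUM_THREADS set; return its stdout."""
    result = subprocess.run(
        cmd,
        env=blis_env(num_threads),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Stand-in command (assumption): a child Python process that echoes the
    # variable. Replace it with the actual benchmark command.
    echo = [sys.executable, "-c", "import os; print(os.environ['BLIS_NUM_THREADS'])"]
    for threads in (1, 2, 4):
        print(threads, run_with_blis_threads(echo, threads).strip())
```

Running the same benchmark for a few thread counts this way makes it easy to see whether the variable has any effect on timings.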

Notes meeting 20210311


Notes meeting 20210304


Notes meeting 20210225


Notes meeting 20210218

  • new BLIS-based toolchains

    • BLIS moved to GCCcore because it doesn't like being built with Intel compilers (see https://github.com/flame/blis/pull/372)
    • gobff/2020b, iibff/2020b (+ gomkl/2020b), to be included with EasyBuild v4.3.3
  • BLAS test suite (Åke)

  • Sam tested https://github.com/xianyi/BLAS-Tester

    • ran into linking errors when using BLIS
      gcc -I./include -DAdd_  -DStringSunStyle -DATL_OS_Linux  -DTHREADNUM=4  -DF77_INTEGER=int -fopenmp -m64 -O3 -o ./bin/xsl1blastst sl1blastst.o ATL_sf77rotg.o ATL_sf77rot.o ATL_sf77rotmg.o ATL_sf77rotm.o ATL_sf77swap.o ATL_sf77scal.o ATL_sf77copy.o ATL_sf77axpy.o ATL_sf77dot.o ATL_sdsf77dot.o ATL_dsf77dot.o ATL_sf77nrm2.o ATL_sf77asum.o ATL_sf77amax.o ATL_sf77rotgf.o ATL_sf77rotf.o ATL_sf77rotmgf.o ATL_sf77rotmf.o ATL_sf77swapf.o ATL_sf77scalf.o ATL_sf77copyf.o ATL_sf77axpyf.o ATL_sf77dotf.o ATL_sdsf77dotf.o ATL_dsf77dotf.o ATL_sf77nrm2f.o ATL_sf77asumf.o ATL_sf77amaxf.o ATL_sf77aminf.o ATL_flushcache.o ATL_sinfnrm.o ATL_rand.o ATL_svdiff.o ATL_sf77amin.o  ./refblas/librefblas.a /apps/brussel/CO7/skylake/software/BLIS/0.8.0-GCCcore-10.2.0/lib/libblis.so  -lm -lgfortran -lpthread
      ATL_sf77amin.o:ATL_f77amin.c:function OPENBLAS_sf77amin: error: undefined reference to 'isamin_'
      collect2: error: ld returned 1 exit status
      make: *** [xsl1blastst] Error 1
      
    • Åke may be able to help with that...
      • Use NO_EXTENSION=1
      • And one can set TEST_BLAS=-lblis to make it simpler
  • Sebastian starting with low-level benchmarks on JUWELS (Skylake partition)

  • Sam is looking into building CP2K with gobff

    • already includes a regression test
    • default: popt, should also look into psmp
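
The BLAS-Tester workaround noted above (NO_EXTENSION=1 and TEST_BLAS=-lblis) can be captured in a small helper so the build is reproducible across sites. This is a sketch only: the two variable names come from the notes, but passing them as `make` command-line variables is an assumption, and the helper name `blas_tester_make_cmd` is hypothetical.

```python
import shlex


def blas_tester_make_cmd(test_blas="-lblis", no_extension=True, extra_vars=None):
    """Build the argument list for a BLAS-Tester make invocation."""
    cmd = ["make"]
    if no_extension:
        # Per the notes: use NO_EXTENSION=1 when testing against BLIS.
        cmd.append("NO_EXTENSION=1")
    cmd.append(f"TEST_BLAS={test_blas}")
    # Any further variables are appended in sorted order for reproducibility.
    for key, value in sorted((extra_vars or {}).items()):
        cmd.append(f"{key}={value}")
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be checked first.
    print(shlex.join(blas_tester_make_cmd()))
```

The resulting list can be handed to `subprocess.run` from within a BLAS-Tester checkout once the command looks right.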

Notes meeting 20210210

Tasks

  • correctness checking

    • run netlib BLAS/LAPACK tests (Åke)
    • netlib BLAS tests with BLIS
    • netlib LAPACK tests with BLIS+LAPACK
    • netlib LAPACK tests with BLIS+libFLAME
    • also https://github.com/xianyi/BLAS-Tester (Sam): does not work with BLIS
  • low-level performance testing (Sebastian)

  • gearshift FFTW benchmark (ask Miguel?)

    • Kenneth: see also PR for Christian with FFTW app

Toolchains

  • Sebastian, Kenneth
  • gobff/2020a + 2020b (PR is ready)
    • foss with OpenBLAS replaced by BLIS+libFLAME+FFTW
    • compare with foss + gomkl
      • custom gobff-amd (patched BLIS+libFLAME+FFTW)
  • iibff
    • intel with MKL replaced by BLIS+libFLAME+FFTW
  • FFTW 3.3.9 is out

Test systems

  • TODO: collect exact hardware info per site in blis-eval

    • CPU model numbers, see lscpu output
    • memory channels (hwloc?, sudo dmidecode -t memory)
    • STREAM benchmark results
      • see Åke custom version (more exact timings)
  • AMD Rome

    • HPC-UGent (doduo): Rome
    • EMBL (Jure): Rome + Naples
    • Compute Canada (Bart): Rome (single-node)
    • JSC: Rome
    • Azure (Davide): various Rome SKUs (124-core, 120 usable)
  • Intel

    • HPC-UGent (Kenneth): Haswell, Skylake, Cascade Lake
    • VUB (Sam): Ivy Bridge, Haswell, Broadwell, Skylake
    • EMBL (Jure): Skylake
    • SURF: Cascade Lake
    • Compute Canada: same, KNL
    • Umeå (Åke): Broadwell, Skylake, (KNL)
    • JSC: Skylake
    • Azure (Davide): various (incl. special)
  • other

    • Arm (Kenneth @ AWS)
    • POWER9 (Kenneth?, via UBirm.)
  • Bart: 6248 vs 6248R makes a big difference...
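
For the hardware-info TODO above, the CPU part could be collected with a short script around `lscpu`. This is a minimal sketch, assuming `lscpu` prints `key: value` lines; the selection of fields in `FIELDS` and the helper names are illustrative choices, not an agreed format for blis-eval.

```python
import shutil
import subprocess

# Fields to report per site (an illustrative selection, not an agreed list).
FIELDS = ("Model name", "Socket(s)", "Core(s) per socket", "Thread(s) per core")


def parse_lscpu(text):
    """Parse 'key: value' lines from lscpu output into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            info[key.strip()] = value.strip()
    return info


def summarize(info):
    """Pick the fields of interest, marking missing ones as 'unknown'."""
    return {field: info.get(field, "unknown") for field in FIELDS}


if __name__ == "__main__":
    if shutil.which("lscpu") is None:
        print("lscpu not found on this system")
    else:
        out = subprocess.run(
            ["lscpu"], capture_output=True, text=True, check=True
        ).stdout
        for field, value in summarize(parse_lscpu(out)).items():
            print(f"{field}: {value}")
```

Memory channel information (hwloc, dmidecode) would need a separate step, since `dmidecode` requires root.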

Applications

  • HPL (Bart)
  • CP2K (Sam, Robert)
    • Sam has some experience with this
    • h2o_128 benchmark included in CP2K
  • VASP
    • too dependent on the quality of VASP's own code
    • fair amount in BLAS, most in FFTW
    • Åke: may not be a good fit for this effort...
    • Åke has a test suite (correctness) + benchmarks (with some scientific validation)
  • numpy/scipy test suites (Kenneth)
  • QuantumESPRESSO (Robert, Sebastian)
    • standard benchmarks

Notes

  • previous experiments by Bart
    Some HPL results (could be improved upon)
    (LAPACK params)       N   NB  P  Q  seconds      GFLOPS  (CPU, BLAS lib)
    ----------------------------------------------------------------------------------------------------
    WR11C2R4         128000  384  8  8   678.88   2.059e+03  (7452, MKL 2020.1)
    WR12R2R4         177000  192  8  8  1528.47   2.419e+03  (7452, MKL 2020.0, MKL_DEBUG_CPU_TYPE=5)
    WR12R2R4         168960  232  4  4  1370.64  2.3461e+03  (7452, AMD BLIS)
    WR12R2R4         177000  232  4  4  1629.23  2.2691e+03  (7452, OpenBLAS)
    
    • newer MKL versions have custom kernels for AMD Rome
    • $MKL_DEBUG_CPU_TYPE no longer works with MKL 2020.1 (and is generally unsafe on AMD Rome)
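
As a sanity check on the HPL table above, the GFLOPS column can be recomputed from N and the run time. This sketch assumes the usual HPL operation count of 2/3·N³ + 3/2·N² floating-point operations; the `RUNS` list simply restates the four rows from the table, and the function name `hpl_gflops` is hypothetical.

```python
def hpl_gflops(n, seconds):
    """GFLOPS for an HPL run, assuming 2/3*N^3 + 3/2*N^2 operations."""
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1e9


# (label, N, seconds, GFLOPS as reported in the table above)
RUNS = [
    ("7452, MKL 2020.1", 128000, 678.88, 2.059e3),
    ("7452, MKL 2020.0, MKL_DEBUG_CPU_TYPE=5", 177000, 1528.47, 2.419e3),
    ("7452, AMD BLIS", 168960, 1370.64, 2.3461e3),
    ("7452, OpenBLAS", 177000, 1629.23, 2.2691e3),
]

if __name__ == "__main__":
    for label, n, seconds, reported in RUNS:
        computed = hpl_gflops(n, seconds)
        # Relative difference between recomputed and reported values.
        rel = abs(computed - reported) / reported
        print(f"{label}: computed {computed:.1f}, reported {reported:.1f}, rel. diff {rel:.1e}")
```

Because the runs use different N, comparing GFLOPS rather than wall-clock seconds is the fair way to rank the libraries.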