Compiler Flags for Different Architectures

plavin edited this page Apr 9, 2019 · 19 revisions

Last updated: 3/29/19

This page lists flags that are used to compile Spatter and STREAM comparisons on different architectures.

STREAM

Some general notes for STREAM can be found at this blog post:

Additionally:

  • ICC generally will generate the best quality code for STREAM and Spatter on Intel architectures.
  • Streaming loads/stores may be needed to reach "peak" performance.

STREAM flags per architecture

Common flags for Intel compilers with OpenMP backend: -Ofast -qopenmp -qopenmp-link=static -fargument-noalias

TBD - when do we use -ffreestanding?

Note that in many cases, you can check for vectorized instructions by generating the assembly with the -S flag or by using objdump -d <compiled_app> to look at the assembly code. As mentioned in this StackOverflow post, you want to look for instructions with names like vgatherpf0qpd.
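The disassembly check described above can be scripted. The following is a minimal sketch and not part of Spatter: the function name count_simd_lines is hypothetical, and the pattern (gather/scatter mnemonics plus ymm/zmm register names) is only an assumption about what "vectorized" x86 code looks like.

```shell
#!/bin/sh
# Hypothetical helper (not part of Spatter): count disassembly lines that
# look vectorized. Reads text on stdin, e.g. from `objdump -d <compiled_app>`.
count_simd_lines() {
    # Match gather/scatter mnemonics or 256/512-bit register names.
    # grep -c prints the number of matching lines; `|| true` keeps a
    # count of zero from being reported as a failure.
    grep -E -c 'vgather|vscatter|[yz]mm[0-9]+' || true
}
```

A typical call would be objdump -d <compiled_app> | count_simd_lines. A count of 0 suggests the hot loops were not vectorized with wide registers, though a nonzero count alone does not prove that the loop of interest was.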

Architecture | Short Name | Compiler | Flags | Notes
--- | --- | --- | --- | ---
Sandy Bridge | SNB | icc | -march=sandybridge |
Broadwell | BDW | icc | -march=broadwell |
Skylake | SKL | icc | -march=skylake |
Skylake with AVX512 | SKL | icc | -march=skylake-avx512 |
 | | cce | -hvector2 or -hvector3 | moderate or aggressive vectorization
 | | | -hvector1 or -hscalar1/2/3 | limited automatic vectorization
Knights Landing with AVX512 and MCDRAM | KNL | icpc | -xCOMMON-AVX512 | Compilation notes
Power9 | PWR9 | codexl | | Use xlc_r to create thread-safe version of Spatter
 | | | -qtune=pwr9 | Tune for Power9 arch (auto tunes for arch where compiled)
 | | | -qsimd=auto | Implied for -O3 or higher opt level
 | | | -qenablevmx | Enable vector generation
 | | | -qhot=vector |
ARM TX2 | TX2 | armclang | -O3 -mcpu=native | Let compiler decide based on host
 | | | -O3 -mcpu=thunderx2t99 |
 | | gcc | -ftree-vectorize |
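As a sketch of how the Intel rows listed above might be combined with the common flags in a build script, the helper below maps a short name to its -march flag and prints the resulting command line. The function names arch_flag and stream_cmd, the key SKL-AVX512 (the list uses SKL for both Skylake rows), and the file names stream.c and stream are all hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: map a short architecture name to the icc -march flag
# listed on this page. SKL-AVX512 is an invented key to tell the two
# Skylake rows apart.
arch_flag() {
    case "$1" in
        SNB)        echo "-march=sandybridge" ;;
        BDW)        echo "-march=broadwell" ;;
        SKL)        echo "-march=skylake" ;;
        SKL-AVX512) echo "-march=skylake-avx512" ;;
        *)          echo "unknown architecture: $1" >&2; return 1 ;;
    esac
}

# Print (do not run) an icc command that combines the common OpenMP flags
# with the per-architecture flag. stream.c and stream are placeholder names.
stream_cmd() {
    flag=$(arch_flag "$1") || return 1
    echo "icc -Ofast -qopenmp -qopenmp-link=static -fargument-noalias $flag stream.c -o stream"
}
```

Printing the command rather than running it makes it easy to inspect the flags before building, for example with stream_cmd BDW.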

Intel-specific flags

To use HBM on KNL:

#Check mem settings
numactl -H 
#Run on NUMA mem region 1 (HBM)
numactl --membind 1 ./run-app
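The node number 1 above depends on how the machine is configured. As a hedged sketch, assuming a KNL booted in flat memory mode where MCDRAM shows up as a NUMA node with no CPUs attached (an assumption about the system, not something this page states), the hypothetical helper below picks such nodes out of numactl -H output.

```shell
#!/bin/sh
# Hypothetical helper: print the IDs of NUMA nodes that have no CPUs,
# given `numactl -H` output on stdin. Assumes lines of the form
# "node <id> cpus: <cpu list>", with an empty list for CPU-less nodes.
cpuless_nodes() {
    awk '$1 == "node" && $3 == "cpus:" && NF == 3 { print $2 }'
}
```

It could be used as numactl -H | cpuless_nodes to confirm which node to pass to --membind before running the application.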

How do we know if code has been vectorized with a specific compiler?

Returns info on which loops were vectorized and why: -qopt-report=1 -qopt-report-phase=vec

Returns info on loops that were not vectorized and why: -qopt-report-phase=vec,loop -qopt-report=2

You can also use the following high-level flag: -vec-report=3
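Optimization reports can be long, so a quick tally helps. The sketch below assumes the report contains remarks worded like "LOOP WAS VECTORIZED" and "loop was not vectorized", which is typical of icc reports but may differ between compiler versions. The function name optrpt_summary is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: tally vectorization remarks in an Intel optimization
# report read from stdin. The remark wording is an assumption; matching is
# case-insensitive to tolerate variations in capitalization.
optrpt_summary() {
    awk '
        { line = tolower($0) }
        line ~ /loop was vectorized/     { vec++ }
        line ~ /loop was not vectorized/ { missed++ }
        END { printf "vectorized=%d not_vectorized=%d\n", vec, missed }
    '
}
```

With the -qopt-report flags, recent icc versions typically write the report to a file next to the object file, so a call might look like optrpt_summary < stream.optrpt, where stream.optrpt is a placeholder name.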

CodeXL

The -qreport or -qlist flags can be used to generate high-order transformation (HOT) reports or to print an object listing of the code.

Armclang

With the Arm HPC compiler (armclang), the following flags print a vectorization report: -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize -Rpass-missed=loop-vectorize

As an example:

armclang -O3 -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize -Rpass-missed=loop-vectorize example.c -gline-tables-only 2> vecreport.txt
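The resulting vecreport.txt mixes successful, missed, and analysis remarks. As a minimal sketch, assuming the usual clang remark format of file:line:column followed by the message and the flag name in brackets, the hypothetical helper below lists only the source locations of loops that were not vectorized.

```shell
#!/bin/sh
# Hypothetical helper: list "file:line" for every missed-vectorization
# remark in a clang-style report read from stdin. Assumes each remark ends
# with the flag that produced it, e.g. [-Rpass-missed=loop-vectorize].
missed_loops() {
    # -F treats the bracketed flag as a literal string, not a regex.
    grep -F -- '[-Rpass-missed=loop-vectorize]' | cut -d: -f1,2
}
```

For the example above this would be run as missed_loops < vecreport.txt.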

Alternatively, you can disassemble the binary using armllvm-objdump with the correct disassemble flags, or you can use the -S flag to generate the assembly code during compilation.

#From the basic SVE example; ld1w and st1w are SVE instructions
$armllvm-objdump -disassemble -mattr=+sve example &> example.dis
#Sample output from example.dis - ld1w and st1w are both SVE instructions
400898: a0 42 48 a5 ld1w { z0.s }, p0/z, [x21, x8, lsl #2]
40089c: c1 42 48 a5 ld1w { z1.s }, p0/z, [x22, x8, lsl #2]
4008a0: 00 04 a1 04 sub z0.s, z0.s, z1.s
4008a4: e0 42 48 e5 st1w { z0.s }, p0, [x23, x8, lsl #2]

#Option 2 - generate assembly during compilation
$armclang -O3 -S --target=aarch64-arm-none-eabi -march=armv8-a+sve -o example.s example.c
#Sample output from example.s
.LBB1_3: // =>This Inner Loop Header: Depth=1
ld1w { z0.s }, p0/z, [x21, x8, lsl #2]
ld1w { z1.s }, p0/z, [x22, x8, lsl #2]
sub z0.s, z0.s, z1.s
st1w { z0.s }, p0, [x23, x8, lsl #2]

Cray Compiler - CCE

Use -fopt-info-missed-vec to report missed vectorization, or -fopt-info-vec-missed=vec.miss to print the report to a file.

-Rpass-analysis=loop-vectorize -Rpass=loop-vectorize -Rpass-missed=loop-vectorize