- Paolo Savini (Embecosm)
- Helene Chelin (Embecosm)
- Jeremy Bennett (Embecosm)
- Hugh O'Keeffe (Ashling)
- Nadim Shehayed (Ashling)
- Daniel Barboza (Ventana)
- WP1:
  - Get baseline scores for memory operation benchmarks.
    - The vector memcpy benchmark shows a steep increase in execution time with the size of the data (see details below).
  - Identify the most promising load/store vector instruction to optimize.
    - Deferred from last week to prioritize benchmark work.
  - Prepare routine/nightly runs of benchmarks.
    - Deferred from last week to prioritize benchmark work.
- Jeremy to set up a "smoke test" SPEC CPU 2017 run using the integer and floating point specrand programs (should run in minutes).
  - See details below.
- Get baseline scores for memory operation benchmarks.
- WP1:
  - Identify the most promising load/store vector instruction to optimize.
  - Prepare routine/nightly runs of benchmarks.
- WP2:
  - Identify optimizations in the TCG vector ld/st helper functions.
Our current set of agreed priorities is taken from the Statement of Work. It lists the following priorities, which trade off the functionality targeted against the architectures supported.
- vector load/store ops for x86_64 AVX
- vector load/store ops for AArch64/Neon
- vector integer ALU ops for x86_64 AVX
- vector load/store ops for Intel AVX10
For each of these there will be an analysis phase and an optimization phase, leading to the following set of work packages.
- WP0: Infrastructure
- WP1: Analysis of vector load/store ops on x86_64 AVX
- WP2: Optimization of vector load/store ops on x86_64 AVX
- WP3: Analysis of vector load/store ops on AArch64/Neon
- WP4: Optimization of vector load/store ops on AArch64/Neon
- WP5: Analysis of integer ALU ops on x86_64 AVX
- WP6: Optimization of integer ALU ops on x86_64 AVX
- WP7: Analysis of vector load/store ops on Intel AVX10
- WP8: Optimization of vector load/store ops on Intel AVX10
These priorities can be revised by agreement with RISE during the project.
We have both scalar and vector implementations of memcpy running with different sizes of the array being copied.
An array of unsigned 8-bit integers is filled with random values and is then copied to an empty array of equal size.
uint8_t *src = (uint8_t *) malloc (len);
uint8_t *dst = (uint8_t *) malloc (len);
mem_init_random (src, len);
for (size_t i = 0; i < WARMUP; i++)
  vmemcpy (dst, src, len); // same for the scalar version (smemcpy)
for (size_t i = 0; i < iterations; i++)
  vmemcpy (dst, src, len); // same for the scalar version (smemcpy)
free(src);
free(dst);
The same program is used for the scalar measurements, but with vmemcpy replaced by the standard memcpy from newlib.
The tests are run first with 2 million iterations of the main loop and then with 1 million, and the two timings are subtracted to cancel the overhead and the warm-up time. There is some minor variability in the results, but it is not large enough to be statistically significant.
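As a sketch of the arithmetic behind this (the exact measurement harness is not shown here): if T(N) is the wall-clock time of a complete QEMU run with N iterations of the timed loop, then T(2,000,000) - T(1,000,000) estimates the time spent in 1 million calls to the copy routine alone, since QEMU start-up, program initialization and the warm-up loop contribute equally to both runs.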
The newlib implementation is well optimized for smaller block sizes: it is only with data sizes above 128 bytes that we see execution time start to increase.
vmemcpy uses the reference assembler implementation from the RVV standard:
vmemcpy:
    mv a3, a0                      // Copy destination
loop_cpy:
    vsetvli t0, a2, e8, m8, ta, ma // Vectors of 8 regs
    vle8.v v0, (a1)                // Load bytes
    add a1, a1, t0                 // Bump pointer
    sub a2, a2, t0                 // Decrement count
    vse8.v v0, (a3)                // Store bytes
    add a3, a3, t0                 // Bump pointer
    bnez a2, loop_cpy              // Any more?
    ret
    .size vmemcpy, .-vmemcpy
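For readers more comfortable with C, a roughly equivalent version using the RVV v1.0 intrinsics is sketched below. This is not part of the benchmark sources; the function name vmemcpy_intrinsics is ours, and it assumes a toolchain that provides riscv_vector.h. It mirrors the vsetvli/vle8.v/vse8.v loop above with LMUL=8.

#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

// Sketch only: byte-wise copy using 8-register groups (LMUL=8),
// mirroring the assembly loop above.
void *vmemcpy_intrinsics (void *dst, const void *src, size_t len)
{
  uint8_t *d = dst;
  const uint8_t *s = src;
  while (len > 0)
    {
      size_t vl = __riscv_vsetvl_e8m8 (len);      // elements handled this trip
      vuint8m8_t v = __riscv_vle8_v_u8m8 (s, vl); // load bytes
      __riscv_vse8_v_u8m8 (d, v, vl);             // store bytes
      s += vl;
      d += vl;
      len -= vl;
    }
  return dst;
}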
All the programs are compiled with the released GCC version 14.1 and without optimization (-O0).
The following table compares QEMU performance over 1,000,000 iterations with both the vector (v8, LMUL=8) and scalar (s) implementations of memcpy. We show execution time in seconds, instruction count in millions (Micount) and nanoseconds per instruction executed.
length (bytes) | s time (s) | v8 time (s) | s Micount | v8 Micount | s ns/inst | v8 ns/inst |
---|---|---|---|---|---|---|
1 | 0.17 | 0.11 | 73.0 | 19.0 | 2.33 | 5.79 |
2 | 0.23 | 0.15 | 89.0 | 19.0 | 2.58 | 7.89 |
4 | 0.27 | 0.14 | 121.0 | 19.0 | 2.23 | 7.37 |
8 | 0.23 | 0.33 | 95.0 | 19.0 | 2.42 | 17.37 |
16 | 0.27 | 0.44 | 111.0 | 19.0 | 2.43 | 23.16 |
32 | 0.32 | 0.93 | 143.0 | 19.0 | 2.24 | 48.95 |
64 | 0.44 | 1.55 | 207.0 | 19.0 | 2.13 | 81.58 |
128 | 0.64 | 3.03 | 293.0 | 19.0 | 2.18 | 159.47 |
256 | 0.95 | 6.44 | 451.0 | 26.0 | 2.11 | 247.69 |
512 | 1.59 | 11.87 | 767.0 | 40.0 | 2.07 | 296.75 |
1024 | 2.81 | 22.54 | 1448.0 | 68.0 | 1.94 | 331.47 |
2048 | 5.46 | 44.94 | 2810.0 | 124.0 | 1.94 | 362.42 |
For the scalar version, which is optimized for small blocks, the number of instructions is roughly the same until we get to 32/64 bytes, after which it starts growing, becoming linear in the block size. As expected, the average time taken to execute an instruction is roughly the same throughout.
For the vector version, we see the number of instructions is constant up to block size 128, the size at which the copy can be achieved with a single loop iteration (vector length of 128 bits and LMUL=8). However the time taken per instruction grows, indicating that the time for the vle8.v and vse8.v instructions depends on the size of the data being loaded/stored.
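The instruction counts are consistent with a simple back-of-envelope model (our own check, not taken from the benchmark sources): each trip of the vmemcpy loop above executes 7 instructions and, with VLEN=128 and LMUL=8, moves up to 128 bytes, while the fixed overhead per call (including the -O0 benchmark loop) accounts for the 19M instructions observed at small sizes. A minimal sketch:

#include <stdio.h>

// Back-of-envelope model of the v8 Micount column. Assumptions: VLEN=128
// bits, LMUL=8, 7 instructions per vmemcpy loop trip, and a fixed overhead
// of 19 instructions per call (the value observed for lengths <= 128 bytes).
int main (void)
{
  const unsigned vlen_bytes = 128 / 8;                // 16 bytes per vector register
  const unsigned lmul = 8;
  const unsigned bytes_per_trip = vlen_bytes * lmul;  // 128 bytes per loop trip
  for (unsigned len = 1; len <= 2048; len *= 2)
    {
      unsigned trips = (len + bytes_per_trip - 1) / bytes_per_trip;
      unsigned minsts = 19 + 7 * (trips - 1);         // millions, for 1M calls
      printf ("len %4u: %3u loop trips, ~%3u M instructions\n",
              len, trips, minsts);
    }
  return 0;
}

For a 2048-byte copy this predicts 16 trips and 19 + 15 x 7 = 124 M instructions, matching the table.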
We also looked at the impact of LMUL=8. The following table shows the same results as above, but this time comparing vector implementations with LMUL=8 and LMUL=1 (changing the vsetvli instruction in the code above from m8 to m1).
length | v8 time | v1 time | v8 Micount | v1 Micount | v8 ns/inst | v1 ns/inst |
---|---|---|---|---|---|---|
1 | 0.11 | 0.11 | 19.0 | 19.0 | 5.79 | 5.79 |
2 | 0.15 | 0.15 | 19.0 | 19.0 | 7.89 | 7.89 |
4 | 0.14 | 0.22 | 19.0 | 19.0 | 7.37 | 11.58 |
8 | 0.33 | 0.33 | 19.0 | 19.0 | 17.37 | 17.37 |
16 | 0.44 | 0.47 | 19.0 | 19.0 | 23.16 | 24.74 |
32 | 0.93 | 1.03 | 19.0 | 26.0 | 48.95 | 39.62 |
64 | 1.55 | 2.03 | 19.0 | 40.0 | 81.58 | 50.75 |
128 | 3.03 | 3.58 | 19.0 | 68.0 | 159.47 | 52.65 |
256 | 6.44 | 7.05 | 26.0 | 124.0 | 247.69 | 56.85 |
512 | 11.87 | 13.10 | 40.0 | 236.0 | 296.75 | 55.51 |
1024 | 22.54 | 26.38 | 68.0 | 460.0 | 331.47 | 57.35 |
2048 | 44.94 | 52.73 | 124.0 | 908.0 | 362.42 | 58.07 |
As expected, with LMUL=1 the number of instructions grows once the block size exceeds 16 bytes. However there is little variation in overall QEMU execution time: while more instructions are needed, they execute faster. As the blocks get larger, the ratio between the v8 and v1 ns/inst values approaches 8, suggesting that vle8.v and vse8.v instruction execution time is proportional to LMUL.
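As a cross-check, the same back-of-envelope model as above (our own assumption, not from the benchmark sources) also fits the LMUL=1 column: each trip now moves at most 16 bytes, so a 2048-byte copy needs 128 trips and roughly 19 + 127 x 7 = 908 M instructions, matching the v1 Micount. The work is therefore spread over 8 times as many loop trips, each handling one eighth of the data.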
The 4 SPECrand programs are simple programs used primarily to validate the SPEC CPU 2017 scripts. As such, they are a useful, near-instant smoke test.
The following tests were run using release versions of GCC (14.1), binutils (2.42) and Glibc (2.39). In both scalar and vector cases only scalar multilibs of Glibc were available. Vector was enabled by adding v to the architecture string and adding the -ftree-vectorize flag.
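As an illustration of what this enables (our own example, not taken from the SPEC sources, and the compile command is only indicative): with the v extension in the architecture string and -ftree-vectorize, GCC 14 can auto-vectorize simple loops such as the one below using RVV instructions.

// Hypothetical example; built with something like
//   riscv64-unknown-linux-gnu-gcc -O2 -march=rv64gcv -ftree-vectorize
// so that the loop is vectorized with RVV loads, stores and adds.
#include <stddef.h>

void add_arrays (int *restrict c, const int *restrict a,
                 const int *restrict b, size_t n)
{
  for (size_t i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}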
Benchmark | Type | Real S | Total S | Real V | Total V | Ratio |
---|---|---|---|---|---|---|
996.specrand_fs | fp | 1.31 | 1.31 | 1.45 | 1.45 | 1.11 |
997.specrand_fr | fp | 1.27 | 1.27 | 1.45 | 1.45 | 1.14 |
998.specrand_is | int | 1.31 | 1.31 | 1.46 | 1.46 | 1.11 |
999.specrand_ir | int | 1.27 | 1.27 | 1.47 | 1.47 | 1.16 |
Total | | 5.16 | 5.16 | 5.83 | 5.83 | 1.13 |
All tests compiled and ran correctly.
Built with GCC 14.1 without optimization (-O0) and executed for 1,000,000 iterations. The results are expressed in seconds. This benchmark uses 8-bit loads and stores (vle8.v/vse8.v) with register groups of 8 (m8).
length (bytes) | s time (s) | v8 time (s) | s Micount | v8 Micount | s ns/inst | v8 ns/inst |
---|---|---|---|---|---|---|
1 | 0.17 | 0.11 | 73.0 | 19.0 | 2.33 | 5.79 |
2 | 0.23 | 0.15 | 89.0 | 19.0 | 2.58 | 7.89 |
4 | 0.27 | 0.14 | 121.0 | 19.0 | 2.23 | 7.37 |
8 | 0.23 | 0.33 | 95.0 | 19.0 | 2.42 | 17.37 |
16 | 0.27 | 0.44 | 111.0 | 19.0 | 2.43 | 23.16 |
32 | 0.32 | 0.93 | 143.0 | 19.0 | 2.24 | 48.95 |
64 | 0.44 | 1.55 | 207.0 | 19.0 | 2.13 | 81.58 |
128 | 0.64 | 3.03 | 293.0 | 19.0 | 2.18 | 159.47 |
256 | 0.95 | 6.44 | 451.0 | 26.0 | 2.11 | 247.69 |
512 | 1.59 | 11.87 | 767.0 | 40.0 | 2.07 | 296.75 |
1024 | 2.81 | 22.54 | 1448.0 | 68.0 | 1.94 | 331.47 |
2048 | 5.46 | 44.94 | 2810.0 | 124.0 | 1.94 | 362.42 |
No data to report yet.
You can find the baseline execution time and instruction count of the SPEC CPU 2017 benchmarks here.
Making a clean install of SPEC CPU and building the benchmarks took around 20-30 minutes on one of our large AMD servers. Thereafter, the scalar run completed in 22 minutes and the vector run in 63 minutes. It is worth noting that a single benchmark (625.x264_s) took a long time to complete; without it the run times would have been 12 minutes and 17 minutes respectively. Notwithstanding this, these runs are highly suitable as a quick test under Jenkins.
The following tests were run using release versions of GCC (14.1), binutils (2.42) and Glibc (2.39). In both scalar and vector cases only scalar multilibs of Glibc were available. Vector was enabled by adding v to the architecture string.
Issues:
- 7 tests failed either to run or failed their post-run checks with both the scalar and vector builds (this seems to be a scripting issue, at least in some cases); and
- at least one test (602.gcc_s) is suspiciously quick, although it does not report any failures.
Benchmark | Type | Status | Real S | Compute S | Real V | Compute V | Ratio |
---|---|---|---|---|---|---|---|
600.perlbench_s | int | Maybe | 70 | 16 | 72 | 19 | 1.18 |
602.gcc_s | int | Good | 3 | 1 | 6 | 2 | 1.58 |
603.bwaves_s | fp | Good | 253 | 1,783 | 296 | 2,050 | 1.15 |
605.mcf_s | int | Good | 388 | 125 | 427 | 138 | 1.10 |
607.cactuBSSN_s | fp | Good | 340 | 2,966 | 353 | 2,850 | 0.96 |
619.lbm_s | fp | Good | 263 | 1,122 | 313 | 1,327 | 1.18 |
620.omnetpp_s | int | Good | 340 | 92 | 397 | 117 | 1.27 |
621.wrf_s | fp | Maybe | 628 | 9,192 | 741 | 12,191 | 1.33 |
623.xalancbmk_s | int | Good | 9 | 3 | 19 | 6 | 1.63 |
625.x264_s | int | Maybe | 1,361 | 1,009 | 3,827 | 3,487 | 3.46 |
627.cam4_s | fp | Maybe | 192 | 473 | 348 | 807 | 1.71 |
628.pop2_s | fp | Good | 211 | 49 | 428 | 127 | 2.58 |
631.deepsjeng_s | int | Good | 610 | 261 | 736 | 359 | 1.37 |
638.imagick_s | fp | Maybe | 2 | 1 | 3 | 2 | 1.67 |
641.leela_s | int | Good | 418 | 128 | 548 | 209 | 1.63 |
644.nab_s | fp | Good | 528 | 7,960 | 553 | 7,885 | 0.99 |
648.exchange2_s | int | Good | 520 | 177 | 957 | 585 | 3.30 |
649.fotonik3d_s | fp | Bad | - | - | - | - | - |
654.roms_s | fp | Maybe | 2 | 2 | 2 | 3 | 1.44 |
657.xz_s | int | Good | 705 | 186 | 1,024 | 267 | 1.44 |
996.specrand_fs | fp | Good | 1 | 0 | 1 | 0 | 1.14 |
998.specrand_is | int | Good | 1 | 0 | 1 | 0 | 1.15 |
Total | | | 6,842 | 25,548 | 11,050 | 32,434 | 1.45 |
Key to the Status column:
- Good: compiled and passed post-run checks.
- Maybe: compiled, but failed post-run checks.
- Bad: failed to compile or failed to run.
2024-05-15
- Jeremy to look at impact of masked vs unmasked and strided vs unstrided vector operations.
- Jeremy to look at impact of VLEN > 128:
  - QEMU currently supports up to 1024, the RVV standard permits up to 65536.
2024-05-08
- Jeremy to characterise QEMU floating point performance and file it as a performance regression issue in QEMU GitLab.
  - Low priority; deferred to prioritize the smoke test work.
- Jeremy to set up a "smoke test" SPEC CPU 2017 run using the integer and floating point specrand programs (should run in minutes).
  - COMPLETE
2024-05-01
- Paolo to review the generic issue from Palmer Dabbelt to identify ideas for optimization and benchmarks to reuse.
  - IN PROGRESS: we are working on running the reproducers to see the TCG ops generated by QEMU.
  - Deferred to prioritize benchmark work.
- Daniel to advise Paolo on best practice for preparing QEMU upstream submissions.
- The bionic benchmarks may be a useful source of small benchmarks.
The risk register is held in a shared spreadsheet. We will keep it updated continuously and report any changes each week.
There are no changes to the risk register this week.
No planned vacations for the rest of the month.