RISE RP005 QEMU weekly report 2024-05-15

Overview

Introducing the project team

  • Paolo Savini (Embecosm)
  • Helene Chelin (Embecosm)
  • Jeremy Bennett (Embecosm)
  • Hugh O'Keeffe (Ashling)
  • Nadim Shehayed (Ashling)
  • Daniel Barboza (Ventana)

Work completed since last report

  • WP1:
    • Get baseline scores for memory operation benchmarks.
      • The vector memcpy benchmark shows a steep increase in execution time as the size of the data grows. (See details below.)
    • Identify the most promising load/store vector instruction to optimize.
      • Deferred from last week to prioritize benchmark work.
    • Prepare routine/nightly runs of benchmarks.
      • Deferred from last week to prioritize benchmark work.
    • Jeremy to set up "smoke test" SPEC CPU 2017 run using the integer and floating point specrand programs (should run in minutes).
      • See details below.

Work planned for the coming week

  • WP1:
    • Identify the most promising load/store vector instruction to optimize.
    • Prepare routine/nightly runs of benchmarks.
  • WP2:
    • Identify optimizations in the TCG vector ld/st helper functions.

Current priorities

Our current set of agreed priorities is taken from the Statement of Work, which lists the following priorities, trading off the functionality targeted against the architectures supported.

  • vector load/store ops for x86_64 AVX
  • vector load/store ops for AArch64/Neon
  • vector integer ALU ops for x86_64 AVX
  • vector load/store ops for Intel AVX10

For each of these there will be an analysis phase and an optimization phase, leading to the following set of work packages.

  • WP0: Infrastructure
  • WP1: Analysis of vector load/store ops on x86_64 AVX
  • WP2: Optimization of vector load/store ops on x86_64 AVX
  • WP3: Analysis of vector load/store ops on AArch64/Neon
  • WP4: Optimization of vector load/store ops on AArch64/Neon
  • WP5: Analysis of integer ALU ops on x86_64 AVX
  • WP6: Optimization of integer ALU ops on x86_64 AVX
  • WP7: Analysis of vector load/store ops on Intel AVX10
  • WP8: Optimization of vector load/store ops on Intel AVX10

These priorities can be revised by agreement with RISE during the project.

Detailed description of work

WP1

Memory/string operations: comparing scalar and vector

We have both scalar and vector implementations of memcpy running with different sizes of the array being copied.

An array of unsigned 8-bit integers is filled with random values and is then copied to an empty array of equal size.

#include <stdint.h>
#include <stdlib.h>

uint8_t *src = (uint8_t *) malloc (len);
uint8_t *dst = (uint8_t *) malloc (len);
mem_init_random (src, len);              // fill the source with random bytes

for (size_t i = 0; i < WARMUP; i++)      // warm-up, excluded from the timing
  vmemcpy (dst, src, len);               // same for the scalar version (smemcpy)

for (size_t i = 0; i < iterations; i++)  // timed iterations
  vmemcpy (dst, src, len);               // same for the scalar version (smemcpy)

free (src);
free (dst);

The same program is used for the scalar measurements, but with vmemcpy replaced by the standard memcpy from newlib.

The tests are run first with 2 million iterations of the main loop and then with 1 million, and the two timings are subtracted, cancelling out the startup overhead and the warm-up time and leaving the cost of exactly 1 million iterations. There is some minor variability in the results, but it is not large enough to be statistically significant.
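
A minimal sketch of this subtraction scheme (our illustration only: in the real measurements the 2-million and 1-million iteration counts are separate whole-program runs under QEMU, timed externally, whereas here both runs share one process and plain memcpy stands in for vmemcpy/smemcpy):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>

static double
timed_run (uint8_t *dst, uint8_t *src, size_t len, size_t iters)
{
  struct timespec t0, t1;
  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (size_t i = 0; i < iters; i++)
    memcpy (dst, src, len);              // stand-in for vmemcpy/smemcpy
  clock_gettime (CLOCK_MONOTONIC, &t1);
  return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main (void)
{
  size_t len = 2048;
  uint8_t *src = malloc (len);
  uint8_t *dst = malloc (len);
  memset (src, 42, len);
  // Timing 2M and 1M iterations and subtracting cancels the fixed
  // startup and warm-up costs, leaving the cost of 1M copies.
  double t2m = timed_run (dst, src, len, 2000000);
  double t1m = timed_run (dst, src, len, 1000000);
  printf ("1M iterations of len %zu: %.3f s\n", len, t2m - t1m);
  free (src);
  free (dst);
  return 0;
}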

The newlib implementation is quite optimized for smaller block sizes. It is only with data sizes above 128 bytes that we see execution time increasing.

vmemcpy uses the reference assembly implementation from the RVV specification. Each vsetvli returns in t0 the number of elements (here bytes) to be processed in that iteration, at most VLMAX, so the loop handles arbitrary lengths, including any tail, automatically.

vmemcpy:
	mv	a3, a0			             // Copy destination
loop_cpy:
	vsetvli	t0, a2, e8, m8, ta, ma	 // Vectors of 8 regs
	vle8.v	v0, (a1)		         // Load bytes
	add	a1, a1, t0		             // Bump pointer
	sub	a2, a2, t0		             // Decrement count
	vse8.v	v0, (a3)		         // Store bytes
	add	a3, a3, t0		             // Bump pointer
	bnez	a2, loop_cpy		     // Any more?
	ret
	.size	vmemcpy, .-vmemcpy

All the programs are compiled with the GCC 14.1 release, without optimization (-O0).

The following table compares QEMU performance over 1,000,000 iterations for the scalar (s) and vector LMUL=8 (v8) implementations of memcpy. We show execution time in seconds, instruction count in millions (Micount) and nanoseconds per instruction executed.

length   s time   v8 time   s Micount   v8 Micount   s ns/inst   v8 ns/inst
     1     0.17      0.11        73.0         19.0        2.33         5.79
     2     0.23      0.15        89.0         19.0        2.58         7.89
     4     0.27      0.14       121.0         19.0        2.23         7.37
     8     0.23      0.33        95.0         19.0        2.42        17.37
    16     0.27      0.44       111.0         19.0        2.43        23.16
    32     0.32      0.93       143.0         19.0        2.24        48.95
    64     0.44      1.55       207.0         19.0        2.13        81.58
   128     0.64      3.03       293.0         19.0        2.18       159.47
   256     0.95      6.44       451.0         26.0        2.11       247.69
   512     1.59     11.87       767.0         40.0        2.07       296.75
  1024     2.81     22.54      1448.0         68.0        1.94       331.47
  2048     5.46     44.94      2810.0        124.0        1.94       362.42

For the scalar version, optimized for small blocks, the number of instructions is roughly the same until we get to 32/64 bytes; it then starts growing, becoming linear in the block size for larger blocks. As expected, the average time taken to execute an instruction is roughly the same throughout.

For the vector version, we see that the number of instructions is constant up to a block size of 128 bytes, the size at which the copy can be achieved in a single loop iteration (vector length of 128 bits and LMUL=8). However, the time taken per instruction grows, indicating that the time for the vle8.v and vse8.v instructions depends on the size of the data being loaded/stored.
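
To spell out the arithmetic: with SEW=8 (e8), each loop iteration copies up to VLMAX = (VLEN / SEW) × LMUL = (128 / 8) × 8 = 128 bytes, so any block of up to 128 bytes is copied in a single iteration.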

Memory/string operations: impact of LMUL

We also looked at the impact of LMUL=8. The following table shows the same results as above, but this time comparing vector implementations with LMUL=8 and LMUL=1 (changing m8 to m1 in the vsetvli instruction in the code above).

length  v8 time  v1 time  v8 Micount  v1 Micount  v8 ns/inst  v1 ns/inst
     1     0.11     0.11        19.0        19.0        5.79        5.79
     2     0.15     0.15        19.0        19.0        7.89        7.89
     4     0.14     0.22        19.0        19.0        7.37       11.58
     8     0.33     0.33        19.0        19.0       17.37       17.37
    16     0.44     0.47        19.0        19.0       23.16       24.74
    32     0.93     1.03        19.0        26.0       48.95       39.62
    64     1.55     2.03        19.0        40.0       81.58       50.75
   128     3.03     3.58        19.0        68.0      159.47       52.65
   256     6.44     7.05        26.0       124.0      247.69       56.85
   512    11.87    13.10        40.0       236.0      296.75       55.51
  1024    22.54    26.38        68.0       460.0      331.47       57.35
  2048    44.94    52.73       124.0       908.0      362.42       58.07

As expected, with LMUL=1 the number of instructions grows once the block size exceeds 16 bytes. However, there is little variation in overall QEMU execution time: while more instructions are needed, they execute faster. We can see that as the blocks get larger the ratio between the ns/inst figures approaches 8, suggesting that vle8.v and vse8.v execution time is proportional to LMUL.
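
As a cross-check, here is a minimal sketch (our illustration, not part of the project's benchmark suite) of a model for the per-call instruction count: the vmemcpy loop body above is 7 instructions per iteration, and the remaining fixed overhead of roughly 12 instructions per call is inferred from the tables rather than taken from the report.

#include <stdio.h>

int main (void)
{
  const int vlen = 128;                  // QEMU's default VLEN in bits
  const int lmuls[] = { 8, 1 };

  for (int i = 0; i < 2; i++)
    {
      // With SEW=8, each iteration copies VLEN / 8 * LMUL bytes.
      int vlmax = vlen / 8 * lmuls[i];
      for (int len = 1; len <= 2048; len *= 2)
        {
          int iters = (len + vlmax - 1) / vlmax;   // ceil (len / vlmax)
          // 7-instruction loop body plus ~12 instructions of overhead
          printf ("LMUL=%d len=%4d: %4d insts/call\n",
                  lmuls[i], len, 7 * iters + 12);
        }
    }
  return 0;
}

Multiplying the per-call counts by the 1,000,000 calls reproduces every entry in the Micount columns of both tables above, which supports reading the growth in ns/inst as the cost of the loads and stores themselves rather than of extra instructions.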

Smoke test: specrand programs

The four specrand programs are simple programs used primarily to validate the SPEC CPU 2017 scripts. As such they make a useful, near-instant smoke test.

The following tests were run using release versions of GCC (14.1), binutils (2.42) and Glibc (2.39). In both scalar and vector cases only scalar multilibs of Glibc were available. Vector was enabled by adding v to the architecture string (e.g. -march=rv64gcv rather than -march=rv64gc) and passing the -ftree-vectorize flag.

Benchmark        Type  Real S  Total S  Real V  Total V  Ratio
996.specrand_fs  fp      1.31     1.31    1.45     1.45   1.11
997.specrand_fr  fp      1.27     1.27    1.45     1.45   1.14
998.specrand_is  int     1.31     1.31    1.46     1.46   1.11
999.specrand_ir  int     1.27     1.27    1.47     1.47   1.16
Total                    5.16     5.16    5.83     5.83   1.13

All tests compiled and ran correctly.

Statistics

Memory/string operations

Memcpy

Built with GCC 14.1 without optimization (-O0) and executed for 1,000,000 iterations. Times are expressed in seconds. This benchmark uses the 8-bit loads and stores vle8.v/vse8.v with register groups of 8 (m8).

length   s time   v8 time   s Micount   v8 Micount   s ns/inst   v8 ns/inst
     1     0.17      0.11        73.0         19.0        2.33         5.79
     2     0.23      0.15        89.0         19.0        2.58         7.89
     4     0.27      0.14       121.0         19.0        2.23         7.37
     8     0.23      0.33        95.0         19.0        2.42        17.37
    16     0.27      0.44       111.0         19.0        2.43        23.16
    32     0.32      0.93       143.0         19.0        2.24        48.95
    64     0.44      1.55       207.0         19.0        2.13        81.58
   128     0.64      3.03       293.0         19.0        2.18       159.47
   256     0.95      6.44       451.0         26.0        2.11       247.69
   512     1.59     11.87       767.0         40.0        2.07       296.75
  1024     2.81     22.54      1448.0         68.0        1.94       331.47
  2048     5.46     44.94      2810.0        124.0        1.94       362.42

Individual RVV instruction performance

No data to report yet.

SPEC CPU 2017 performance

You can find the baseline execution time and instruction count of the SPEC CPU 2017 benchmarks here.

Quick test: just the "test" input datasets

A clean install of SPEC CPU and build of the benchmarks took around 20-30 minutes on one of our large AMD servers. Thereafter, the scalar run completed in 22 minutes and the vector run in 63 minutes. It is worth noting that a single benchmark (625.x264_s) took much of this time; without it the run times would have been 12 and 17 minutes respectively. Notwithstanding, these runs are highly suitable as a quick test under Jenkins.

The following tests were run using release versions of GCC (14.1), binutils (2.42) and Glibc (2.39). In both scalar and vector cases only scalar multilibs of Glibc were available. Vector was enabled by adding v to the architecture string.

Issues:

  • 7 tests failed either to run or to pass their post-run checks with both the scalar and vector builds (this seems to be a scripting issue, at least in some cases); and
  • at least one test (602.gcc_s) is suspiciously quick, despite not reporting any failures.

Benchmark        Type  Status  Real S  Compute S  Real V  Compute V  Ratio
600.perlbench_s  int   Maybe       70         16      72         19   1.18
602.gcc_s        int   Good         3          1       6          2   1.58
603.bwaves_s     fp    Good       253      1,783     296      2,050   1.15
605.mcf_s        int   Good       388        125     427        138   1.10
607.cactuBSSN_s  fp    Good       340      2,966     353      2,850   0.96
619.lbm_s        fp    Good       263      1,122     313      1,327   1.18
620.omnetpp_s    int   Good       340         92     397        117   1.27
621.wrf_s        fp    Maybe      628      9,192     741     12,191   1.33
623.xalancbmk_s  int   Good         9          3      19          6   1.63
625.x264_s       int   Maybe    1,361      1,009   3,827      3,487   3.46
627.cam4_s       fp    Maybe      192        473     348        807   1.71
628.pop2_s       fp    Good       211         49     428        127   2.58
631.deepsjeng_s  int   Good       610        261     736        359   1.37
638.imagick_s    fp    Maybe        2          1       3          2   1.67
641.leela_s      int   Good       418        128     548        209   1.63
644.nab_s        fp    Good       528      7,960     553      7,885   0.99
648.exchange2_s  int   Good       520        177     957        585   3.30
649.fotonik3d_s  fp    Bad          -          -       -          -      -
654.roms_s       fp    Maybe        2          2       2          3   1.44
657.xz_s         int   Good       705        186   1,024        267   1.44
996.specrand_fs  fp    Good         1          0       1          0   1.14
998.specrand_is  int   Good         1          0       1          0   1.15
Total                            6,842     25,548  11,050     32,434   1.45

  • Good. Compiled and passed post-run checks
  • Maybe. Compiled, but failed post-run checks
  • Bad. Failed to compile or failed to run

Actions

2024-05-15

  • Jeremy to look at the impact of masked vs unmasked and strided vs unit-stride accesses on vector operations.
  • Jeremy to look at the impact of VLEN > 128:
    • QEMU currently supports VLEN up to 1024 bits; the RVV standard permits up to 65536.

2024-05-08

  • Jeremy to characterise QEMU floating point performance and file it as a performance regression issue in QEMU GitLab.
    • low priority, deferred to prioritize the smoke tests work.
  • Jeremy to set up "smoke test" SPEC CPU 2017 run using the integer and floating point specrand programs (should run in minutes).
    • COMPLETE

2024-05-01

  • Paolo to review the generic issue from Palmer Dabbelt to identify ideas for optimization and benchmarks to reuse.
    • IN PROGRESS: we are working on running the reproducers to see the TCG ops generated by QEMU.
    • Deferred to prioritize benchmark work.
  • Daniel to advise Paolo on best practice for preparing QEMU upstream submissions.
  • The bionic benchmarks may be a useful source of small benchmarks.

Risk register

The risk register is held in a shared spreadsheet.

We will keep it updated continuously and report any changes each week.

There are no changes to the risk register this week.

Planned absences

No planned vacations for the rest of the month.