- Paolo Savini (Embecosm)
- Helene Chelin (Embecosm)
- Jeremy Bennett (Embecosm)
- Hugh O'Keeffe (Ashling)
- Nadim Shehayed (Ashling)
- Daniel Barboza (Ventana)
- WP1:
  - Get baseline scores for memory operation benchmarks.
    - The vector memcpy benchmark shows a steep increase in execution time with the size of the data (see details below).
  - Identify the most promising load/store vector instruction to optimize.
    - Deferred from last week to prioritize benchmark work.
  - Prepare routine/nightly runs of benchmarks.
    - Deferred from last week to prioritize benchmark work.
- Jeremy to set up a "smoke test" SPEC CPU 2017 run using the integer and floating point specrand programs (should run in minutes).
  - See details below.
- Get baseline scores for memory operation benchmarks.
- WP1:
  - Identify the most promising load/store vector instruction to optimize.
  - Prepare routine/nightly runs of benchmarks.
- WP2:
  - Identify optimizations in the TCG vector ld/st helper functions.
Our current set of agreed priorities is taken from the Statement of Work. It lists the following priorities, which trade off the functionality targeted against the architectures supported.
- vector load/store ops for x86_64 AVX
- vector load/store ops for AArch64/Neon
- vector integer ALU ops for x86_64 AVX
- vector load/store ops for Intel AVX10
For each of these there will be an analysis phase and an optimization phase, leading to the following set of work packages.
- WP0: Infrastructure
- WP1: Analysis of vector load/store ops on x86_64 AVX
- WP2: Optimization of vector load/store ops on x86_64 AVX
- WP3: Analysis of vector load/store ops on AArch64/Neon
- WP4: Optimization of vector load/store ops on AArch64/Neon
- WP5: Analysis of integer ALU ops on x86_64 AVX
- WP6: Optimization of integer ALU ops on x86_64 AVX
- WP7: Analysis of vector load/store ops on Intel AVX10
- WP8: Optimization of vector load/store ops on Intel AVX10
These priorities can be revised by agreement with RISE during the project.
We have both scalar and vector implementations of memcpy running with different sizes of the array being copied.
An array of unsigned 8-bit integers is filled with random values and is then copied to an empty array of equal size.
uint8_t *src = (uint8_t *) malloc (len);
uint8_t *dst = (uint8_t *) malloc (len);
mem_init_random (src, len);
for (size_t i = 0; i < WARMUP; i++)
  vmemcpy (dst, src, len); // same for the scalar version (smemcpy)
for (size_t i = 0; i < iterations; i++)
  vmemcpy (dst, src, len); // same for the scalar version (smemcpy)
free(src);
free(dst);
The same program is used for the scalar measurements, but with vmemcpy replaced by the standard memcpy from newlib.
The tests are run first with 2 million iterations of the main loop and then with 1 million, and the two timings are subtracted to cancel the overhead and the warm-up time. There is some minor variability in the results, but it is not large enough to be statistically significant.
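As a sketch of the arithmetic behind this (the exact measurement harness is not shown here): if T(N) is the wall-clock time of a complete QEMU run with N iterations of the timed loop, then T(2,000,000) - T(1,000,000) estimates the time spent in 1 million calls to the copy routine alone, since QEMU start-up, program initialization and the warm-up loop contribute equally to both runs.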
The newlib implementation is well optimized for smaller block sizes: it is only with data sizes above 128 bytes that we see execution time start to increase.
vmemcpy uses the reference assembler implementation from the RVV standard:
vmemcpy:
    mv a3, a0                      // Copy destination
loop_cpy:
    vsetvli t0, a2, e8, m8, ta, ma // Vectors of 8 regs
    vle8.v v0, (a1)                // Load bytes
    add a1, a1, t0                 // Bump pointer
    sub a2, a2, t0                 // Decrement count
    vse8.v v0, (a3)                // Store bytes
    add a3, a3, t0                 // Bump pointer
    bnez a2, loop_cpy              // Any more?
    ret
    .size vmemcpy, .-vmemcpy
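For readers more comfortable with C, a roughly equivalent version using the RVV v1.0 intrinsics is sketched below. This is not part of the benchmark sources; the function name vmemcpy_intrinsics is ours, and it assumes a toolchain that provides riscv_vector.h. It mirrors the vsetvli/vle8.v/vse8.v loop above with LMUL=8.

#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

// Sketch only: byte-wise copy using 8-register groups (LMUL=8),
// mirroring the assembly loop above.
void *vmemcpy_intrinsics (void *dst, const void *src, size_t len)
{
  uint8_t *d = dst;
  const uint8_t *s = src;
  while (len > 0)
    {
      size_t vl = __riscv_vsetvl_e8m8 (len);      // elements handled this trip
      vuint8m8_t v = __riscv_vle8_v_u8m8 (s, vl); // load bytes
      __riscv_vse8_v_u8m8 (d, v, vl);             // store bytes
      s += vl;
      d += vl;
      len -= vl;
    }
  return dst;
}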
All the programs are compiled with the released GCC version 14.1 and without optimization (-O0).
The following table compares QEMU performance over 1,000,000 iterations with both the vector (v8, LMUL=8) and scalar (s) implementations of memcpy. We show execution time in seconds, instruction count in millions (Micount) and nanoseconds per instruction executed.
length (bytes) | s time (s) | v8 time (s) | s Micount | v8 Micount | s ns/inst | v8 ns/inst |
---|---|---|---|---|---|---|
1 | 0.17 | 0.11 | 73.0 | 19.0 | 2.33 | 5.79 |
2 | 0.23 | 0.15 | 89.0 | 19.0 | 2.58 | 7.89 |
4 | 0.27 | 0.14 | 121.0 | 19.0 | 2.23 | 7.37 |
8 | 0.23 | 0.33 | 95.0 | 19.0 | 2.42 | 17.37 |
16 | 0.27 | 0.44 | 111.0 | 19.0 | 2.43 | 23.16 |
32 | 0.32 | 0.93 | 143.0 | 19.0 | 2.24 | 48.95 |
64 | 0.44 | 1.55 | 207.0 | 19.0 | 2.13 | 81.58 |
128 | 0.64 | 3.03 | 293.0 | 19.0 | 2.18 | 159.47 |
256 | 0.95 | 6.44 | 451.0 | 26.0 | 2.11 | 247.69 |
512 | 1.59 | 11.87 | 767.0 | 40.0 | 2.07 | 296.75 |
1024 | 2.81 | 22.54 | 1448.0 | 68.0 | 1.94 | 331.47 |
2048 | 5.46 | 44.94 | 2810.0 | 124.0 | 1.94 | 362.42 |
For the scalar version, which is optimized for small blocks, the number of instructions is roughly the same until we get to 32/64 bytes, after which it starts growing, becoming linear in the block size. As expected, the average time taken to execute an instruction is roughly the same throughout.
For the vector version, we see the number of instructions is constant up to block size 128, the size at which the copy can be achieved with a single loop iteration (vector length of 128 bits and LMUL=8). However the time taken per instruction grows, indicating that the time for the vle8.v and vse8.v instructions depends on the size of the data being loaded/stored.
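The instruction counts are consistent with a simple back-of-envelope model (our own check, not taken from the benchmark sources): each trip of the vmemcpy loop above executes 7 instructions and, with VLEN=128 and LMUL=8, moves up to 128 bytes, while the fixed overhead per call (including the -O0 benchmark loop) accounts for the 19M instructions observed at small sizes. A minimal sketch:

#include <stdio.h>

// Back-of-envelope model of the v8 Micount column. Assumptions: VLEN=128
// bits, LMUL=8, 7 instructions per vmemcpy loop trip, and a fixed overhead
// of 19 instructions per call (the value observed for lengths <= 128 bytes).
int main (void)
{
  const unsigned vlen_bytes = 128 / 8;                // 16 bytes per vector register
  const unsigned lmul = 8;
  const unsigned bytes_per_trip = vlen_bytes * lmul;  // 128 bytes per loop trip
  for (unsigned len = 1; len <= 2048; len *= 2)
    {
      unsigned trips = (len + bytes_per_trip - 1) / bytes_per_trip;
      unsigned minsts = 19 + 7 * (trips - 1);         // millions, for 1M calls
      printf ("len %4u: %3u loop trips, ~%3u M instructions\n",
              len, trips, minsts);
    }
  return 0;
}

For a 2048-byte copy this predicts 16 trips and 19 + 15 x 7 = 124 M instructions, matching the table.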
We also looked at the impact of LMUL=8. The following table shows the same results as above, but this time comparing vector implementations with LMUL=8 and LMUL=1 (changing the vsetvli instruction in the code above from m8 to m1).
length | v8 time | v1 time | v8 Micount | v1 Micount | v8 ns/inst | v1 ns/inst |
---|---|---|---|---|---|---|
1 | 0.11 | 0.11 | 19.0 | 19.0 | 5.79 | 5.79 |
2 | 0.15 | 0.15 | 19.0 | 19.0 | 7.89 | 7.89 |
4 | 0.14 | 0.22 | 19.0 | 19.0 | 7.37 | 11.58 |
8 | 0.33 | 0.33 | 19.0 | 19.0 | 17.37 | 17.37 |
16 | 0.44 | 0.47 | 19.0 | 19.0 | 23.16 | 24.74 |
32 | 0.93 | 1.03 | 19.0 | 26.0 | 48.95 | 39.62 |
64 | 1.55 | 2.03 | 19.0 | 40.0 | 81.58 | 50.75 |
128 | 3.03 | 3.58 | 19.0 | 68.0 | 159.47 | 52.65 |
256 | 6.44 | 7.05 | 26.0 | 124.0 | 247.69 | 56.85 |
512 | 11.87 | 13.10 | 40.0 | 236.0 | 296.75 | 55.51 |
1024 | 22.54 | 26.38 | 68.0 | 460.0 | 331.47 | 57.35 |
2048 | 44.94 | 52.73 | 124.0 | 908.0 | 362.42 | 58.07 |
As expected, with LMUL=1 the number of instructions grows once the block size exceeds 16 bytes. However there is little variation in overall QEMU execution time: while more instructions are needed, they execute faster. As the blocks get larger, the ratio between the v8 and v1 ns/inst values approaches 8, suggesting that vle8.v and vse8.v instruction execution time is proportional to LMUL.
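As a cross-check, the same back-of-envelope model as above (our own assumption, not from the benchmark sources) also fits the LMUL=1 column: each trip now moves at most 16 bytes, so a 2048-byte copy needs 128 trips and roughly 19 + 127 x 7 = 908 M instructions, matching the v1 Micount. The work is therefore spread over 8 times as many loop trips, each handling one eighth of the data.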
The 4 SPECrand programs are simple programs used primarily to validate the SPEC CPU 2017 scripts. As such, they are a useful, near-instant smoke test.
The following tests were run using release versions of GCC (14.1), binutils (2.42) and Glibc (2.39). In both scalar and vector cases only scalar multilibs of Glibc were available. Vector was enabled by adding v to the architecture string and adding the -ftree-vectorize flag.
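As an illustration of what this enables (our own example, not taken from the SPEC sources, and the compile command is only indicative): with the v extension in the architecture string and -ftree-vectorize, GCC 14 can auto-vectorize simple loops such as the one below using RVV instructions.

// Hypothetical example; built with something like
//   riscv64-unknown-linux-gnu-gcc -O2 -march=rv64gcv -ftree-vectorize
// so that the loop is vectorized with RVV loads, stores and adds.
#include <stddef.h>

void add_arrays (int *restrict c, const int *restrict a,
                 const int *restrict b, size_t n)
{
  for (size_t i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}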
Benchmark | Type | Real S | Total S | Real V | Total V | Ratio |
---|---|---|---|---|---|---|
996.specrand_fs | fp | 1.31 | 1.31 | 1.45 | 1.45 | 1.11 |
997.specrand_fr | fp | 1.27 | 1.27 | 1.45 | 1.45 | 1.14 |
998.specrand_is | int | 1.31 | 1.31 | 1.46 | 1.46 | 1.11 |
999.specrand_ir | int | 1.27 | 1.27 | 1.47 | 1.47 | 1.16 |
Total | | 5.16 | 5.16 | 5.83 | 5.83 | 1.13 |
All tests compiled and ran correctly.
Built with GCC 14.1 without optimization (-O0) and executed for 1,000,000 iterations. The results are expressed in seconds. This benchmark uses 8-bit loads and stores (vle8.v/vse8.v) with register groups of 8 (m8).
length (bytes) | s time (s) | v8 time (s) | s Micount | v8 Micount | s ns/inst | v8 ns/inst |
---|---|---|---|---|---|---|
1 | 0.17 | 0.11 | 73.0 | 19.0 | 2.33 | 5.79 |
2 | 0.23 | 0.15 | 89.0 | 19.0 | 2.58 | 7.89 |
4 | 0.27 | 0.14 | 121.0 | 19.0 | 2.23 | 7.37 |
8 | 0.23 | 0.33 | 95.0 | 19.0 | 2.42 | 17.37 |
16 | 0.27 | 0.44 | 111.0 | 19.0 | 2.43 | 23.16 |
32 | 0.32 | 0.93 | 143.0 | 19.0 | 2.24 | 48.95 |
64 | 0.44 | 1.55 | 207.0 | 19.0 | 2.13 | 81.58 |
128 | 0.64 | 3.03 | 293.0 | 19.0 | 2.18 | 159.47 |
256 | 0.95 | 6.44 | 451.0 | 26.0 | 2.11 | 247.69 |
512 | 1.59 | 11.87 | 767.0 | 40.0 | 2.07 | 296.75 |
1024 | 2.81 | 22.54 | 1448.0 | 68.0 | 1.94 | 331.47 |
2048 | 5.46 | 44.94 | 2810.0 | 124.0 | 1.94 | 362.42 |
No data to report yet.
You can find the baseline execution time and instruction count of the SPEC CPU 2017 benchmarks here.
Making a clean install of SPEC CPU and building the benchmarks took around 20-30 minutes on one of our large AMD servers. Thereafter, the scalar run completed in 22 minutes and the vector run in 63 minutes. It is worth noting that a single benchmark (625.x264_s) took a long time to complete; without it the run times would have been 12 minutes and 17 minutes respectively. Notwithstanding this, these runs are highly suitable as a quick test under Jenkins.
The following tests were run using release versions of GCC (14.1), binutils (2.42) and Glibc (2.39). In both scalar and vector cases only scalar multilibs of Glibc were available. Vector was enabled by adding v to the architecture string.
Issues:
- 7 tests failed either to run or failed their post-run checks with both the scalar and vector builds (this seems to be a scripting issue, at least in some cases); and
- at least one test (602.gcc_s) is suspiciously quick, although it does not report any failures.
Benchmark | Type | Status | Real S | Compute S | Real V | Compute V | Ratio |
---|---|---|---|---|---|---|---|
600.perlbench_s | int | Maybe | 70 | 16 | 72 | 19 | 1.18 |
602.gcc_s | int | Good | 3 | 1 | 6 | 2 | 1.58 |
603.bwaves_s | fp | Good | 253 | 1,783 | 296 | 2,050 | 1.15 |
605.mcf_s | int | Good | 388 | 125 | 427 | 138 | 1.10 |
607.cactuBSSN_s | fp | Good | 340 | 2,966 | 353 | 2,850 | 0.96 |
619.lbm_s | fp | Good | 263 | 1,122 | 313 | 1,327 | 1.18 |
620.omnetpp_s | int | Good | 340 | 92 | 397 | 117 | 1.27 |
621.wrf_s | fp | Maybe | 628 | 9,192 | 741 | 12,191 | 1.33 |
623.xalancbmk_s | int | Good | 9 | 3 | 19 | 6 | 1.63 |
625.x264_s | int | Maybe | 1,361 | 1,009 | 3,827 | 3,487 | 3.46 |
627.cam4_s | fp | Maybe | 192 | 473 | 348 | 807 | 1.71 |
628.pop2_s | fp | Good | 211 | 49 | 428 | 127 | 2.58 |
631.deepsjeng_s | int | Good | 610 | 261 | 736 | 359 | 1.37 |
638.imagick_s | fp | Maybe | 2 | 1 | 3 | 2 | 1.67 |
641.leela_s | int | Good | 418 | 128 | 548 | 209 | 1.63 |
644.nab_s | fp | Good | 528 | 7,960 | 553 | 7,885 | 0.99 |
648.exchange2_s | int | Good | 520 | 177 | 957 | 585 | 3.30 |
649.fotonik3d_s | fp | Bad | - | - | - | - | - |
654.roms_s | fp | Maybe | 2 | 2 | 2 | 3 | 1.44 |
657.xz_s | int | Good | 705 | 186 | 1,024 | 267 | 1.44 |
996.specrand_fs | fp | Good | 1 | 0 | 1 | 0 | 1.14 |
998.specrand_is | int | Good | 1 | 0 | 1 | 0 | 1.15 |
Total | | | 6,842 | 25,548 | 11,050 | 32,434 | 1.45 |
Key to the Status column:
- Good: compiled and passed post-run checks.
- Maybe: compiled, but failed post-run checks.
- Bad: failed to compile or failed to run.
2024-05-15
- Jeremy to look at impact of masked vs unmasked and strided vs unstrided vector operations.
- Jeremy to look at impact of VLEN > 128:
  - QEMU currently supports up to 1024, the RVV standard permits up to 65536.
2024-05-08
- Jeremy to characterise QEMU floating point performance and file it as a performance regression issue in QEMU GitLab.
  - Low priority; deferred to prioritize the smoke test work.
- Jeremy to set up a "smoke test" SPEC CPU 2017 run using the integer and floating point specrand programs (should run in minutes).
  - COMPLETE
2024-05-01
- Paolo to review the generic issue from Palmer Dabbelt to identify ideas for optimization and benchmarks to reuse.
  - IN PROGRESS: we are working on running the reproducers to see the TCG ops generated by QEMU.
  - Deferred to prioritize benchmark work.
- Daniel to advise Paolo on best practice for preparing QEMU upstream submissions.
- The bionic benchmarks may be a useful source of small benchmarks.
The risk register is held in a shared spreadsheet. We will keep it updated continuously and report any changes each week.
There are no changes to the risk register this week.
No planned vacations for the rest of the month.