Cleanup CUDA, Reuse Memory, Add Serial Model, Cleanup Std Parallelism #202

Open · wants to merge 14 commits into develop
Conversation

@gonzalobg commented on Jun 3, 2024

Cleanup CUDA

  • Refactor all kernels into a generic "parallel for" algorithm that supports grid-stride and block-stride loops, configurable with a model flag (see the first sketch after this list).
  • Use the occupancy APIs to portably handle devices of all sizes.
  • Refactor the CUDA memory allocation APIs.
  • Print more GPU details, in particular the theoretical peak bandwidth in GB/s of the current device, using the NVML library, which is part of the CUDA Toolkit and always available (see the second sketch after this list).
  • Fix two bugs:
    • Print the "order" used to run the benchmarks (e.g. classic vs. isolated).
    • Fix a division-by-zero bug in the solution checking.
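
The PR text doesn't include the new kernel itself; a minimal sketch of the grid-stride + occupancy pattern it describes could look like the following (`parallel_for` and `parallel_for_kernel` are hypothetical names; a block-stride variant would instead give each block a contiguous chunk and stride threads within it):

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Grid-stride loop: each thread strides over the whole iteration space, so
// one launch configuration is correct for any problem size.
template <typename F>
__global__ void parallel_for_kernel(std::size_t n, F f) {
  for (std::size_t i = blockIdx.x * std::size_t(blockDim.x) + threadIdx.x;
       i < n; i += std::size_t(gridDim.x) * blockDim.x)
    f(i);
}

// Let the occupancy API choose a launch configuration that saturates the
// current device instead of hard-coding sizes for one particular GPU.
template <typename F>
void parallel_for(std::size_t n, F f) {
  int min_grid = 0, block = 0;
  cudaOccupancyMaxPotentialBlockSize(&min_grid, &block, parallel_for_kernel<F>);
  parallel_for_kernel<<<min_grid, block>>>(n, f);
}
```

Device lambdas passed to a helper like this need nvcc's --extended-lambda flag.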
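Likewise, a sketch of deriving the theoretical peak bandwidth from NVML plus the runtime API; the function name and the DDR factor of 2 are assumptions, and the binary must link against NVML (-lnvidia-ml):

```cpp
#include <cuda_runtime.h>
#include <nvml.h>

// Theoretical peak bandwidth = 2 (DDR) * max memory clock * bus width.
double peak_bandwidth_gbs(int device) {
  nvmlInit();
  nvmlDevice_t dev;
  nvmlDeviceGetHandleByIndex(static_cast<unsigned int>(device), &dev);
  unsigned int mem_clock_mhz = 0;
  nvmlDeviceGetMaxClockInfo(dev, NVML_CLOCK_MEM, &mem_clock_mhz);  // MHz
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, device);  // memoryBusWidth is in bits
  nvmlShutdown();
  return 2.0 * mem_clock_mhz * 1e6 * (prop.memoryBusWidth / 8.0) / 1e9;
}
```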

Add Serial

By @tom91136. A good thing to have when comparing against other parallel programming models, mostly for syntax.
This also makes us consistent with CloverLeaf, TeaLeaf, and miniBUDE.

Reuse Memory

This PR puts benchmarks in control of allocating the host
memory used for verifying the results.

This enables benchmarks that use Unified Memory for the device
allocations to skip the host-side allocation and instead pass
pointers to the device allocation to the benchmark driver; the sketch
below illustrates the idea.
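
A hedged sketch of what that interface shape could look like; `Stream`, `alloc_host`, and `ManagedStream` are illustrative names, not the PR's actual API:

```cpp
#include <cstddef>

// The benchmark implementation owns the host allocation used for
// verification, instead of the driver always allocating separate host arrays.
template <typename T>
struct Stream {
  virtual ~Stream() = default;
  // Called by the driver to obtain a buffer it reads results from.
  virtual T* alloc_host(std::size_t n) = 0;
  virtual void free_host(T* p) = 0;
};

// A Unified Memory backend can hand the driver its managed pointer directly:
// no separate host allocation, no device-to-host copy.
template <typename T>
struct ManagedStream final : Stream<T> {
  T* d_a = nullptr;  // allocated with cudaMallocManaged elsewhere
  T* alloc_host(std::size_t) override { return d_a; }
  void free_host(T*) override {}  // nothing to do; the stream still owns d_a
};
```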

Closes #128.

Cleanup C++ Standard Parallelism

Merge the 3 implementations into one, with flags selecting between the data (C++17), data (C++23), and indices variants (see the sketch below).
Annotate workarounds with a #define WORKAROUND and print a message when the current implementation is not conforming.
Add support for AdaptiveCpp (CI not added yet; it will be done later as part of removing hipSYCL).
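
A sketch, not the PR's actual code, of how a single source file can host the data and indices variants behind preprocessor flags (`triad` and the flag names are illustrative; a "data C++23" variant would zip the ranges with std::views::zip):

```cpp
#include <algorithm>
#include <cstddef>
#include <execution>
#include <numeric>
#include <vector>

void triad(std::vector<double>& a, const std::vector<double>& b,
           const std::vector<double>& c, double scalar) {
#ifdef INDICES
  // Indices variant: traverse an index range; raw pointers keep the lambda
  // trivially copyable, which matters for GPU offload (e.g. nvc++ -stdpar=gpu).
  std::vector<std::size_t> idx(a.size());
  std::iota(idx.begin(), idx.end(), std::size_t{0});
  double* pa = a.data();
  const double *pb = b.data(), *pc = c.data();
  std::for_each(std::execution::par_unseq, idx.begin(), idx.end(),
                [=](std::size_t i) { pa[i] = pb[i] + scalar * pc[i]; });
#else
  // Data (C++17) variant: traverse the arrays themselves.
  std::transform(std::execution::par_unseq, b.begin(), b.end(), c.begin(),
                 a.begin(),
                 [=](double bi, double ci) { return bi + scalar * ci; });
#endif
}
```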

@gonzalobg force-pushed the reuse_memory branch 7 times, most recently from 2b9129e to 6c83420 on June 4, 2024 at 16:31
@gonzalobg mentioned this pull request on Jun 5, 2024
@gonzalobg changed the title from "Reuse memory" to "Cleanup CUDA, Reuse Memory, Add Serial Model" on Jun 5, 2024
@gonzalobg changed the title from "Cleanup CUDA, Reuse Memory, Add Serial Model" to "Cleanup CUDA, Reuse Memory, Add Serial Model, Cleanup Std Parallelism" on Jun 5, 2024
```cpp
#ifdef INDICES
// NVHPC workaround: TODO: remove this eventually
#if defined(__NVCOMPILER) && defined(_NVHPC_STDPAR_GPU)
#define WORKAROUND
```
@gonzalobg (author) commented:
Have a pragma message to print that workarounds are enabled.
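
For example (a sketch; the exact wording and placement are up to the author):

```cpp
#if defined(__NVCOMPILER) && defined(_NVHPC_STDPAR_GPU)
  #define WORKAROUND
  #pragma message("stdpar: NVHPC workaround enabled; implementation may be non-conforming")
#endif
```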

```cpp
#else

// auto exe_policy = dpl::execution::seq;
// auto exe_policy = dpl::execution::par;
static constexpr auto exe_policy = dpl::execution::par_unseq;
#define USE_STD_PTR_ALLOC_DEALLOC
#define WORKAROUND
```
@gonzalobg (author) commented:
Add a pragma message to highlight that there is a workaround (as in the sketch above).

```diff
@@ -1,5 +1,5 @@
-// Copyright (c) 2015-23 Tom Deakin, Simon McIntosh-Smith, and Tom Lin
+// Copyright (c) 2015-16 Tom Deakin, Simon McIntosh-Smith,
```
@gonzalobg (author) commented:
Undo this change
