Releases: LLNL/RAJAPerf
v2024.07.0
This release contains new features, bug fixes, and build improvements.
Please download the RAJAPerf-v2024.07.0.tar.gz file below. The others will not work due to the way RAJAPerf uses git submodules.
- New features and usage changes:
- Added MATVEC_3D_STENCIL kernel to Apps group
- Added MULTI_REDUCE kernel to Basic group. Multi-reduce is a new capability in RAJA.
- Added HISTOGRAM kernel to Algorithm group. This kernel tests the RAJA multi-reduce capability and has algorithm options involving atomic operations, such as the ability to assess various degrees of atomic contention.
- Added many SYCL kernel variants (note -- all RAJA SYCL variant kernels with reductions use the new RAJA reduction interface):
- Basic group: ARRAY_OF_PTRS, COPY8, DAXPY, IF_QUAD, INIT3, INIT_VIEW1D, INIT_VIEW1D_OFFSET, MAT_MAT_SHARED, MULADDSUB, NESTED_INIT, REDUCE3_INT, TRAP_INT
- Lcals group: DIFF_PREDICT, EOS, FIRST_DIFF, FIRST_MIN, FIRST_SUM, GEN_LIN_RECUR, HYDRO_1D, HYDRO_2D, INT_PREDICT, PLANCKIAN, TRIDIAG_ELIM
- Polybench group: POLYBENCH_2MM, POLYBENCH_3MM, POLYBENCH_ADI, POLYBENCH_ATAX, POLYBENCH_FDTD_2D, POLYBENCH_FLOYD_WARSHALL, POLYBENCH_GEMM, POLYBENCH_GEMVER, POLYBENCH_GESUMMV, POLYBENCH_HEAT_3D, POLYBENCH_JACOBI_1D, POLYBENCH_JACOBI_2D, POLYBENCH_MVT
- Stream group: ADD, COPY, DOT, MUL, TRIAD
- Apps group: CONVECTION3DPA, DEL_DOT_VEC_2D, DIFFUSION3DPA, EDGE3D, ENERGY, FIR, LTIMES, LTIMES_NOVIEW, MASS3DEA, MASS3DPA, MATVEC_3D_STENCIL, PRESSURE, VOL3D, ZONAL_ACCUMULATION_3D
- Algorithm group: REDUCE_SUM
- Add new kernel group Comm, which now contains all HALO* kernels.
- Add occupancy calculator grid stride (occgs_<block_size>) tuning for CUDA and HIP variants of kernels with reductions. This generally improves performance at problem sizes greater than the maximum occupancy of a device because the amount of work needed to finalize a reduction is proportional to the number of blocks. The tuning launches fewer total threads than iterates and uses a grid stride loop to assign multiple iterates to each thread; the maximum number of blocks is determined by the occupancy calculator to maximize occupancy (see the sketch after this list).
- Add reduction "tunings" for RAJA variants of kernels with reductions to compare performance of RAJA's default reduction interface and RAJA's new (experimental) reduction interface.
- Change to use pinned memory for HIP variants of kernels with reductions and "fused" kernels. This improves performance as pinned memory can be cached on a HIP device.
- Add additional CUDA memory space options for kernel data to compare performance, specifically CudaManagedHostPreferred, CudaManagedDevicePreferred, CudaManagedHostPreferredDeviceAccessed, and CudaManagedDevicePreferredHostAccessed (see the output of the --help option for more information).
- Make performance comparisons of kernels with reductions more fair by adding a RAJA GPU block atomic (blkatm) tuning that more closely matches the base GPU kernel variant implementations. Note there is currently false sharing/extra atomic contention when there are multiple reductions in kernels run with the blkatm tunings; this is not addressed yet.
- Apply new RAJA "dispatch" policies in Comm_HALOEXCHANGE_FUSED kernel.
- Add real multi-rank MPI implementations of the Comm_HALOEXCHANGE and Comm_HALOEXCHANGE_FUSED kernels.
- Kernels that had problematic implementations (causing correctness issues) have all been fixed. Earlier these kernels were disabled by default and there was a command line option to run them. That option has been removed and all previously-problematic kernels are enabled by default.
- Added new ATOMIC kernel and various options to check atomic performance related to the contention level.
- Generalize and add manual tunings to scan kernels.
- Split "bytes per rep" counters into bytes read, bytes written, and bytes atomic modify-write. The goal is to better understand performance on hardware where the total bandwidth is not the same as the sum of the read and write bandwidth.
- Added command line options to support the selection of more kernel-specific execution settings. Give the -h or --help command line option for usage information.
- Added a command line option to specify a set of "warmup" kernels to run, overriding the default behavior of running a set of warmup kernels based on the features used in the kernels selected to run.
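To illustrate the occupancy calculator grid stride (occgs) tuning mentioned above, here is a minimal CUDA sketch. It is not RAJAPerf source: the kernel name occgs_sum, the launcher, and the fixed block size of 256 are illustrative assumptions, and the only points it demonstrates are the grid stride loop and capping the grid size with the occupancy calculator.

```cpp
#include <cuda_runtime.h>
#include <algorithm>

// Sketch only (not RAJAPerf code): a grid-stride sum reduction whose grid size
// is capped by the occupancy calculator, so the number of blocks (and thus the
// work needed to finalize the reduction) stays bounded at large problem sizes.
constexpr size_t block_size = 256;

__global__ void occgs_sum(const double* x, double* block_sums, size_t len)
{
  __shared__ double shared[block_size];
  double val = 0.0;
  // Grid-stride loop: each thread accumulates multiple iterates when the
  // total number of threads launched is smaller than len.
  for (size_t i = size_t(blockIdx.x) * blockDim.x + threadIdx.x; i < len;
       i += size_t(gridDim.x) * blockDim.x) {
    val += x[i];
  }
  shared[threadIdx.x] = val;
  __syncthreads();
  // In-block tree reduction; per-block results are finalized later.
  for (unsigned s = block_size / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) { shared[threadIdx.x] += shared[threadIdx.x + s]; }
    __syncthreads();
  }
  if (threadIdx.x == 0) { block_sums[blockIdx.x] = shared[0]; }
}

void launch_occgs_sum(const double* x, double* block_sums, size_t len)
{
  int dev = 0, sm_count = 0, blocks_per_sm = 0;
  cudaGetDevice(&dev);
  cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, dev);
  // Ask the occupancy calculator how many resident blocks of this kernel fit per SM.
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, occgs_sum,
                                                block_size, 0);
  size_t max_blocks = size_t(sm_count) * blocks_per_sm;
  size_t needed     = (len + block_size - 1) / block_size;
  size_t grid_size  = std::min(max_blocks, needed);   // fewer blocks than iterates
  occgs_sum<<<grid_size, block_size>>>(x, block_sums, len);
  // block_sums (grid_size entries) is then reduced on the host or in a second pass.
}
```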
- Build changes / improvements:
- The RAJA submodule has been updated to v2024.07.0.
- The BLT submodule has been updated to v0.6.2, which is the version used by the RAJA submodule version.
- Improvements to the Caliper instrumentation to include per-kernel RAJA features in the performance metrics.
- Fixed issues with install process of shared library builds.
- Some changes to default "tuning" names that are reported so they are more consistent across different RAJA back-ends.
- Many CI testing improvements and updates, including added CI testing for the case when MPI is enabled to test the Comm_* kernels in a manner that more closely resembles how real application codes run.
- Bug fixes / improvements:
- Make Basic_INDEXLIST_3LOOP kernel implementations consistent. That is, the RAJA variants were changed to read the last member of the counts array instead of using a reducer.
- Change the internal size type for arrays to allow running benchmarks at much larger problem sizes.
- Fix issue that caused the Basic_INDEXLIST kernel to hang occasionally.
- A variety of fixes and cleanups in the LC build scripts (of interest to users with access to LC machines).
- Fix an issue where a command line option requesting information only would run the Suite when it shouldn't.
- Make memory use of the base GPU variants of kernels with reductions more consistent with the RAJA reduction implementation. These variants were using memory poorly. They now use device-based memory that is host accessible to avoid making two cuda/hipMemcpy calls. This significantly reduces host-side overheads and improves performance of base GPU reduction kernels when run at smaller problem sizes (see the sketch after this list).
- Fix compilation issues with the OpenMP target offload variants of the Basic_COPY8 kernel.
- Fix issue with Lcals_FIRST_MIN GPU kernel reduction implementation.
- Convert all non-RAJA base and lambda GPU kernel variants so that all GPU kernel variants use the same kernel launch methods that RAJA uses internally. Also added are compile-time checks of the number of kernel arguments and their types so that calls to launch methods always match the kernel definitions.
- Fix the Base_HIP variant of the INDEXLIST kernel, which would occasionally deadlock.
- Made internal memory usage (allocation, initialization, deallocation) for SYCL kernel variants consistent with all other variants.
- Fixed Sphinx theme in Read The Docs documentation.
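As a rough illustration of the host-accessible reduction memory change above, here is a minimal CUDA sketch (hipHostMalloc plays the analogous role for HIP builds). It is not RAJAPerf code; the kernel and variable names are illustrative, and a real reduction kernel would combine values within a block before touching the global tally.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Sketch only: keep the reduction tally in pinned, host-accessible memory so
// the host can read the result after a synchronize without any explicit
// cudaMemcpy. (Double-precision atomicAdd requires compute capability 6.0+.)
__global__ void sum_into(const double* x, size_t len, double* tally)
{
  size_t i = size_t(blockIdx.x) * blockDim.x + threadIdx.x;
  if (i < len) { atomicAdd(tally, x[i]); }  // illustrative; real kernels reduce per block first
}

int main()
{
  size_t len = 1 << 20;
  double* x = nullptr;
  cudaMalloc((void**)&x, len * sizeof(double));
  cudaMemset(x, 0, len * sizeof(double));

  double* tally = nullptr;
  cudaHostAlloc((void**)&tally, sizeof(double), cudaHostAllocMapped);  // pinned + device visible
  *tally = 0.0;

  sum_into<<<(len + 255) / 256, 256>>>(x, len, tally);
  cudaDeviceSynchronize();                  // host reads the result directly afterwards
  std::printf("sum = %f\n", *tally);

  cudaFreeHost(tally);
  cudaFree(x);
  return 0;
}
```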
v2023.06.0
This release contains new features, bug fixes, and build improvements.
Please download the RAJAPerf-v2023.06.0.tar.gz file below. The others will not work due to the way RAJAPerf uses git submodules.
- New features and usage changes:
- User and developer documentation formerly in top-level README.md file (parsed to appear on GitHub project home page) has been expanded and moved to Sphinx documentation hosted on ReadTheDocs.
- Caliper integration and annotations have been added in addition to Adiak metadata fields. Basic documentation is available in the User Guide. More documentation along with a tutorial will be available in the future.
- Execution policies for many RAJA variants of GPU kernels were changed to take advantage of recent performance improvements in RAJA where we make better use of compile-time knowledge of block sizes. This brings RAJA variants into closer alignment with GPU base variants.
- The feature called 'Teams' has been changed to 'Launch' to be consistent with the RAJA feature.
- A runtime option was added to change the memory space used for kernel data allocations. This allows us to compare performance using different memory spaces. Please see the user documentation or the '-h' help option output for details.
- Warmup kernels were restructured so that only those relevant to kernels selected to run will be run.
- New kernels have been added:
* Basic_COPY8, which allows us to explore what bandwidth looks like with more memory accesses per iteration
* Apps_MASS3DEA, which represents local element matrix assembly operations in finite element applications
* Apps_ZONAL_ACCUMULATION_3D, which has the same data access patterns as Apps_NODAL_ACCUMULATION_3D, but without the need for atomics
* Basic_ARRAY_OF_PTRS, which involves a use case where a kernel captures an array and uses a runtime-sized portion of it. This pattern exhibits different performance behavior for CUDA vs. HIP (see the sketch after this list).
* Apps_EDGE3D, which computes the summed mass + stiffness matrix of low order edge bases (relevant to MHD discretizations)
- Added new command line options:
* '--align' which allows one to change the alignment of host memory allocations.
* '--disable_warmup' which allows one to turn off warmup kernels if desired.
* '--tunings' or '-t', which allows a user to specify which blocksize tunings to run for GPU kernel variants. Please see the '-h' help output for more information.
* '--gpu_stream_0', which allows a user to switch between GPU stream zero and the RAJA default stream.
- Also, the command line option '--help' or '-h' output was reorganized and improved for readability and clarity.
- All 'loop_exec' RAJA execution policy usage has been replaced with the RAJA 'seq_exec' policy. The 'loop_exec' policy in RAJA is now deprecated and will be removed in the next RAJA (no-patch) release.
- An environment variable 'RAJA_PERFSUITE_UNIT_TEST' has been added that allows one to select a single kernel to run via an alternative mechanism to the command line.
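The capture pattern exercised by Basic_ARRAY_OF_PTRS can be sketched in a few lines of plain C++. The names below are illustrative and the forall stand-in replaces a real RAJA/GPU launch: a fixed-size array of pointers is captured by value, but only a runtime-determined prefix of it is used in the loop body.

```cpp
#include <cstddef>

// Sketch only (illustrative names): a lambda captures a fixed-size array of
// pointers by value, but only the first n entries are used at runtime.
constexpr int max_ptrs = 26;

struct PtrArray {
  double* ptrs[max_ptrs];  // the whole array is captured...
  int     n;               // ...but only a runtime-sized portion is dereferenced
};

template <typename Body>
void forall(std::size_t len, Body&& body)   // stand-in for a RAJA/GPU kernel launch
{
  for (std::size_t i = 0; i < len; ++i) { body(i); }
}

void array_of_ptrs(double* y, PtrArray a, std::size_t len)
{
  forall(len, [=](std::size_t i) {
    double sum = 0.0;
    for (int p = 0; p < a.n; ++p) { sum += a.ptrs[p][i]; }
    y[i] = sum;
  });
}
```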
- Build changes / improvements:
- The RAJA submodule has been updated to v2023.06.1.
- The BLT submodule has been updated to v0.5.3, which is the version used by the RAJA submodule version.
- Moved RAJA Perf Spack package to RADIUSS Spack Configs project where it will be curated and upstreamed to Spack like packages for other RAJA-related projects.
- For various reasons the Apps_COUPLE kernel has been removed from the default build since it was incomplete and needed lightweight device side support for complex arithmetic. It may be resurrected at some point and re-added to the Suite.
- Bug fixes / improvements:
- Fix issue related to improper initialization of reduction variable in OpenMP variants of Lcals_FIRST_MIN kernel. Interestingly, the issue only appeared at the larger core counts possible on newer multi-core architectures.
- Fix issue in Lcals_FIRST_MIN kernel where base CUDA and HIP variants were using an output array before it was initialized.
v2022.10.0
This release contains new features, bug fixes, and build improvements.
Please download the RAJAPerf-v2022.10.0.tar.gz file below. The others will not work due to the way RAJAPerf uses git submodules.
Notable changes include:
- Release version name change:
- Following the naming scheme for coordinated RAJA Portability Suite releases, this release of the RAJA Performance Suite is v2022.10.0 to indicate that it corresponds to the v2022.10.x releases of RAJA and camp.
- We've been doing coordinated releases of RAJA Portability Suite projects (RAJA, Umpire, CHAI, and camp) for a while, and we changed the version naming scheme for those projects to reflect that. For example, the version number for the last release of these projects is v2022.10.x, meaning the release occurred in October 2022. The intent is that the v2022.10.x project releases are consistent in terms of their dependencies and they are tested together. The 'x' patch version number is applied to each project independently if a bugfix or other patch is needed. Any combination of v2022.10.x versioned libraries should be compatible.
- New features and usage changes:
- Add CONVECTION3DPA finite element kernel.
- Add basic memory operation kernels MEMSET and MEMCPY
- Build changes / improvements:
- Improved CI testing, including using test infrastructure in RAJA (eliminate redundancies).
- Fix 'make install' so that executable is installed as well.
- Update all submodules to be consistent with RAJA v2022.10.4 release, including that version of RAJA.
- Bug fixes / improvements:
- Fix race condition in FIRST_MIN kernel (Thanks C. Robeck from AMD).
- Fix broken OpenMP target variant of REDUCE_STRUCT kernel.
- Fix MPI hang when rank zero does not enter a barrier if no path name is given for creating directories.
- Support long double with MPI all reduce even when MPI implementation does not support long double.
- Fix message printing to be rank zero only.
v0.12.0
This release contains new features, bug fixes, and build improvements. Please see the RAJA user guide for more information about items in this release.
Please download the RAJAPerf-v0.12.0.tar.gz file below. The others will not work due to the way RAJAPerf uses git submodules.
Notable changes include:
- New features / API changes:
- Add command line options to exclude individual kernels and/or variants, and kernels using specified RAJA features. Please use '-h' option to see available options and what they do.
- Add command line option to output min, max, and/or average of kernel timing data over number of passes through the suite. Please use '-h' option to see available options and what they do.
- Added basic MPI support, which enables the code to run on multiple MPI ranks simultaneously. This makes analysis of node performance more realistic since it mimics how real applications exercise memory bandwidth, for example.
- Add a new checksum calculation for verifying correctness of results generated by kernel variants. The new algorithm uses a weighting scheme that reduces bias toward later elements in the result arrays, and employs a Kahan sum to reduce error in the summation of many terms (see the sketch after this list).
- Added support for running multiple GPU block size "tunings" of kernels so that experiments can be run to assess how kernel performance depends on block size for different programming models and hardware architectures. By default, the Suite will run all tunings when executed, but a subset of tunings may be chosen at runtime via command line arguments.
- Add DIFFUSION3DPA kernel, which is a high-order FEM kernel that stresses shared memory usage.
- Add NODAL_ACCUMULATION_3D and DAXPY_ATOMIC kernels which exercise atomic operations in cases with few or unlikely collisions.
- Add REDUCE_STRUCT kernel, which tests compilers' ability to optimize load operations when using data arrays accessed through pointer members of a struct.
- Add REDUCE_SUM kernel so we can more easily compare reduction implementations.
- Add SCAN, INDEXLIST, and INDEXLIST_3LOOP kernels that include scan operations, and operations to create lists of indices based on where a condition is satisfied by elements of a vector (common type of operation used in mesh-based physics codes).
- Following improvements in RAJA, removed unused execution policies in RAJA "Teams" kernels: DIFFUSION3DPA, MASS3DPA, MAT_MAT_SHARED. Kernel implementations are unchanged.
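For reference, the shape of a weighted Kahan (compensated) summation like the checksum described above is sketched below. This is not the RAJAPerf implementation; in particular the weight expression is a placeholder, and the compensation step is the point of the example.

```cpp
#include <cstddef>

// Sketch only: a weighted checksum accumulated with Kahan summation. The
// weight shown is a hypothetical placeholder, not RAJAPerf's actual weighting.
long double weighted_checksum(const double* data, std::size_t len)
{
  long double sum = 0.0L;
  long double c   = 0.0L;                        // running compensation term
  for (std::size_t i = 0; i < len; ++i) {
    long double weight = 1.0L + 1.0L / (i + 1);  // placeholder weighting
    long double y = weight * data[i] - c;        // subtract the error from the new term
    long double t = sum + y;                     // low-order digits of y are lost here...
    c = (t - sum) - y;                           // ...and recovered into c
    sum = t;
  }
  return sum;
}
```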
- Build changes / improvements:
- Updated versions of RAJA and BLT submodules.
- RAJA is at the SHA-1 commit 87a5cac, which is a few commits ahead of the v2022.03.0 release. The post-release changes are used here for CI testing improvements.
- BLT v0.5.0.
See the release documentation for those libraries for details.
- With this release, the RAJA Perf Suite requires C++14 (due to use of RAJA v2022.03.0).
- With this release, the RAJA Perf Suite requires CMake 3.14.5 or newer.
- BLT v0.5.0 includes improved support for ROCm/HIP builds. Although the option CMAKE_HIP_ARCHITECTURES to specify the HIP target architecture is not available until CMake version 3.21, the option is supported in the new BLT version and works with all versions of CMake.
- Bug fixes / improvements:
- Fixed index ordering in GPU variants of HEAT_3D kernel, which was preventing coalesced memory accesses.
- Squashed warnings related to unused variables.
v0.11.0
The release adds new kernels, new features, and resolves some issues. New kernels exercise RAJA features that are not used in pre-existing kernels.
Please download the RAJAPerf-v0.11.0.tar.gz file below. The others will not work due to the way RAJAPerf uses git submodules.
Notable changes include:
- Update RAJA submodule to v0.14.0 release.
- Update BLT submodule to v0.4.1 release (same one used in RAJA v0.14.0)
- New kernels added:
- 'Basic' group: MAT_MAT_SHARED, PI_ATOMIC, PI_REDUCE
- 'Apps' group: HALOEXCHANGE, HALOEXCHANGE_FUSED, MASS3DPA
- New group 'Algorithm' added and kernels in that group: SORT, SORTPAIRS
- New Lambda_CUDA and Lambda_HIP variants added to various kernels to help isolate performance issues when observed.
- The default problem size for all kernels is now ~1M so that it is consistent across all kernels. Please refer to the Suite documentation on the main GitHub page for a discussion of problem size definitions.
- Execution of all GPU kernel variants has been modified (RAJA execution policies, base variant launches) to allow arbitrary problem sizes to be run.
- New runtime options:
- Option to run kernels with a specified size. This makes it easier to run scaling studies with the Suite.
- Option to filter kernels to run based on which RAJA features they use.
- More kernel information output added, such as features, iterations per rep, kernels per rep, bytes per rep, and FLOPs per rep. This and other information is printed to the screen before the Suite is run and is also output to a new CSV report file. Please see Suite documentation on main GitHub page for details.
- Additional warmup kernels enabled to initialize internal RAJA data structures so that initial kernel execution timings are more realistic.
- Error checking for base GPU variants was added to catch launch failures where they occur (see the sketch after this list).
- Compilation of RAJA exercises, examples, and tests is disabled by default. This makes compilation times much faster for users who do not want to build those parts of RAJA. These things can be enabled, if desired, with a CMake option.
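A typical form of the launch error checking mentioned above is sketched below; the macro name is illustrative, not the one RAJAPerf uses.

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Sketch only: report kernel launch failures at the launch site instead of at
// some later, unrelated CUDA call.
#define GPU_ERRCHK(call)                                              \
  do {                                                                \
    cudaError_t err = (call);                                         \
    if (err != cudaSuccess) {                                         \
      std::fprintf(stderr, "GPU error: %s at %s:%d\n",                \
                   cudaGetErrorString(err), __FILE__, __LINE__);      \
      std::exit(EXIT_FAILURE);                                        \
    }                                                                 \
  } while (0)

// Usage after a base variant's launch:
//   my_kernel<<<grid, block>>>(args...);
//   GPU_ERRCHK(cudaGetLastError());       // catches configuration/launch errors
//   GPU_ERRCHK(cudaDeviceSynchronize());  // catches errors during kernel execution
```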
v0.10.0
This release changes the way kernel variants are managed to handle cases where not all kernels implement all variants and where not all variants apply to all kernels. Future releases of the RAJA Performance Suite will include such kernels and variants. The README documentation visible on the main project page describes the new process to add new kernels and variants, which is a fairly minor perturbation to what existed previously.
Please download the RAJAPerf-v0.10.0.tar.gz file below. The others will not work due to the way RAJAPerf uses git submodules.
v0.9.0
This release adds HIP variants (baseline and RAJA) for each kernel in the suite.
Please download the RAJAPerf-v0.9.0.tar.gz file below. The others will not work due to the way RAJAPerf uses git submodules.
v0.8.0
This release updates the RAJA submodule to v0.13.0 and the BLT submodule to match what is used in RAJA and also fixes some issues.
The main changes in this release are:
- Updates to most of the RAJA::kernel execution policies used in nested loop kernels in this suite with newer RAJA usage in which 'Lambda' statements specify which arguments are used in each lambda.
- Fixes to the RAJA OpenMP target back-end allow all OpenMP target kernels in this suite to compile and execute properly.
- Kernel variant fixes and a timing data fix pointed out by individuals who submitted issues.
Please download the RAJAPerf-v0.8.0.tar.gz file below. The others will not work due to the way RAJAPerf uses git submodules.
v0.7.0
This release updates the RAJA submodule to v0.11.0 and the BLT submodule to v0.3.0.
Please download the RAJAPerf-v0.7.0.tar.gz file below. The others will not work due to the way RAJAPerf uses git submodules.
v0.6.0 Release
This release contains two new variants of each kernel and several new kernels.
- The new variants are sequential-lambda and OpenMP-lambda. They do not use RAJA (like the baseline variants), but use lambda expressions for the loop bodies. The hope is that these variants can help isolate performance issues to RAJA internals or to compiler struggles to optimize code containing lambda expressions.
- New kernels appear in the Basic and Lcals kernel subsets.
Please download the RAJAPerf-v0.6.0.tar.gz file below. The others will not work due to the way RAJAPerf uses git submodules.