Releases: LLNL/RAJA
v2024.07.0
This release contains new features, improvements, and bugfixes.
Please download the RAJA-v2024.07.0.tar.gz file below. The others, generated by GitHub, may not work for you due to RAJA's dependencies on git submodules.
Notable changes include:
New features / API changes:
- Added support for a "multi-reduction" operation, which allows users to perform a number of reduction operations determined at run time in a kernel. Please see the RAJA User Guide for details and examples, and the sketch after this list.
- Added first couple of sections for a "RAJA Cookbook" in the RAJA User Guide. The goal is to provide users with more detailed guidance about using RAJA features, choosing execution policies, etc. Additional content will be provided in future releases.
- Added atomicLoad and atomicStore routines for correctness in some use cases.
- Added OpenMP 5.1 implementations for atomicMin and atomicMax.
- Added SYCL reduction support in RAJA::launch.
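
As a sketch of the multi-reduction feature, here is a minimal histogram-style example. It assumes the MultiReduceSum type and seq_multi_reduce policy described in the User Guide (a constructor taking the number of bins, operator[] indexing in the kernel, and get() to read results); consult the Guide for the authoritative interface and the GPU policy variants.

```cpp
#include "RAJA/RAJA.hpp"
#include <vector>

int main()
{
  constexpr int N = 1000;
  const int num_bins = 10;  // number of reductions is a run-time value

  std::vector<double> a(N, 1.0);
  std::vector<int> bin(N);
  for (int i = 0; i < N; ++i) { bin[i] = i % num_bins; }
  double* a_ptr   = a.data();
  int*    bin_ptr = bin.data();

  // One object manages num_bins independent sum reductions.
  RAJA::MultiReduceSum<RAJA::seq_multi_reduce, double> sums(num_bins);

  RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, N), [=](int i) {
    sums[bin_ptr[i]] += a_ptr[i];  // bin is chosen at run time
  });

  double total = 0.0;
  for (int b = 0; b < num_bins; ++b) {
    total += sums.get(b);  // read the final value of each bin
  }
  return (total == static_cast<double>(N)) ? 0 : 1;
}
```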
Build changes/improvements:
- Update camp submodule to the v2024.07.0 release. A version constraint for this release will be added to the RAJA Spack package when it is pushed upstream to Spack.
- Minimum required CMake version bumped to 3.23.
Bug fixes/improvements:
- Fixed a CMake issue for the case when RAJA is used as a submodule dependency.
- Various fixes and improvements to builtin atomic support.
- Fixes and improvements to other atomic operations, including the new atomicLoad and atomicStore routines (a usage sketch follows this list):
  - Modified the HIP and CUDA generic atomic compare-and-swap algorithms to use atomic loads instead of relying on volatile.
  - Re-implemented atomic loads in terms of builtin atomics for CUDA and HIP so that the generic compare-and-swap functions can use them.
  - Removed the volatile qualifier in atomic function signatures.
  - Used cuda::atomic_ref in newer versions of CUDA to back atomicLoad/atomicStore.
  - Used atomicAdd as a fallback for atomicSub in CUDA.
- Removed checks for CUDA_ARCH less than 350, since RAJA requires that as the minimum supported architecture (enforced by a CMake check).
- Fixed issues with naming RAJA forall kernels when using CUDA.
- Fixes in the SYCL back-end for RAJA::launch.
- Fixed some issues in examples.
- Bug fixes and cleanup in parts of the SYCL back-end needed to support several new SYCL kernels that will appear in an upcoming RAJA Performance Suite release.
- Fixed a type naming issue that was exposed by a new version of the Intel oneAPI compiler.
- Fixed an issue in the User Guide documentation about configuring a project using the RAJA CMake configuration.
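
To make the atomic changes above concrete, here is a small sketch of the new atomicLoad/atomicStore routines, assuming they follow the same policy-template, pointer-argument call form as RAJA's existing atomics (e.g., RAJA::atomicAdd). The flag/value publish pattern is an illustrative use case, not taken from the release notes.

```cpp
#include "RAJA/RAJA.hpp"

int main()
{
  int flag  = 0;
  int value = 0;

  // Writer side: publish a value, then set the flag with an atomic
  // store so the write is not torn or elided by the compiler.
  value = 42;
  RAJA::atomicStore<RAJA::seq_atomic>(&flag, 1);

  // Reader side: an atomic load gives a well-defined read of shared
  // data, which a plain load does not guarantee.
  int lo = 100;
  if (RAJA::atomicLoad<RAJA::seq_atomic>(&flag) == 1) {
    // atomicMin/atomicMax (which gained OpenMP 5.1 implementations)
    // use the same call form:
    RAJA::atomicMin<RAJA::seq_atomic>(&lo, value);
  }
  return (lo == 42) ? 0 : 1;
}
```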
v2024.02.2
This release contains a bugfix and new execution policies that improve performance for GPU kernels with reductions.
Please download the RAJA-v2024.02.2.tar.gz file below. The others, generated by GitHub, may not work for you due to RAJA's dependencies on git submodules.
Notable changes include:
New features / API changes:
- RAJA::loop_exec and associated policies (loop_reduce, etc.) have been removed. These were deprecated in an earlier release and type aliased to RAJA::seq_exec, etc., which behave the same as RAJA::loop_exec, etc. did in the past. When you update to this version of RAJA, please change uses of loop_exec to seq_exec in your code (see the sketch after this list).
- New GPU execution policies for CUDA and HIP were added that provide improved performance for GPU kernels with reductions. Please see the RAJA User Guide for more information. Short summary:
  - Option added to change the max grid size in policies that use the occupancy calculator.
  - Policies added to run with max occupancy, with a fraction of the max occupancy, or with a "concretizer", which allows a user to determine how to run based on what the occupancy calculator determines about a kernel.
  - Additional options to tune kernels containing reductions, such as:
    - an option to initialize data on the host for reductions that use atomic operations
    - an option to avoid device-scope memory fences
- Changed the SYCL thread index ordering in RAJA::launch to follow the SYCL "row-major" convention. Please see the RAJA User Guide for more information (a minimal launch kernel shape is sketched at the end of this version's notes).
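
A before/after sketch of the loop_exec removal (the function and array names are placeholders):

```cpp
#include "RAJA/RAJA.hpp"

void scale(double* a, int N)
{
  // Before (no longer compiles in v2024.02.2; loop_exec was removed):
  //
  // RAJA::forall<RAJA::loop_exec>(RAJA::RangeSegment(0, N),
  //   [=](int i) { a[i] *= 2.0; });

  // After: seq_exec now behaves exactly as loop_exec did, i.e., the
  // compiler may apply any optimizations, including vectorization.
  RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, N),
    [=](int i) { a[i] *= 2.0; });
}
```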
Build changes/improvements:
- NONE.
Bug fixes/improvements:
- Fixed an issue in the bump-style allocator used internally in RAJA::launch.
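
For reference, the SYCL index-ordering change above affects nested-loop kernels written with RAJA::launch. Below is a minimal sketch of that kernel shape, shown with the sequential host back-end so it runs anywhere; with a SYCL launch policy, the mapping of the nested loops to work-item dimensions now follows the row-major convention. Bounds and names are placeholders.

```cpp
#include "RAJA/RAJA.hpp"

int main()
{
  constexpr int NI = 4, NJ = 8;
  using launch_pol = RAJA::LaunchPolicy<RAJA::seq_launch_t>;
  using loop_pol   = RAJA::LoopPolicy<RAJA::seq_exec>;

  RAJA::launch<launch_pol>(
    RAJA::LaunchParams(RAJA::Teams(1), RAJA::Threads(1)),
    [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {
      // Outer loop over j, inner loop over i: with a SYCL policy, the
      // thread-index dimensions now map in row-major order.
      RAJA::loop<loop_pol>(ctx, RAJA::RangeSegment(0, NJ), [&](int j) {
        RAJA::loop<loop_pol>(ctx, RAJA::RangeSegment(0, NI), [&](int i) {
          (void)i; (void)j;  // kernel body goes here
        });
      });
    });
  return 0;
}
```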
v2024.02.1
This release contains submodule updates and minor RAJA improvements.
Please download the RAJA-v2024.02.1.tar.gz file below. The others, generated by GitHub, may not work for you due to RAJA's dependencies on git submodules.
Notable changes include:
New features / API changes:
- NONE.
Build changes/improvements:
- Update BLT submodule to v0.6.2 release.
- Update camp submodule to v2024.02.1 release.
Bug fixes/improvements:
- Various changes to quiet compiler warnings in SYCL builds related to deprecated usage.
v2024.02.0
This release contains several RAJA improvements and submodule updates.
Please download the RAJA-v2024.02.0.tar.gz file below. The others, generated by GitHub, may not work for you due to RAJA's dependencies on git submodules.
Notable changes include:
New features / API changes:
- BREAKING CHANGE (ALMOST): The loop_exec and associated policies, such as loop_atomic, loop_reduce, etc., were deprecated in the v2023.06.0 release (please see the release notes for that version for details). Users should replace these with seq_exec and associated policies for sequential CPU execution. The code behavior will be identical to what you observed with loop_exec, etc. However, due to a request from some users with special circumstances, the loop_* policies still exist in this release as type aliases to their seq_* analogues. The loop_* policies will be removed in a future release.
- BREAKING CHANGE: RAJA TBB back-end support has been removed. It was not feature complete, and the TBB API has changed so that the code no longer compiles with newer Intel compilers. Since we know of no project that depends on it, we have removed it.
- An IndexLayout concept was added, which allows for accessing elements of a RAJA View via a collection of indices, using a different indexing strategy along each dimension of a multi-dimensional View. Please see the RAJA User Guide for more information.
- Added support for SYCL reductions using the new RAJA reduction API.
- Added support for the new reduction API for all back-ends in RAJA::launch (see the sketch after this list).
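
A minimal sketch of the new reduction API referenced in the last two items, based on the RAJA::expt::Reduce interface documented in the User Guide. It is shown here with RAJA::forall and the sequential back-end; the same parameter-style reductions are what the SYCL back-end and RAJA::launch now support.

```cpp
#include "RAJA/RAJA.hpp"
#include <vector>

int main()
{
  constexpr int N = 100;
  std::vector<double> a(N, 2.0);
  double* a_ptr = a.data();

  double sum = 0.0;

  // New API: the reduction target is passed as a forall argument and
  // the lambda receives a reference to a local value, instead of
  // declaring a RAJA::ReduceSum object up front.
  RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, N),
    RAJA::expt::Reduce<RAJA::operators::plus>(&sum),
    [=](int i, double& local_sum) {
      local_sum += a_ptr[i];
    });

  return (sum == 2.0 * N) ? 0 : 1;
}
```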
Build changes/improvements:
- Update BLT submodule to v0.6.1 and incorporate its new macros for managing TPL targets in CMake.
- Update camp submodule to v2024.02.0, which contains changes to support ROCm 6.x compilers.
- Update desul submodule to afbd448.
- Replace internal use of HIP and CUDA platform macros with their newer versions to support the latest compilers.
Bug fixes/improvements:
- Change internal memory allocation for HIP to use coarse-grained pinned memory, which improves performance because it can be cached on a device.
- Fix compilation error resulting from incorrect namespacing of an OpenMP execution policy.
- Several fixes to internal implementation of Reducers and Operators.
v2023.06.1
This release contains various smallish RAJA improvements.
Please download the RAJA-v2023.06.1.tar.gz file below. The others, generated by GitHub, may not work for you due to RAJA's dependencies on git submodules.
Notable changes include:
New features / API changes:
- Add compile time block size optimization for new reduction interface.
- Changed default stream usage in WorkGroup constructs to use the stream associated with the default (camp) resource. Previously, RAJA used stream zero. Specifically, this change affects where memory is zeroed (via memset) in the device memory pool and where we get device function pointers for WorkGroup.
Build changes/improvements:
- RAJA_ENABLE_OPENMP_TASK CMake option added to enable/disable algorithm options based on OpenMP task construct. Currently, this only applies to RAJA's OpenMP sort implementation. The default is 'Off'. The option allows users to choose a task implementation if they wish.
Bug fixes/improvements:
- Fix compilation of GPU occupancy calculator and use common types for HIP and CUDA backends in the occupancy calculator, kernel policies, and kernel launch helper routines.
- Fix direct cudaMalloc/hipMalloc calls and memory leaks.
v2023.06.0
This release contains new features to improve GPU kernel performance and some bug fixes. It contains one breaking change described below and an execution policy deprecation also described below. The policy deprecation is not a breaking change in this release, but will result in a breaking change in the next release.
Please download the RAJA-v2023.06.0.tar.gz file below. The others, generated by GitHub, may not work for you due to RAJA's dependencies on git submodules.
Notable changes include:
New features / API changes:
- In this release, the loop_exec execution policy is deprecated and will be removed in the next release. RAJA has had two sequential execution policies for some time: seq_exec and loop_exec. When the seq_exec execution policy was used, RAJA would attach #pragma novector, or similar depending on the compiler, to force strictly sequential execution of a loop; e.g., by preventing the compiler from vectorizing a loop even if it was correct to do so. When the loop_exec policy was specified, the compiler was allowed to apply any optimizations, including SIMD, that its heuristics determined were appropriate. In this release, seq_exec behaves as loop_exec has behaved historically, and the loop_exec and associated policies, such as loop_atomic, loop_reduce, etc., are type aliases to the analogous seq_exec policies. This prevents breaking user code with this release. However, users should prepare to switch from the loop_exec policy variants to the seq_exec policy variants in the future.
- GPU global (thread and block) indexing has been refactored to abstract indexing in a given dimension. As a result, users can now specify a block size or a grid size at compile time, or get those values at run time. You can also ignore blocks and index only with threads, and vice versa. Kernel and launch policies are now shared. Such policies are now multi-part: they contain global indexing information, a way to map global indices such as direct or strided loops, and a synchronization requirement. The synchronization requirement allows one to request that all threads complete even if some have no work, so that a block can be safely synchronized. Aliases have been added for all of the preexisting policies, and some are deprecated in favor of policies named more consistently. One BREAKING CHANGE is that thread loop policies are no longer safe to block synchronize. That feature still exists but can only be accessed with a custom policy. The RAJA User Guide contains descriptions of the new policy mechanics (a compile-time block size example is sketched below).
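
As one concrete instance of the indexing refactor, a compile-time block size can still be requested in the familiar way; the run-time-sized, occupancy-based, and custom global-indexing policies are named in the User Guide. A minimal CUDA sketch, guarded so it only compiles when the CUDA back-end is enabled (saxpy is a placeholder kernel):

```cpp
#include "RAJA/RAJA.hpp"

#if defined(RAJA_ENABLE_CUDA)
void saxpy(double* x, double* y, double a, int N)
{
  // 256 threads per block is fixed at compile time, which lets the
  // back-end generate code specialized for that block size.
  RAJA::forall<RAJA::cuda_exec<256>>(RAJA::RangeSegment(0, N),
    [=] RAJA_DEVICE (int i) {
      y[i] = a * x[i] + y[i];
    });
}
#endif
```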
Build changes/improvements:
- Update BLT submodule to v0.5.3
- Update camp submodule to v2023.06.0
Bug fixes/improvements:
- Fixes a Windows build issue due to macro definition logic in a RAJA header file. Specifically, the macro constant RAJA_COMPILER_MSVC was not getting defined properly when building on a Windows platform using a compiler other than MSVC.
- Kernels using the RAJA OpenMP target back-end were not properly seg faulting when expected to do so. This has been fixed.
- Various improvements, compilation and execution, in RAJA SIMD support.
- Various improvements and additions to RAJA tests to cover more end-user cases.
v2022.10.5
This release fixes an issue that was found after the v2022.10.4 release.
- Fixes the CUDA and HIP separable compilation option that was broken before the v2022.10.0 release. For the curious reader, the issue was that resources were constructed, and called CUDA/HIP API routines, before either runtime was initialized.
Please download the RAJA-v2022.10.5.tar.gz file below. The others, generated by GitHub, may not work for you due to RAJA's dependencies on git submodules.
v2022.10.4
This release fixes a few issues that were found after the v2022.10.3 patch release and updates a few other things.
Please download the RAJA-v2022.10.4.tar.gz file below. The others, generated by GitHub, may not work for you due to RAJA's dependencies on git submodules.
Notable changes include:
- Fixes device alignment bug in workgroups which led to missing symbol errors with the AMD clang compiler.
v2022.10.3
This release fixes a few issues that were found after the v2022.10.2 patch release and updates a few other things.
Please download the RAJA-v2022.10.3.tar.gz file below. The others, generated by GitHub, may not work for you due to RAJA's dependencies on git submodules.
Notable changes include:
- Update camp submodule to v2022.10.1.
- Update BLT submodule to commit 8c229991 (includes fixes for crayftn + hip).
- Properly export 'roctx' target when CMake variable RAJA_ENABLE_ROCTX is on.
- Fix CMake logic for exporting desul targets when desul atomics are enabled.
- Fix the way we use CMake to find the rocPRIM module to follow CMake best practices.
- Add missing template parameter pack argument in RAJA::statement::For execution policy construct used in RAJA::kernel implementation for OpenMP target back-end.
- Change to use compile-time GPU thread block size in RAJA::forall implementation. This improves performance of GPU kernels, especially those using the RAJA HIP back-end.
- Added RAJA plugin support, including CHAI support, for RAJA::launch.
- Replaced 'DEVICE' macro with alias to 'device_mem_pool_t' to prevent name conflicts with other libraries.
- Updated User Guide documentation about CMake variable used to pass compiler flags for OpenMP target back-end. This changed with the CMake minimum required version bump in v2022.10.0.
- Adjust ordering of BLT and camp target inclusion in RAJA CMake usage to fix an issue with projects using external camp vs. the RAJA submodule.
v2022.10.2
This release fixes a few issues that were found after the v2022.10.1 patch release and updates a few things. Sorry for the churn, folks.
Please download the RAJA-v2022.10.2.tar.gz file below. The others, generated by GitHub, may not work for you due to RAJA's dependencies on git submodules.
Notable changes include:
- Update desul submodule to commit e4b65e00.
- The CUDA compute architecture must now be set using the 'CMAKE_CUDA_ARCHITECTURES' CMake variable; for example, pass '-DCMAKE_CUDA_ARCHITECTURES=70' to CMake for the 'sm_70' architecture. Using '-DCUDA_ARCH=sm_*' will no longer do the right thing. Please see the RAJA User Guide for more information.
- A linking bug was fixed related to usage of the new RAJA::KernelName capability.
- A compilation bug was fixed in the new reduction interface support for OpenMP target offload.
- An issue was fixed in the AVX compiler checking logic for RAJA vectorization intrinsics capabilities.