Develop Stream 2024-10-31 (#421)
* Resolve "hipCUB compilation fails with latest rocPRIM changes"

* Rework device_histogram test

* Const-qualify half_t::operator+/*

* Update CUB/Thrust/libcu++ to 2.5.0

* Add tests for large number of items for hipcub::DeviceSelect::If

* Update example_device_radix_sort.cu

* fix: reset error code in device_radix_sort test after out-of-memory error

* Expose DeviceSelect::FlaggedIf

* Add test for DeviceSelect::FlaggedIf

* Add benchmark for DeviceSelect::FlaggedIf

* Set c++ version to 17 and create warning

* Fix ambiguous variable error

* Fix nodiscard warnings

* Set CI tests for both c++14 and 17

* Fix nodiscard warnings in example

* Examples clang-format

* Fixed clang format and dates

* temp fix: wrong error on cuda machines

* Ignore error from hipGetLastError to prevent warning

* Deleted declaration of unnecessary hipError_t result

* Deleted declaration of unnecessary hipError_t result

* Format amending

* Format amending

* Format amending

* Cleanup

* Formatting

* Added wrapper for BindTexture

* Fixed some issues in the test

* Removed unnecessary code

* Mergeback 6.3 fixes (#420)

* Revert Bit Twiddle change from PR #377 (#397)

An update to the TwiddleIn/Out functions from PR #377 seems to be causing
a build failure in onnxruntime. This change reverts the single commit
(0721c2c) that made those changes.
We can re-apply the change with an appropriate fix in the future.
Note: the commits in the PR were squashed, so that commit will not
show up in the log.

* Remove website URL from comments (#398)

Referencing or using code from some websites is prohibited in this repository.
This change removes an informational reference in the comments.

* Add gfx1151 target (#399) (#401)

Co-authored-by: Stanley Tsang <stanley.tsang@amd.com>

* Spolifroni amd/624 changelogcleanup upcoming (#411)

* edited to conform to standards

* edited to conform to standards

* updated the changelog for 6.3 (#418)

---------

Co-authored-by: amd-garydeng <garydeng@amd.com>
Co-authored-by: Stanley Tsang <stanley.tsang@amd.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Add parameter to specify rocPRIM branch to use (#403)

* added link to documentation (#416)

---------

Co-authored-by: Balint Soproni <balint@streamhpc.com>
Co-authored-by: Beatriz Navidad Vilches <beatriz@streamhpc.com>
Co-authored-by: Saiyang Zhang <saiyang@streamhpc.com>
Co-authored-by: Wayne Franz <wayfranz@amd.com>
Co-authored-by: amd-garydeng <garydeng@amd.com>
Co-authored-by: Stanley Tsang <stanley.tsang@amd.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
Co-authored-by: Lauren Wrubleski <Lauren.Wrubleski@amd.com>
9 people authored Nov 20, 2024
1 parent c54ded0 commit a113f98
Showing 48 changed files with 1,771 additions and 550 deletions.
14 changes: 14 additions & 0 deletions .gitlab-ci.yml
@@ -88,6 +88,7 @@ copyright-date:
-D GPU_TARGETS="$GPU_TARGETS"
-D CMAKE_C_COMPILER_LAUNCHER=phc_sccache_c
-D CMAKE_CXX_COMPILER_LAUNCHER=phc_sccache_cxx
-D CMAKE_CXX_STANDARD=14
-B $CI_PROJECT_DIR/rocPRIM/build
-S $CI_PROJECT_DIR/rocPRIM
- cd $CI_PROJECT_DIR/rocPRIM/build
@@ -116,6 +117,7 @@ build:rocm:
-D ROCM_SYMLINK_LIBS=OFF
-D CMAKE_C_COMPILER_LAUNCHER=phc_sccache_c
-D CMAKE_CXX_COMPILER_LAUNCHER=phc_sccache_cxx
-D CMAKE_CXX_STANDARD="$BUILD_VERSION"
-B $CI_PROJECT_DIR/build
-S $CI_PROJECT_DIR
- cmake --build $CI_PROJECT_DIR/build
@@ -133,6 +135,9 @@ build:rocm:
- $CI_PROJECT_DIR/build/hipcub*.zip
- $CI_PROJECT_DIR/build/.ninja_log
expire_in: 2 weeks
parallel:
matrix:
- BUILD_VERSION: [14, 17]

build:rocm-benchmark:
extends:
@@ -153,6 +158,7 @@ build:rocm-benchmark:
-D GPU_TARGETS="$GPU_TARGETS"
-D CMAKE_C_COMPILER_LAUNCHER=phc_sccache_c
-D CMAKE_CXX_COMPILER_LAUNCHER=phc_sccache_cxx
-D CMAKE_CXX_STANDARD=14
-B $CI_PROJECT_DIR/build
-S $CI_PROJECT_DIR
- cmake --build $CI_PROJECT_DIR/build
@@ -176,6 +182,7 @@ test:rocm:
- cd $CI_PROJECT_DIR/build
- cmake
-D CMAKE_PREFIX_PATH=/opt/rocm
-D CMAKE_CXX_STANDARD=14
-P $CI_PROJECT_DIR/cmake/GenerateResourceSpec.cmake
- cat ./resources.json
# Parallel execution (with other AMDGPU processes) can oversubscribe the SDMA queue.
@@ -231,6 +238,7 @@ benchmark:rocm:
-G Ninja
-D CMAKE_CXX_FLAGS="-Wall -Wextra -Werror"
"$GPU_TARGETS_ARG"
-D CMAKE_CXX_STANDARD=14
-S $CI_PROJECT_DIR/test/extra
-B $CI_PROJECT_DIR/build/package_test
- cmake --build $CI_PROJECT_DIR/build/package_test
@@ -251,6 +259,7 @@ benchmark:rocm:
- cmake
-G Ninja
-D BUILD_TEST=OFF
-D CMAKE_CXX_STANDARD=14
-S $CI_PROJECT_DIR
-B $CI_PROJECT_DIR/build_only_install
# Preserve $PATH when sudoing
@@ -318,6 +327,7 @@ build:nvcc:
-D CMAKE_C_COMPILER_LAUNCHER=phc_sccache_c
-D CMAKE_CXX_COMPILER_LAUNCHER=phc_sccache_cxx
-D CMAKE_CUDA_COMPILER_LAUNCHER=phc_sccache_cuda
-D CMAKE_CXX_STANDARD="$BUILD_VERSION"
-B $CI_PROJECT_DIR/build
-S $CI_PROJECT_DIR
- cmake --build $CI_PROJECT_DIR/build
@@ -335,6 +345,9 @@ build:nvcc:
- $CI_PROJECT_DIR/build/hipcub*.zip
- $CI_PROJECT_DIR/build/.ninja_log
expire_in: 2 weeks
parallel:
matrix:
- BUILD_VERSION: [14, 17]

build:nvcc-benchmark:
stage: build
@@ -354,6 +367,7 @@ build:nvcc-benchmark:
-D CMAKE_C_COMPILER_LAUNCHER=phc_sccache_c
-D CMAKE_CXX_COMPILER_LAUNCHER=phc_sccache_cxx
-D CMAKE_CUDA_COMPILER_LAUNCHER=phc_sccache_cuda
-D CMAKE_CXX_STANDARD=14
-B $CI_PROJECT_DIR/build
-S $CI_PROJECT_DIR
- cmake --build $CI_PROJECT_DIR/build
4 changes: 3 additions & 1 deletion CHANGELOG.md
@@ -5,16 +5,18 @@ Full documentation for hipCUB is available at [https://rocm.docs.amd.com/project
## (Unreleased) hipCUB-x.x.x for ROCm 6.4.0

### Added
* Added extended tests to `rtest.py`. These tests are extra tests that did not fit the criteria of smoke and regression tests. These tests will take much longer to run relative to smoke and regression tests. Use `python rtest.py [--emulation|-e|--test|-t]=extended` to run these tests.
* Added regression tests to `rtest.py`. These tests recreate scenarios that have caused hardware problems in past emulation environments. Use `python rtest.py [--emulation|-e|--test|-t]=regression` to run these tests.
* Added `ForEach`, `ForEachN`, `ForEachCopy`, `ForEachCopyN` and `Bulk` functions to have parity with CUB.
* Added the `hipcub::CubVector` type for CUB parity.
* Added `--emulation` option for `rtest.py`
* Unit tests can be run with `[--emulation|-e|--test|-t]=<test_name>`
* Added `DeviceSelect::FlaggedIf` and its inplace overload.

### Changed
* Changed the subset of tests that are run for smoke tests such that the smoke test will complete with faster run-time and to never exceed 2GB of vram usage. Use `python rtest.py [--emulation|-e|--test|-t]=smoke` to run these tests.
* The `rtest.py` options have changed. `rtest.py` is now run with at least either `--test|-t` or `--emulation|-e`, but not both options.
* The NVIDIA backend now requires CUB, Thrust and libcu++ 2.5.0. If it is not found it will be downloaded from the NVIDIA CCCL repository.
* Changed the C++ version from 14 to 17. C++14 will be deprecated in the next major release.
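The `DeviceSelect::FlaggedIf` entry in the list of additions above can be illustrated with a host-side reference. This is a minimal sketch of the selection rule only, not the hipCUB device API: it assumes the rule matches the CUB documentation, where the predicate is applied to each flag and the corresponding input item is copied when the predicate holds. The names `flagged_if_reference` and `is_even` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for the selection rule only; this is NOT the hipCUB API.
// Assumption: an item is kept when select_op(flag) is true for its flag,
// which is how CUB documents DeviceSelect::FlaggedIf.
template <typename T, typename Flag, typename SelectOp>
std::vector<T> flagged_if_reference(const std::vector<T>&    items,
                                    const std::vector<Flag>& flags,
                                    SelectOp                 select_op)
{
    assert(items.size() == flags.size()); // one flag per item
    std::vector<T> selected;
    for(std::size_t i = 0; i < items.size(); ++i)
    {
        if(select_op(flags[i]))
        {
            selected.push_back(items[i]); // relative order is preserved
        }
    }
    return selected;
}

// Hypothetical predicate used for illustration: keep items whose flag is even.
struct is_even
{
    bool operator()(int flag) const
    {
        return flag % 2 == 0;
    }
};
```

With items `{10, 20, 30, 40, 50}` and flags `{1, 2, 3, 4, 5}`, the `is_even` predicate keeps the items at the positions whose flags are 2 and 4. A reference like this can serve as the expected result when testing a device implementation.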

## hipCUB-3.3.0 for ROCm 6.3.0

13 changes: 11 additions & 2 deletions CMakeLists.txt
@@ -1,6 +1,6 @@
# MIT License
#
# Copyright (c) 2017-2023 Advanced Micro Devices, Inc. All rights reserved.
# Copyright (c) 2017-2024 Advanced Micro Devices, Inc. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
@@ -61,10 +61,19 @@ endif()
set(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE CACHE BOOL "Add paths to linker search and installed rpath")

# Set CXX flags
set(CMAKE_CXX_STANDARD 14)
if (NOT DEFINED CMAKE_CXX_STANDARD)
set(CMAKE_CXX_STANDARD 17)
endif()
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF)


if (CMAKE_CXX_STANDARD EQUAL 14)
message(WARNING "C++14 will be deprecated in the next major release")
elseif(NOT CMAKE_CXX_STANDARD EQUAL 17)
message(FATAL_ERROR "Only C++14 and C++17 are supported")
endif()

# rocm-cmake has to be included early so that it's available to set GPU_TARGETS
# If hip is included prior to setting that then it defaults to building only for the current architecture
include(ROCmCMakeBuildToolsDependency)
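The C++ standard handling in the hunk above amounts to a small decision table: default to 17 when nothing is set, warn for 14, accept 17, and reject anything else. The sketch below restates that logic in C++ purely for illustration. The names `effective_standard` and `check_standard` are hypothetical, and using `0` to model "`CMAKE_CXX_STANDARD` not defined" is an assumption of this sketch, not something CMake does.

```cpp
#include <cassert>

// Outcome of the configure-time check, mirroring the message() calls in the hunk.
enum class standard_check
{
    ok,                  // C++17: configure proceeds silently
    deprecation_warning, // C++14: message(WARNING ...)
    fatal_error          // anything else: message(FATAL_ERROR ...)
};

// Assumption of this sketch: requested == 0 stands for "CMAKE_CXX_STANDARD not defined",
// in which case the build falls back to C++17.
inline int effective_standard(int requested)
{
    return requested == 0 ? 17 : requested;
}

inline standard_check check_standard(int requested)
{
    const int standard = effective_standard(requested);
    if(standard == 14)
    {
        return standard_check::deprecation_warning;
    }
    if(standard != 17)
    {
        return standard_check::fatal_error;
    }
    return standard_check::ok;
}
```

Because the default is only applied when the variable is undefined, passing `-D CMAKE_CXX_STANDARD=14` on the command line, as the CI jobs in `.gitlab-ci.yml` do, still selects C++14 and triggers the deprecation warning.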
2 changes: 1 addition & 1 deletion README.md
@@ -48,7 +48,7 @@ python3 -m http.server
* Requires CMake 3.16.9 or later
* For NVIDIA GPUs:
* CUDA Toolkit
* CCCL library (>= 2.4.0)
* CCCL library (>= 2.5.0)
* Automatically downloaded and built by the CMake script
* Requires CMake 3.15.0 or later
* Python 3.6 or higher (for HIP on Windows only; this is only required for install scripts)
6 changes: 3 additions & 3 deletions benchmark/benchmark_device_adjacent_difference.cpp
@@ -178,12 +178,12 @@ void run_benchmark(benchmark::State& state, const std::size_t size, const hipStr
state.SetBytesProcessed(state.iterations() * batch_size * size * sizeof(T));
state.SetItemsProcessed(state.iterations() * batch_size * size);

hipFree(d_input);
HIP_CHECK(hipFree(d_input));
if(copy)
{
hipFree(d_output);
HIP_CHECK(hipFree(d_output));
}
hipFree(d_temp_storage);
HIP_CHECK(hipFree(d_temp_storage));
}

} // namespace
32 changes: 16 additions & 16 deletions benchmark/benchmark_device_run_length_encode.cpp
@@ -111,14 +111,14 @@ void run_encode_benchmark(benchmark::State& state,

for(size_t i = 0; i < batch_size; i++)
{
hipcub::DeviceRunLengthEncode::Encode(d_temporary_storage,
temporary_storage_bytes,
d_input,
d_unique_output,
d_counts_output,
d_runs_count_output,
size,
stream);
HIP_CHECK(hipcub::DeviceRunLengthEncode::Encode(d_temporary_storage,
temporary_storage_bytes,
d_input,
d_unique_output,
d_counts_output,
d_runs_count_output,
size,
stream));
}
HIP_CHECK(hipStreamSynchronize(stream));

@@ -214,14 +214,14 @@ void run_non_trivial_runs_benchmark(benchmark::State& state,

for(size_t i = 0; i < batch_size; i++)
{
hipcub::DeviceRunLengthEncode::NonTrivialRuns(d_temporary_storage,
temporary_storage_bytes,
d_input,
d_offsets_output,
d_counts_output,
d_runs_count_output,
size,
stream);
HIP_CHECK(hipcub::DeviceRunLengthEncode::NonTrivialRuns(d_temporary_storage,
temporary_storage_bytes,
d_input,
d_offsets_output,
d_counts_output,
d_runs_count_output,
size,
stream));
}
HIP_CHECK(hipStreamSynchronize(stream));
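Both benchmark hunks above wrap calls whose return value was previously ignored in `HIP_CHECK`, which matches the "Fix nodiscard warnings" entries in the commit message. The following is a minimal, HIP-free sketch of that pattern, assuming the status type is marked `[[nodiscard]]`. Here `status_t`, `fake_call`, `CHECK_STATUS` and `run_checked` are hypothetical stand-ins, and the real `HIP_CHECK` macro in the hipCUB benchmark utilities may behave differently (this version throws so the failure path is easy to observe).

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Stand-in for a status type such as hipError_t (hypothetical; no HIP headers needed).
// [[nodiscard]] on the enum makes compilers warn when a returned value is dropped.
enum class [[nodiscard]] status_t
{
    success       = 0,
    out_of_memory = 2
};

// Hypothetical API call standing in for hipFree or DeviceRunLengthEncode::Encode.
inline status_t fake_call(bool fail)
{
    return fail ? status_t::out_of_memory : status_t::success;
}

// Sketch of a HIP_CHECK-style macro: evaluate the expression once, consume the
// result, and report a failure instead of silently discarding it.
#define CHECK_STATUS(expr)                                                  \
    do                                                                      \
    {                                                                       \
        const status_t check_status_result = (expr);                        \
        if(check_status_result != status_t::success)                        \
        {                                                                   \
            throw std::runtime_error(std::string("call failed: ") + #expr); \
        }                                                                   \
    } while(0)

// Returns true when the wrapped call succeeded, false when the check fired.
inline bool run_checked(bool fail)
{
    try
    {
        CHECK_STATUS(fake_call(fail)); // result is consumed, so no nodiscard warning
        return true;
    }
    catch(const std::runtime_error&)
    {
        return false;
    }
}
```

Writing `fake_call(false);` as a bare statement would typically produce a discarded-result warning, which `-Werror` builds turn into an error. Routing the call through the macro both silences the warning and surfaces real failures.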
